docs: Add observability PRD and sync pipeline explorer visualization

- prd-observability.md: Product requirements document for the sync pipeline
  observability system, covering structured logging, metrics collection,
  sync run tracking, and robot-mode performance output
- gitlore-sync-explorer.html: Self-contained interactive HTML visualization
  for exploring sync pipeline stage timings and data flow

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# PRD: Observability Infrastructure for lore CLI
**Status:** Draft
**Author:** Taylor + Claude
**Date:** 2026-02-04
---
## 1. Problem Statement
lore currently has minimal observability. Logging is ephemeral (stderr only), there are no persistent log files, no performance metrics, no structured JSON log output, no verbosity controls beyond `RUST_LOG`, and no way to diagnose issues after the fact. When a sync fails at 3 AM in a cron job, or an embedding run takes 10x longer than usual, there is zero forensic data available.
### Current State
| Capability | Status |
|---|---|
| Log destination | stderr only, ephemeral |
| Log persistence | None |
| Structured output | Human-readable fmt only |
| Verbosity control | `RUST_LOG` env var (no CLI flag) |
| Performance metrics | Ad-hoc `Instant::now()` in 2 commands |
| Timing in robot JSON | `elapsed_ms` in search and sync `meta` only |
| Spans / correlation | None |
| Log rotation | None |
| Per-stage timing | None |
| Rate limit / retry visibility | `tracing::warn!` only |
| Error aggregation | None |
| Historical comparison | None |
### What's Already in Place (to build on)
- `tracing` (0.1) + `tracing-subscriber` (0.3) with `env-filter` feature
- Registry-based subscriber initialized in `src/main.rs:44-58` with a single `fmt::layer()` using `SuspendingWriter`
- `SuspendingWriter` (`src/cli/progress.rs:25-73`) that coordinates log output with indicatif `MultiProgress` — buffers each log line, calls `MULTI.suspend()` on drop to clear progress bars before writing to stderr
- `IngestDisplay` struct (`src/cli/commands/ingest.rs:65-104`) controlling UI verbosity with three modes: `interactive()` / `silent()` / `progress_only()`
- Robot mode JSON envelope: `{ "ok": true, "data": {...}, "meta": {...} }` — used consistently in sync, search, sync-status, and doctor commands
- XDG-compliant data directory at `~/.local/share/lore/`
- `sync_runs` table (migration 001) with schema: `id`, `started_at`, `heartbeat_at`, `finished_at`, `status`, `command`, `error`, `metrics_json` — **exists but is never written to** (no INSERT anywhere in the codebase; `sync_status.rs` reads from it but always gets zero rows)
- `uuid` crate (v1, v4 feature) already a dependency
- Structured fields used in tracing calls (e.g., `info!(owner = %self.owner, ...)`)
- `EnvFilter` currently hardcoded: `lore=info` + `warn` default directives
- Global CLI flags in `src/cli/mod.rs:9-43`: `--config`, `--robot`, `-J`, `--color`, `--quiet` (all `global = true`)
---
## 2. Goals
### Primary
1. **Post-mortem debugging**: Any failed or slow run can be diagnosed after the fact from persistent, structured log files.
2. **Performance visibility**: Every sync/ingest/embed/search operation reports granular stage-level timing, both to the terminal and to persistent storage.
3. **Ergonomic verbosity**: Users and agents control log verbosity through CLI flags (`-v`, `-vv`, `-vvv`) without needing to know `RUST_LOG` syntax.
4. **Machine-parseable logs**: A JSON log mode for piping into log aggregators (jq, Datadog, Loki, etc.).
5. **Agent-friendly metrics**: Robot mode JSON output includes comprehensive timing breakdowns for every command, enabling automated monitoring.
### Secondary
6. **Log rotation and retention**: Log files don't grow unbounded; old logs are automatically cleaned up.
7. **Correlation IDs**: Every sync run gets a unique ID that connects log lines, database records, and robot output.
8. **Rate limit and retry transparency**: Every rate-limited request and retry is visible in logs with full context.
9. **Sync history with metrics**: The `sync_runs` table is enriched with per-stage timing, item counts, and error counts so `lore sync-status` becomes a real dashboard.
### Non-Goals
- External telemetry export (OpenTelemetry, Prometheus) -- out of scope for v1.
- Real-time log streaming / tailing UI.
- Alerting or notification systems.
- Distributed tracing across multiple lore instances.
---
## 3. Research Foundation
### 3.1 The Three Pillars of Observability
Academic and industry consensus (Gholamian & Ward 2021, "A Comprehensive Survey of Logging in Software") identifies three pillars:
1. **Logs** -- Discrete events with context. The foundation.
2. **Metrics** -- Numerical measurements over time (counters, gauges, histograms).
3. **Traces** -- Causally ordered spans representing operations.
For a CLI tool (not a long-running service), the mapping is:
| Pillar | CLI Equivalent |
|---|---|
| Logs | Structured log files per invocation |
| Metrics | Per-stage timing, item counts, error counts stored in DB |
| Traces | Span hierarchy within a single invocation (sync -> ingest issues -> fetch page N -> sync discussions) |
### 3.2 Structured Logging Best Practices
From Duan et al. 2025 ("PDLogger: Automated Logging Framework for Practical Software Development") and industry practice:
- **Always structured**: JSON or key=value, never free-form prose in production logs.
- **Contextual fields propagate**: A sync_run_id set at the top level appears in every downstream log line.
- **Levels have semantic meaning**:
- `ERROR`: Operation failed, requires attention.
- `WARN`: Degraded behavior (rate limited, retry, skip).
- `INFO`: Significant state transitions (stage start/complete, items processed).
- `DEBUG`: Detailed operational data (page boundaries, individual API calls).
- `TRACE`: Wire-level detail (request/response bodies, SQL queries).
### 3.3 CLI Verbosity Conventions
From the GNU Coding Standards, POSIX conventions, and modern Rust CLI tools (ripgrep, fd, cargo):
| Pattern | Meaning | Precedent |
|---|---|---|
| (default) | INFO for app, WARN for deps | cargo, rustc |
| `-q` / `--quiet` | Suppress non-error output | ripgrep, fd, cargo |
| `-v` | DEBUG for app | ripgrep, fd |
| `-vv` | DEBUG for app + deps | cargo |
| `-vvv` | TRACE for everything | cargo, curl |
| `RUST_LOG=...` | Fine-grained override | Universal in Rust |
The `-v` flag should feel familiar to anyone who has used cargo, curl, or ssh.
### 3.4 Log File Rotation
`tracing-appender` (from the tokio-rs/tracing ecosystem) provides:
- **Daily rotation**: New file per day, named `lore.2026-02-04.log`.
- **Non-blocking writes**: Dedicated writer thread, zero impact on main async runtime.
- **Configurable retention**: Delete files older than N days.
This is the canonical solution in the Rust tracing ecosystem and requires no custom code.
### 3.5 Performance Metrics for CLI Tools
Inspired by hyperfine's approach to benchmarking and cargo's `--timings` flag:
- Report wall-clock time per stage.
- Report item throughput (items/sec).
- Store historical runs for trend comparison.
- Present timing data in both human-readable and machine-readable formats.
---
## 4. Design
### 4.1 Architecture Overview
```
CLI Invocation
|
+----------+----------+
| |
Interactive Mode Robot Mode
| |
+---stderr (human fmt) stdout (JSON envelope)
| | |
| progress bars { ok, data, meta: {
| colored output elapsed_ms,
| stages: [...],
| run_id
| }}
|
Log Subscribers (layered)
|
+----+----+--------+
| | |
stderr file (future:
(fmt) (JSON) OTLP)
```
### 4.2 Subscriber Stack
Replace the current single-layer subscriber with a layered registry. Each layer has its own filter:
```
registry()
.with(stderr_layer.with_filter(stderr_filter)) // Human-readable, SuspendingWriter, -v controlled
.with(file_layer.with_filter(file_filter)) // JSON, daily rotation, always DEBUG+
```
**stderr layer**: Same `fmt::layer()` as today with `SuspendingWriter`, but level controlled by `-v` flags. When `--log-format json` is passed, this layer switches to `fmt::layer().json()` (same JSON format as file layer, but still routed through `SuspendingWriter` for progress bar coordination).
**file layer**: Always-on JSON output to `~/.local/share/lore/logs/`, daily rotation via `tracing-appender`. Uses its own `EnvFilter` set to `lore=debug,warn` regardless of `-v` flags, ensuring post-mortem data is always available. The file layer does NOT use `SuspendingWriter` — it writes to a file, not stderr, so progress bar coordination is unnecessary.
**Filter architecture**: Per-layer filtering (not a single shared `EnvFilter`) is required because the file layer must always be at DEBUG+ while stderr follows `-v`. `tracing-subscriber`'s `Layer::with_filter()` method enables this.
**`RUST_LOG` override**: When `RUST_LOG` is set, it overrides BOTH layer filters. This is the expert escape hatch.
**Current subscriber** (`src/main.rs:44-58`):
```rust
tracing_subscriber::registry()
.with(
tracing_subscriber::fmt::layer()
.with_target(false)
.with_writer(lore::cli::progress::SuspendingWriter),
)
.with(
EnvFilter::from_default_env()
.add_directive("lore=info".parse().unwrap())
.add_directive("warn".parse().unwrap()),
)
.init();
```
This will be replaced by the dual-layer setup. The `SuspendingWriter` integration and `with_target(false)` on the stderr layer remain unchanged.
### 4.3 Verbosity Levels
#### stderr layer (controlled by `-v` flags)
| Flags | App Level | Dep Level | Behavior |
|---|---|---|---|
| (none) | INFO | WARN | Default. Stage transitions, summaries. |
| `-q` | WARN | ERROR | Errors and warnings only. |
| `-v` | DEBUG | WARN | Detailed app behavior. API pages, skip reasons. |
| `-vv` | DEBUG | INFO | App + dependency detail. HTTP client, SQLite. |
| `-vvv` | TRACE | DEBUG | Everything. Wire-level detail. |
| `RUST_LOG=...` | (overrides all) | (overrides all) | Expert escape hatch. |
Precedence: `RUST_LOG` > `-v` flags > defaults. This matches cargo's behavior.
#### file layer (independent of `-v` flags)
| Condition | App Level | Dep Level |
|---|---|---|
| Always (default) | DEBUG | WARN |
| `RUST_LOG=...` set | (overrides) | (overrides) |
The file layer always captures DEBUG+ for the `lore` crate and WARN+ for dependencies. This ensures post-mortem data is available even when the user ran with default stderr verbosity. `RUST_LOG` overrides both layers when set.
#### New CLI flags
Add to the `Cli` struct (`src/cli/mod.rs`):
```rust
/// Increase log verbosity (-v, -vv, -vvv)
#[arg(short = 'v', long = "verbose", action = clap::ArgAction::Count, global = true)]
pub verbose: u8,
/// Log format for stderr output: text (default) or json
#[arg(long = "log-format", global = true, value_parser = ["text", "json"], default_value = "text")]
pub log_format: String,
```
The `-v` flag uses `clap::ArgAction::Count` to support `-v`, `-vv`, `-vvv` as a single flag with increasing count. The `--log-format` flag controls whether stderr emits human-readable or JSON-formatted log lines.
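The mapping from flag count to filter directives can be sketched as a small pure helper (the name `filter_directives` is hypothetical). The real implementation would hand the resulting string to `EnvFilter::try_new`, and `RUST_LOG`, when set, bypasses this mapping entirely:

```rust
/// Map --quiet and the -v count to an EnvFilter directive string,
/// per the verbosity table in Section 4.3.
fn filter_directives(quiet: bool, verbose: u8) -> String {
    if quiet {
        // -q wins over any -v count: WARN for the app, ERROR for deps.
        return "lore=warn,error".to_string();
    }
    match verbose {
        0 => "lore=info,warn".to_string(),   // default
        1 => "lore=debug,warn".to_string(),  // -v
        2 => "lore=debug,info".to_string(),  // -vv
        _ => "lore=trace,debug".to_string(), // -vvv and beyond
    }
}
```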
### 4.4 Structured Log File Output
**Location**: `~/.local/share/lore/logs/lore.YYYY-MM-DD.log`
**Format**: One JSON object per line (JSONL), produced by `tracing-subscriber`'s `fmt::layer().json()`:
```json
{"timestamp":"2026-02-04T14:32:01.123Z","level":"INFO","target":"lore::ingestion","fields":{"message":"Discussion sync complete","project":"group/repo","issues_synced":42,"elapsed_ms":1234},"span":{"name":"ingest_issues","run_id":"a1b2c3"}}
```
**Rotation**: Daily via `tracing-appender::rolling::daily()`.
**Retention**: Configurable via the `logging.retention_days` config field, default 30 days. Cleanup runs at startup (scan the log directory, delete files older than N days).
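The retention filter can be sketched as pure logic, since the filenames embed ISO-8601 dates that compare correctly as plain strings. The helper name `retention_candidates` is hypothetical; the caller would compute the cutoff from `retention_days` and `fs::remove_file` each candidate:

```rust
/// Decide which files in the log directory are eligible for deletion.
/// `cutoff` is the oldest ISO date to keep (e.g. "2026-01-28"); anything
/// strictly older is returned. Only files matching `lore.YYYY-MM-DD.log`
/// are considered — other files are never touched.
fn retention_candidates(files: &[&str], cutoff: &str) -> Vec<String> {
    files
        .iter()
        .filter_map(|name| {
            let date = name.strip_prefix("lore.")?.strip_suffix(".log")?;
            // ISO-8601 dates sort correctly as plain strings.
            if date.len() == 10 && date < cutoff {
                Some(name.to_string())
            } else {
                None
            }
        })
        .collect()
}
```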
### 4.5 Tracing Spans
Introduce spans for causal correlation within a single invocation:
```
sync (run_id=uuid)
+-- ingest_issues
| +-- fetch_pages (project="group/repo")
| +-- sync_discussions (project="group/repo")
| +-- fetch_resource_events (project="group/repo")
+-- ingest_mrs
| +-- fetch_pages (project="group/repo")
| +-- sync_discussions (project="group/repo")
+-- generate_docs
+-- embed
```
Each span records `elapsed_ms` on close. The `run_id` propagates to all child spans and log events, so `jq 'select(.span.run_id == "a1b2c3")' lore.2026-02-04.log` extracts an entire run.
### 4.6 Performance Metrics
#### 4.6.1 Per-Stage Timing
Every command collects a `Vec<StageTiming>`:
```rust
#[derive(Debug, Clone, Serialize)]
pub struct StageTiming {
pub name: String, // "ingest_issues", "fetch_pages", etc.
#[serde(skip_serializing_if = "Option::is_none")]
pub project: Option<String>, // Which project, if applicable
pub elapsed_ms: u64,
pub items_processed: usize,
#[serde(skip_serializing_if = "is_zero")]
pub items_skipped: usize,
#[serde(skip_serializing_if = "is_zero")]
pub errors: usize,
#[serde(skip_serializing_if = "Vec::is_empty")]
pub sub_stages: Vec<StageTiming>, // Nested child stages
}
```
**Collection mechanism**: Stage timing is materialized from tracing spans, not plumbed manually through function signatures. Phase 2 adds `#[instrument]` spans to each sync stage. Phase 3 adds a custom `tracing-subscriber` layer that records span enter/exit times and structured fields, then extracts the span tree into `Vec<StageTiming>` when the root span closes.
This means:
- No mutable timing collector threaded through `run_ingest` -> `fetch_pages` -> `sync_discussions`
- Spans are the single source of truth for timing
- `StageTiming` is a materialized view of the span tree
- The custom layer implements `on_close` to capture `elapsed` and `on_record` to capture structured fields like `items_processed`
**Where to define**: `src/core/metrics.rs` (new file — genuinely new functionality that doesn't fit in any existing file)
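The tree materialization the custom layer performs can be illustrated as a fold over flat span-close records. The simplified struct and the `(parent_index, name, elapsed_ms)` record shape are assumptions for the sketch — the real `MetricsLayer` keys on span IDs and also carries item counts:

```rust
#[derive(Debug, Clone, PartialEq)]
struct StageTiming {
    name: String,
    elapsed_ms: u64,
    sub_stages: Vec<StageTiming>,
}

/// Fold flat span-close records into the nested tree. Each record's
/// parent_index points at an earlier record (spans open parent-first);
/// None marks the root span.
fn materialize(records: &[(Option<usize>, &str, u64)]) -> Option<StageTiming> {
    let mut nodes: Vec<StageTiming> = records
        .iter()
        .map(|(_, name, ms)| StageTiming {
            name: name.to_string(),
            elapsed_ms: *ms,
            sub_stages: Vec::new(),
        })
        .collect();
    // Walk backwards so every node's own children (always later in the
    // list) are attached before the node is cloned into its parent.
    for i in (0..records.len()).rev() {
        if let Some(p) = records[i].0 {
            let child = nodes[i].clone();
            nodes[p].sub_stages.insert(0, child);
        }
    }
    records
        .iter()
        .position(|(p, _, _)| p.is_none())
        .map(|root| nodes[root].clone())
}
```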
#### 4.6.2 Robot JSON Meta Enhancement
Currently:
```json
{ "ok": true, "data": {...}, "meta": { "elapsed_ms": 1234 } }
```
Proposed:
```json
{
"ok": true,
"data": { ... },
"meta": {
"run_id": "a1b2c3d4",
"elapsed_ms": 45230,
"stages": [
{
"name": "ingest_issues",
"elapsed_ms": 12340,
"items_processed": 150,
"items_skipped": 30,
"errors": 0,
"sub_stages": [
{ "name": "fetch_pages", "project": "group/repo", "elapsed_ms": 5200, "items_processed": 150 },
{ "name": "sync_discussions", "project": "group/repo", "elapsed_ms": 6800, "items_processed": 42, "items_skipped": 108 }
]
},
{
"name": "ingest_mrs",
"elapsed_ms": 18900,
"items_processed": 85,
"items_skipped": 12,
"errors": 1
},
{ "name": "generate_docs", "elapsed_ms": 8500, "items_processed": 235 },
{ "name": "embed", "elapsed_ms": 5490, "items_processed": 1024 }
]
}
}
```
#### 4.6.3 Sync History Enrichment
**Prerequisite bug fix**: The `sync_runs` table (migration 001) exists with columns `id`, `started_at`, `heartbeat_at`, `finished_at`, `status`, `command`, `error`, `metrics_json` — but **no code ever writes to it**. The `sync_status.rs` command reads from it but always gets zero rows. This must be fixed before enrichment.
**Step 1: Wire up sync_runs lifecycle** (prerequisite, in Phase 4)
Add INSERT/UPDATE calls to the sync and ingest command handlers:
```rust
// On sync/ingest start:
INSERT INTO sync_runs (started_at, heartbeat_at, status, command)
VALUES (?now_ms, ?now_ms, 'running', ?command_name)
RETURNING id;
// On sync/ingest success:
UPDATE sync_runs
SET finished_at = ?now_ms, status = 'succeeded', metrics_json = ?metrics
WHERE id = ?run_id;
// On sync/ingest failure:
UPDATE sync_runs
SET finished_at = ?now_ms, status = 'failed', error = ?error_msg, metrics_json = ?metrics
WHERE id = ?run_id;
```
**Where**: Add a `SyncRunRecorder` helper in `src/core/db.rs` or `src/core/sync_run.rs` that encapsulates the INSERT/UPDATE lifecycle. Called from `run_sync()` in `src/cli/commands/sync.rs` and `run_ingest()` in `src/cli/commands/ingest.rs`.
**Step 2: Schema migration** (migration 014)
Add dedicated queryable columns alongside the existing `metrics_json`:
```sql
-- Migration 014: sync_runs enrichment for observability
ALTER TABLE sync_runs ADD COLUMN run_id TEXT;
ALTER TABLE sync_runs ADD COLUMN total_items_processed INTEGER DEFAULT 0;
ALTER TABLE sync_runs ADD COLUMN total_errors INTEGER DEFAULT 0;
-- Index for correlation queries
CREATE INDEX idx_sync_runs_run_id ON sync_runs(run_id);
```
The existing `metrics_json` column stores the detailed `Vec<StageTiming>` as a JSON array. No need for a separate `stages_json` column.
**Step 3: Enhanced sync-status display**
`lore sync-status` (`src/cli/commands/sync_status.rs`) currently shows only the last run. Enhance to show recent runs with metrics:
```
Recent sync runs:
Run a1b2c3 | 2026-02-04 14:32 | 45.2s | 235 items | 1 error
Run d4e5f6 | 2026-02-03 14:30 | 38.1s | 220 items | 0 errors
Run g7h8i9 | 2026-02-02 14:29 | 42.7s | 228 items | 0 errors
```
Robot mode (`lore --robot sync-status`):
```json
{
"ok": true,
"data": {
"runs": [
{
"run_id": "a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d",
"started_at": "2026-02-04T14:32:01.123Z",
"elapsed_ms": 45230,
"status": "succeeded",
"command": "sync",
"total_items_processed": 235,
"total_errors": 1,
"stages": [...]
}
],
"cursors": [...],
"summary": {...}
}
}
```
The `stages` array is parsed from `metrics_json` and included in the robot output. Interactive mode shows the summary table above; `lore --robot sync-status --run a1b2c3` shows a single run's full stage breakdown.
#### 4.6.4 Human-Readable Timing
At the end of `lore sync` (interactive mode), print a timing summary:
```
Sync complete in 45.2s
Ingest issues .... 12.3s (150 items, 42 discussions)
Ingest MRs ....... 18.9s (85 items, 1 error)
Generate docs .... 8.5s (235 documents)
Embed ............ 5.5s (1024 chunks)
```
Gated behind `display.show_text` so it doesn't appear in progress_only or silent modes.
### 4.7 Rate Limit and Retry Transparency
Currently, rate limits emit a `tracing::warn!`. Enhance to:
- Log at INFO level (not just WARN) with structured fields: `info!(path, attempt, retry_after_secs, "Rate limited, retrying")`.
- Count total rate-limit hits per run and include in stage timing.
- In `-v` mode, show retry progress on stderr: ` Retrying /api/v4/projects/123/issues (429, waiting 2s)`.
### 4.8 Configuration
Add a new `logging` section to `Config` (`src/core/config.rs`):
```rust
#[derive(Debug, Clone, Deserialize)]
pub struct LoggingConfig {
/// Directory for log files. Default: ~/.local/share/lore/logs/
#[serde(default)]
pub log_dir: Option<String>,
/// Days to retain log files. Default: 30. Set to 0 to disable file logging.
#[serde(default = "default_retention_days")]
pub retention_days: u32,
/// Enable JSON log files. Default: true.
#[serde(default = "default_true")]
pub file_logging: bool,
}
fn default_retention_days() -> u32 { 30 }
fn default_true() -> bool { true }
```
Add to the `Config` struct:
```rust
#[serde(default)]
pub logging: LoggingConfig,
```
With `config.json`:
```json
{
"logging": {
"log_dir": null,
"retention_days": 30,
"file_logging": true
}
}
```
Defaults are sane so existing configs continue working with zero changes.
**CLI flags** (added to `Cli` struct in `src/cli/mod.rs`):
| Flag | Type | Default | Description |
|---|---|---|---|
| `-v` / `--verbose` | count (u8) | 0 | Increase stderr log verbosity. Stacks: `-v`, `-vv`, `-vvv`. |
| `--log-format` | text \| json | text | Stderr log format. `json` emits one JSON object per log line (same schema as file layer). |
These are global flags (`global = true`) consistent with the existing `--quiet`, `--robot`, etc.
---
## 5. Implementation Plan
### Phase 1: Verbosity Flags + Structured File Logging
**Scope**: CLI flags, dual-layer subscriber, file logging, rotation, retention, `--log-format`.
**Files touched**:
- `Cargo.toml` — add `tracing-appender` dependency
- `src/cli/mod.rs` — add `-v`/`--verbose` (count) and `--log-format` flags to `Cli` struct
- `src/main.rs` — replace subscriber initialization (lines 44-58) with dual-layer setup
- `src/core/config.rs` — add `LoggingConfig` struct and `logging` field to `Config`
- `src/core/paths.rs` — add `get_log_dir()` helper (XDG data dir + `/logs/`)
- `src/cli/commands/doctor.rs` — add log file location and disk usage check
**Implementation steps**:
1. Add `-v` / `--verbose` (count, `u8`) and `--log-format` (text|json) flags to `Cli` struct.
2. Add `tracing-appender` dependency to `Cargo.toml`.
3. Add `LoggingConfig` to `Config` with `#[serde(default)]`.
4. Add `get_log_dir()` to `src/core/paths.rs` (mirrors `get_db_path()` pattern).
5. Replace subscriber init in `main.rs`:
- Build `stderr_filter` from `-v` count (or `RUST_LOG` if set).
- Build `file_filter` as `lore=debug,warn` (or `RUST_LOG` if set).
- stderr layer: `fmt::layer().with_writer(SuspendingWriter)` with `stderr_filter`. When `--log-format json`, chain `.json()`.
- file layer: `fmt::layer().json().with_writer(tracing_appender::rolling::daily(log_dir, "lore"))` with `file_filter`.
- Combine via `registry().with(stderr_layer.with_filter(stderr_filter)).with(file_layer.with_filter(file_filter))`.
6. Implement log retention at startup: scan `log_dir`, delete files matching `lore.*.log` pattern older than `retention_days`. Run before subscriber init so deleted files aren't held open.
7. Add log file check to `lore doctor`: report log directory path, number of log files, total disk usage. In robot mode, add a `logging` field to `DoctorChecks` with `log_dir`, `file_count`, `total_bytes`, `oldest_file`.
**New dependencies**: `tracing-appender` (0.2)
**Interaction with `-q`/`--quiet`**: The existing `--quiet` flag suppresses non-error terminal output via `IngestDisplay::silent()`. It should NOT affect file logging (file layer is always on). When `-q` and `-v` are both passed, `-q` wins for stderr (set stderr filter to WARN+). File layer remains at DEBUG+.
**Tests** (see Section 6.1 for details):
- Unit: `EnvFilter` construction from verbosity count (0→INFO, 1→DEBUG, 2→DEBUG+deps, 3→TRACE)
- Unit: `RUST_LOG` overrides `-v` flags
- Unit: `-q` + `-v` interaction (quiet wins)
- Unit: `LoggingConfig` deserialization with missing/partial/full fields
- Unit: Log retention deletes old files, preserves recent ones
- Integration: Subscriber produces JSON lines to a test file
- Integration: `SuspendingWriter` still works with dual-layer stack (no garbled output)
### Phase 2: Spans + Correlation IDs
**Scope**: Tracing spans, UUID-based `run_id`, span recording for JSON logs.
**Depends on**: Phase 1 (subscriber must support span recording).
**Files touched**:
- `src/cli/commands/sync.rs` — add root span with `run_id` field to `run_sync()`
- `src/cli/commands/ingest.rs` — add `#[instrument]` spans to `run_ingest()` and its stages
- `src/ingestion/orchestrator.rs` — add spans for `fetch_pages`, `sync_discussions`, `fetch_resource_events`
- `src/documents/regenerator.rs` — add span for `generate_docs` stage
- `src/embedding/pipeline.rs` — add span for `embed` stage
- `src/main.rs` — generate `run_id` before calling command handler, pass as field
**Implementation steps**:
1. Generate `run_id` from the first 8 characters of `Uuid::new_v4().to_string()` at command entry in `main.rs`. No new dependency needed — `uuid` v1 with the v4 feature is already in `Cargo.toml`.
2. Create root span: `let _root = tracing::info_span!("sync", run_id = %run_id).entered();` (or equivalent for each command).
3. Add `#[instrument(skip_all, fields(stage = "ingest_issues"))]` to ingest stages.
4. Add `#[instrument(skip_all, fields(project = %project_path))]` to per-project functions.
5. Ensure the file layer's JSON formatter includes span context. `tracing-subscriber`'s `fmt::layer().json()` includes the current span chain by default when the registry has span storage enabled.
6. Verify: parse a log file, confirm every line includes `span.run_id`.
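The truncation in step 1, plus the validation the Phase 2 tests require, can be sketched as a small helper (the name `short_run_id` is hypothetical; the real code would call `Uuid::new_v4().to_string()` first):

```rust
/// Truncate a hyphenated UUIDv4 string to the 8-character run_id,
/// validating that the prefix is pure hex.
fn short_run_id(full: &str) -> Option<String> {
    let id: String = full.chars().take(8).collect();
    if id.len() == 8 && id.chars().all(|c| c.is_ascii_hexdigit()) {
        Some(id)
    } else {
        None
    }
}
```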
**New dependencies**: None (`uuid` already present).
**Tests**:
- Unit: `run_id` is a valid 8-character hex string
- Integration: Run a sync-like operation with spans, parse JSON log output, verify every line contains `run_id` in span context
- Integration: Nested spans produce correct parent-child relationships in JSON output
### Phase 3: Performance Metrics Collection
**Scope**: `StageTiming` struct, span-to-metrics extraction, robot JSON enrichment, timing summary.
**Depends on**: Phase 2 (spans must exist to extract timing from).
**Files touched**:
- `src/core/metrics.rs` — new file: `StageTiming` struct, `MetricsLayer` (custom tracing layer), span-to-timing extraction
- `src/cli/commands/sync.rs` — consume `Vec<StageTiming>` from `MetricsLayer`, include in `SyncMeta`
- `src/cli/commands/ingest.rs` — same pattern for standalone ingest
- `src/main.rs` — register `MetricsLayer` in the subscriber stack
**Implementation steps**:
1. Define `StageTiming` struct with `sub_stages: Vec<StageTiming>` in `src/core/metrics.rs`.
2. Implement `MetricsLayer` as a custom `tracing_subscriber::Layer`:
- `on_new_span`: Record span ID, name, parent, start time.
- `on_record`: Capture structured fields (`items_processed`, `items_skipped`, `errors`) recorded via `Span::record()`.
- `on_close`: Calculate `elapsed_ms`, build `StageTiming` entry, attach to parent.
- Provide `fn extract_timings(&self, run_id: &str) -> Vec<StageTiming>` to materialize the span tree after the root span closes.
3. Store `MetricsLayer` reference (behind `Arc`) so command handlers can call `extract_timings()` after `run_sync()` completes.
4. Extend `SyncMeta` and `SyncJsonOutput` to include `run_id: String` and `stages: Vec<StageTiming>`.
5. Print human-readable timing summary at end of interactive sync (gated behind `IngestDisplay::show_text`).
**Span field recording**: Sync stages must record item counts as span fields for `MetricsLayer` to capture:
```rust
// Fields must be declared up front (as Empty) — Span::record() on a
// field the span did not declare is silently dropped by tracing.
let span = tracing::info_span!(
    "ingest_issues",
    items_processed = tracing::field::Empty,
    items_skipped = tracing::field::Empty,
);
let _guard = span.enter();
// ... do work ...
span.record("items_processed", count);
span.record("items_skipped", skipped);
```
**Tests**:
- Unit: `StageTiming` serialization matches expected JSON (including nested `sub_stages`)
- Unit: `MetricsLayer` correctly builds span tree from synthetic span events
- Unit: `MetricsLayer` handles spans with no children (leaf stages like `embed`)
- Unit: `MetricsLayer` handles concurrent spans (multiple projects in parallel)
- Integration: `lore --robot sync` output includes `meta.stages` array with correct nesting
- Integration: Interactive sync prints timing summary table to stderr
### Phase 4: Sync History Enrichment
**Scope**: Wire up `sync_runs` INSERT/UPDATE lifecycle, schema migration, enhanced sync-status.
**Depends on**: Phase 3 (needs `Vec<StageTiming>` to store in `metrics_json`).
**Files touched**:
- `migrations/014_sync_runs_enrichment.sql` — new migration: add `run_id`, `total_items_processed`, `total_errors` columns + index
- `src/core/sync_run.rs` — new file: `SyncRunRecorder` struct encapsulating INSERT on start, UPDATE on finish
- `src/cli/commands/sync.rs` — create `SyncRunRecorder` before pipeline, finalize after
- `src/cli/commands/ingest.rs` — same pattern for standalone ingest
- `src/cli/commands/sync_status.rs` — enhance to show recent runs with metrics, parse `metrics_json`
**Implementation steps**:
1. Create migration `014_sync_runs_enrichment.sql`:
```sql
ALTER TABLE sync_runs ADD COLUMN run_id TEXT;
ALTER TABLE sync_runs ADD COLUMN total_items_processed INTEGER DEFAULT 0;
ALTER TABLE sync_runs ADD COLUMN total_errors INTEGER DEFAULT 0;
CREATE INDEX idx_sync_runs_run_id ON sync_runs(run_id);
```
Note: Migration number 014 assumes no other migration is added before this phase. If concurrent work adds migration 014, renumber accordingly.
2. Implement `SyncRunRecorder`:
```rust
pub struct SyncRunRecorder { row_id: i64 }
impl SyncRunRecorder {
    pub fn start(conn: &Connection, command: &str, run_id: &str) -> Result<Self>;
    pub fn succeed(self, conn: &Connection, metrics: &[StageTiming], total_items: usize, total_errors: usize) -> Result<()>;
    pub fn fail(self, conn: &Connection, error: &str, metrics: Option<&[StageTiming]>) -> Result<()>;
}
```
3. In `run_sync()`: create `SyncRunRecorder::start()` before pipeline, call `.succeed()` or `.fail()` after.
4. In `run_ingest()`: same pattern.
5. Enhance `sync_status.rs`:
- Query last N runs (default 10) instead of just the last 1.
- Parse `metrics_json` column to extract stage breakdown.
- Show `run_id`, duration, item counts, error counts in both interactive and robot modes.
- Add `--run <run_id>` flag to `sync-status` for single-run detail view.
**Tests**:
- Unit: `SyncRunRecorder::start` inserts a row with status='running'
- Unit: `SyncRunRecorder::succeed` updates status, sets finished_at, writes metrics_json
- Unit: `SyncRunRecorder::fail` updates status, sets error, sets finished_at
- Unit: Migration 014 applies cleanly on top of migration 013
- Integration: `lore sync` creates a sync_runs row; `lore sync-status` displays it
- Integration: `lore --robot sync-status` JSON includes `runs` array with stage breakdowns
- Integration: Failed sync records error in sync_runs with partial metrics
### Phase 5: Rate Limit + Retry Instrumentation
**Scope**: Enhanced logging in GitLab client, retry counters in stage timing.
**Depends on**: Phase 2 (spans for context), Phase 3 (StageTiming for counters).
**Files touched**:
- `src/gitlab/client.rs` (or wherever the HTTP client with retry logic lives) — add structured fields to retry/rate-limit log events
- `src/core/metrics.rs` — add `rate_limit_hits` and `retries` fields to `StageTiming`
**Implementation steps**:
1. Find the retry/rate-limit handling code (likely in the GitLab HTTP client). Add structured tracing fields:
```rust
info!(
path = %request_path,
attempt = attempt_number,
retry_after_secs = retry_after,
status_code = 429,
"Rate limited, retrying"
);
```
2. Add `rate_limit_hits: usize` and `retries: usize` fields to `StageTiming` (with `#[serde(skip_serializing_if = "is_zero")]`).
3. In `MetricsLayer`, count rate-limit and retry events within each span and include in `StageTiming`.
4. In `-v` mode, the existing stderr layer already shows INFO+ events, so retry activity becomes visible automatically. No additional work needed beyond step 1.
**Tests**:
- Unit: Rate-limit log events include all required structured fields
- Unit: `StageTiming` serialization includes `rate_limit_hits` and `retries` when non-zero, omits when zero
- Integration: Simulate 429 response, verify log line has `path`, `attempt`, `retry_after_secs` fields
- Integration: After simulated retries, `StageTiming` counts match expected values
---
## 6. Acceptance Criteria
### 6.1 Phase 1: Verbosity Flags + Structured File Logging
**Functional criteria**:
- [ ] `lore sync` writes JSON log lines to `~/.local/share/lore/logs/lore.YYYY-MM-DD.log` with zero configuration.
- [ ] `lore -v sync` shows DEBUG-level `lore::*` output on stderr; dependency output stays at WARN.
- [ ] `lore -vv sync` shows DEBUG-level `lore::*` + INFO-level dependency output on stderr.
- [ ] `lore -vvv sync` shows TRACE-level output for everything on stderr.
- [ ] `RUST_LOG=lore::gitlab=trace lore sync` overrides `-v` flags for both stderr and file layers.
- [ ] `lore --log-format json sync` emits JSON-formatted log lines on stderr (same schema as file layer).
- [ ] Log files rotate daily (new file per calendar day).
- [ ] Files matching `lore.*.log` older than `retention_days` are deleted on startup.
- [ ] Existing behavior is unchanged when no new flags are passed (INFO on stderr, human-readable format).
- [ ] `--quiet` suppresses non-error stderr output. `-q` + `-v` together: `-q` wins (stderr at WARN+).
- [ ] `--quiet` does NOT affect file logging (file layer remains at DEBUG+).
- [ ] `lore doctor` reports: log directory path, number of log files, total disk usage in bytes. Robot mode includes a `logging` field in the checks JSON.
- [ ] File layer always logs at DEBUG+ for `lore::*` crate regardless of `-v` flags.
**Test specifications**:
- `test_verbosity_filter_construction`: Given verbosity count 0/1/2/3, assert the resulting `EnvFilter` matches the expected directives table.
- `test_rust_log_overrides_verbose`: Set `RUST_LOG=lore=trace`, pass `-v` (count=1), assert the filter uses TRACE (not DEBUG).
- `test_quiet_overrides_verbose`: Pass `-q` and `-v` together, assert stderr filter is WARN+.
- `test_logging_config_defaults`: Deserialize an empty `{}` JSON as `LoggingConfig`, assert `retention_days=30`, `file_logging=true`, `log_dir=None`.
- `test_logging_config_partial`: Deserialize `{"retention_days": 7}`, assert `file_logging=true` default preserved.
- `test_log_retention_cleanup`: Create temp dir with files named `lore.2026-01-01.log` through `lore.2026-02-04.log`. Run retention with `retention_days=7`. Assert files older than 7 days are deleted, recent files preserved.
- `test_log_retention_ignores_non_log_files`: Create temp dir with `lore.2026-01-01.log` and `other.txt`. Run retention. Assert `other.txt` is NOT deleted.
- `test_json_log_output_format`: Capture file layer output, parse each line as JSON, assert keys: `timestamp`, `level`, `target`, `fields`, `span`.
- `test_suspending_writer_dual_layer`: Run a tracing event with both layers active and a progress bar. Assert no garbled output on stderr (no interleaved progress bar fragments in log lines).
### 6.2 Phase 2: Spans + Correlation IDs
**Functional criteria**:
- [ ] Every log line within a sync run includes `run_id` in the JSON span context.
- [ ] `jq 'select(any(.spans[]?; .run_id != null))' lore.2026-02-04.log` extracts all lines from a run (the `?` skips lines emitted outside any span, which have no `spans` key).
- [ ] Nested spans produce a chain: log lines inside `fetch_pages` include both the `fetch_pages` span and the parent `ingest_issues` span in their span context.
- [ ] `run_id` is an 8-character hex string (truncated UUIDv4).
- [ ] Spans are visible in `-vv` stderr output as bracketed context.
**Test specifications**:
- `test_run_id_format`: Generate 100 run_ids, assert each is 8 chars, all hex characters.
- `test_run_id_uniqueness`: Generate 1000 run_ids, assert no duplicates.
- `test_span_context_in_json_logs`: Run a mock sync with spans, capture JSON log output, parse and verify each line has `spans` array containing `run_id`.
- `test_nested_span_chain`: Create parent span "sync" with child "ingest_issues" with child "fetch_pages". Emit a log event inside "fetch_pages". Assert the JSON log line's span chain includes all three span names.
- `test_span_elapsed_on_close`: Create a span, sleep 10ms, close it. Verify the close event records `elapsed_ms >= 10`.
### 6.3 Phase 3: Performance Metrics Collection
**Functional criteria**:
- [ ] `lore --robot sync` JSON includes `meta.run_id` (string) and `meta.stages` (array).
- [ ] Each stage in `meta.stages` has: `name`, `elapsed_ms`, `items_processed`.
- [ ] Top-level stages (ingest_issues, ingest_mrs, generate_docs, embed) have `sub_stages` arrays.
- [ ] Sub-stages include `project` field when applicable.
- [ ] `lore sync` (interactive) prints a timing summary table on stderr, gated behind `IngestDisplay::show_text`.
- [ ] `lore -q sync` does NOT print the timing summary.
- [ ] Zero-value fields (`items_skipped: 0`, `errors: 0`) are omitted from JSON output.
**Test specifications**:
- `test_stage_timing_serialization`: Create a `StageTiming` with sub_stages, serialize to JSON, assert structure matches PRD example.
- `test_stage_timing_zero_fields_omitted`: Create `StageTiming` with `errors: 0`, serialize, assert no `errors` key in output.
- `test_metrics_layer_single_span`: Create `MetricsLayer`, enter/exit one span with recorded fields, extract timings, assert one `StageTiming` entry.
- `test_metrics_layer_nested_spans`: Create parent + child spans, extract timings, assert parent has child in `sub_stages`.
- `test_metrics_layer_parallel_spans`: Create two sibling spans (simulating two projects), extract timings, assert both appear as sub_stages of parent.
- `test_sync_meta_includes_stages`: Mock a sync pipeline, verify robot JSON output parses correctly with `meta.stages`.
- `test_timing_summary_format`: Capture stderr during interactive sync, verify timing table format matches PRD example.
### 6.4 Phase 4: Sync History Enrichment
**Functional criteria**:
- [ ] `lore sync` creates a row in `sync_runs` with status='running' at start, updated to 'succeeded'/'failed' at finish.
- [ ] `lore ingest issues` also creates a `sync_runs` row.
- [ ] `sync_runs.run_id` matches the `run_id` in log files and robot JSON.
- [ ] `sync_runs.metrics_json` contains the serialized `Vec<StageTiming>`.
- [ ] `sync_runs.total_items_processed` and `total_errors` are populated.
- [ ] `lore sync-status` shows the last 10 runs with: run_id, timestamp, duration, item count, error count.
- [ ] `lore --robot sync-status` JSON includes `runs` array with `stages` parsed from `metrics_json`.
- [ ] Failed syncs record the error message and any partial metrics collected before failure.
- [ ] Migration 014 applies cleanly and is idempotent (safe to re-run).
**Test specifications**:
- `test_sync_run_recorder_start`: Call `start()`, query sync_runs, assert one row with status='running'.
- `test_sync_run_recorder_succeed`: Call `start()` then `succeed()`, assert row has status='succeeded', finished_at set, metrics_json parseable.
- `test_sync_run_recorder_fail`: Call `start()` then `fail()`, assert row has status='failed', error set.
- `test_sync_run_recorder_fail_with_partial_metrics`: Call `start()`, collect some metrics, then `fail()`. Assert metrics_json contains partial data.
- `test_migration_014_applies`: Apply all migrations 001-014 on a fresh DB. Assert `sync_runs` has `run_id`, `total_items_processed`, `total_errors` columns.
- `test_migration_014_idempotent`: Apply migration 014 twice. Assert no error on second apply.
- `test_sync_status_shows_runs`: Insert 3 sync_runs rows, run `print_sync_status()`, assert output includes all 3 with correct formatting.
- `test_sync_status_json_includes_stages`: Insert a sync_runs row with metrics_json, run robot-mode sync-status, parse JSON, assert `runs[0].stages` is an array.
### 6.5 Phase 5: Rate Limit + Retry Instrumentation
**Functional criteria**:
- [ ] Rate-limit events (HTTP 429) log at INFO with structured fields: `path`, `attempt`, `retry_after_secs`, `status_code`.
- [ ] Retry events (non-429 transient errors) log with: `path`, `attempt`, `error`.
- [ ] `StageTiming` includes `rate_limit_hits` and `retries` counts (omitted when zero).
- [ ] `lore -v sync` shows retry activity on stderr (visible because it's INFO+).
- [ ] Rate limit counts are included in `metrics_json` stored in `sync_runs`.
**Test specifications**:
- `test_rate_limit_log_fields`: Simulate a 429 response, capture log output, parse JSON, assert fields: `path`, `attempt`, `retry_after_secs`, `status_code`.
- `test_retry_log_fields`: Simulate a transient error + retry, capture log, assert fields: `path`, `attempt`, `error`.
- `test_stage_timing_rate_limit_counts`: Simulate 3 rate-limit hits within a span, extract `StageTiming`, assert `rate_limit_hits == 3`.
- `test_stage_timing_retry_counts`: Simulate 2 retries, extract `StageTiming`, assert `retries == 2`.
- `test_rate_limit_fields_omitted_when_zero`: Create `StageTiming` with zero rate limits, serialize, assert no `rate_limit_hits` key.
---
## 7. Resolved Decisions
1. **Log format**: Use `tracing-subscriber`'s built-in JSON formatter (`fmt::layer().json()`). Zero custom code, battle-tested, and ecosystem tools (Grafana Loki, Datadog) already parse this format. The schema difference from our robot JSON envelope is cosmetic and not worth the maintenance burden of a custom formatter.
2. **Span recording**: Always-on. lore is I/O-bound (GitLab API + SQLite), so the nanosecond-level overhead of span storage and chain lookup is unmeasurable against our millisecond-scale operations. Conditional recording would add subscriber construction complexity for zero practical benefit.
3. **Log file location**: `~/.local/share/lore/logs/` (XDG data directory). Logs are NOT reproducible — you can generate new logs, but you cannot regenerate the exact diagnostic output from a past run. They are forensic artifacts that users would notice missing, so they belong in data, not cache.
4. **Retention**: In scope for Phase 1. Startup cleanup: scan log directory, delete files matching `lore.*.log` older than `retention_days` (default 30). Simple, no background threads, no external dependencies. Runs before subscriber initialization so deleted file handles aren't held.
5. **Stage timing granularity**: Per-project with nested sub-stages. When one project has 500 MRs and another has 3, knowing which one consumed the time budget is the difference between "sync was slow" and actionable diagnosis. The `StageTiming` struct includes an optional `project` field and a `sub_stages: Vec<StageTiming>` field for nesting.
6. **Stage timing collection mechanism**: Materialized from tracing spans, not plumbed manually. A custom `MetricsLayer` in the subscriber stack records span enter/exit/record events and builds the `StageTiming` tree. This avoids threading a mutable collector through every function signature and makes spans the single source of truth for timing data. Phase 2 adds spans; Phase 3 adds the layer that reads them.
7. **run_id format**: First 8 characters of `Uuid::new_v4().to_string()` (e.g., `"a1b2c3d4"`). The `uuid` crate (v1, v4 feature) is already a dependency. No new crate needed. 8 characters provide ~4 billion unique values — more than sufficient for local CLI invocations.
8. **File log level**: Always DEBUG+ for `lore::*` crate, WARN+ for dependencies, regardless of `-v` flags. This ensures post-mortem data is always richer than what was shown on stderr. `RUST_LOG` overrides both layers when set.
9. **sync_runs lifecycle**: The table exists (migration 001) but nothing writes to it. Phase 4 wires up the INSERT (on start) / UPDATE (on finish) lifecycle AND adds enrichment columns in a single migration. The existing `metrics_json` column stores the detailed `Vec<StageTiming>` array — no need for a separate `stages_json` column.
10. **JSON stderr via --log-format**: A `--log-format text|json` global flag controls stderr log format. Default is `text` (human-readable). When `json`, stderr uses the same JSON formatter as the file layer, routed through `SuspendingWriter` for progress bar coordination. This enables `lore sync 2>&1 | jq` workflows without reading log files.
---
## 8. Phase Dependency Graph
```
Phase 1 (Subscriber + Flags)
|
v
Phase 2 (Spans + run_id)
|
+------+------+
| |
v v
Phase 3 Phase 5
(Metrics) (Rate Limit Logging)
| |
v |
Phase 4 |
(Sync History) <--+
```
**Parallelization opportunities**:
- Phase 1 must complete before anything else.
- Phase 2 must complete before Phase 3 or Phase 5.
- Phase 3 and Phase 5 can run in parallel (Phase 5 only needs spans from Phase 2, not MetricsLayer from Phase 3).
- Phase 4 depends on Phase 3 (needs `Vec<StageTiming>` to store). Phase 5's `rate_limit_hits`/`retries` fields on `StageTiming` can be added to Phase 4's stored data after Phase 5 completes, or Phase 4 can store them as zero initially.
**Agent assignment suggestion**:
- Agent A: Phase 1 → Phase 2 (sequential, foundational infrastructure)
- Agent B: Phase 3 (after Phase 2 completes)
- Agent C: Phase 5 (after Phase 2 completes, parallel with Phase 3)
- Agent B or D: Phase 4 (after Phase 3 completes)
---
## 9. References
- Gholamian, S. & Ward, P. (2021). "A Comprehensive Survey of Logging in Software." arXiv:2110.12489.
- Duan, S. et al. (2025). "PDLogger: Automated Logging Framework for Practical Software Development." arXiv:2507.19951.
- tokio-rs/tracing ecosystem: `tracing`, `tracing-subscriber`, `tracing-appender`.
- GNU Coding Standards: Verbosity and diagnostic output conventions.
- Rust CLI Working Group: Recommendations for error reporting and verbosity.