docs: Add observability PRD and sync pipeline explorer visualization

- prd-observability.md: Product requirements document for the sync pipeline observability system, covering structured logging, metrics collection, sync run tracking, and robot-mode performance output
- gitlore-sync-explorer.html: Self-contained interactive HTML visualization for exploring sync pipeline stage timings and data flow

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# PRD: Observability Infrastructure for lore CLI

**Status:** Draft
**Author:** Taylor + Claude
**Date:** 2026-02-04

---

## 1. Problem Statement
lore currently has minimal observability. Logging is ephemeral (stderr only), there are no persistent log files, no performance metrics, no structured JSON log output, no verbosity controls beyond `RUST_LOG`, and no way to diagnose issues after the fact. When a sync fails at 3 AM in a cron job, or an embedding run takes 10x longer than usual, there is zero forensic data available.
### Current State

| Capability | Status |
|---|---|
| Log destination | stderr only, ephemeral |
| Log persistence | None |
| Structured output | Human-readable fmt only |
| Verbosity control | `RUST_LOG` env var (no CLI flag) |
| Performance metrics | Ad-hoc `Instant::now()` in 2 commands |
| Timing in robot JSON | `elapsed_ms` in search and sync `meta` only |
| Spans / correlation | None |
| Log rotation | None |
| Per-stage timing | None |
| Rate limit / retry visibility | `tracing::warn!` only |
| Error aggregation | None |
| Historical comparison | None |
### What's Already in Place (to build on)

- `tracing` (0.1) + `tracing-subscriber` (0.3) with `env-filter` feature
- Registry-based subscriber initialized in `src/main.rs:44-58` with a single `fmt::layer()` using `SuspendingWriter`
- `SuspendingWriter` (`src/cli/progress.rs:25-73`) that coordinates log output with indicatif `MultiProgress` — buffers each log line, calls `MULTI.suspend()` on drop to clear progress bars before writing to stderr
- `IngestDisplay` struct (`src/cli/commands/ingest.rs:65-104`) controlling UI verbosity with three modes: `interactive()` / `silent()` / `progress_only()`
- Robot mode JSON envelope: `{ "ok": true, "data": {...}, "meta": {...} }` — used consistently in the sync, search, sync-status, and doctor commands
- XDG-compliant data directory at `~/.local/share/lore/`
- `sync_runs` table (migration 001) with schema: `id`, `started_at`, `heartbeat_at`, `finished_at`, `status`, `command`, `error`, `metrics_json` — **exists but is never written to** (no INSERT anywhere in the codebase; `sync_status.rs` reads from it but always gets zero rows)
- `uuid` crate (v1, v4 feature) already a dependency
- Structured fields used in tracing calls (e.g., `info!(owner = %self.owner, ...)`)
- `EnvFilter` currently hardcoded: `lore=info` + `warn` default directives
- Global CLI flags in `src/cli/mod.rs:9-43`: `--config`, `--robot`, `-J`, `--color`, `--quiet` (all `global = true`)
---

## 2. Goals

### Primary

1. **Post-mortem debugging**: Any failed or slow run can be diagnosed after the fact from persistent, structured log files.
2. **Performance visibility**: Every sync/ingest/embed/search operation reports granular stage-level timing, both to the terminal and to persistent storage.
3. **Ergonomic verbosity**: Users and agents control log verbosity through CLI flags (`-v`, `-vv`, `-vvv`) without needing to know `RUST_LOG` syntax.
4. **Machine-parseable logs**: A JSON log mode for piping into log aggregators (jq, Datadog, Loki, etc.).
5. **Agent-friendly metrics**: Robot mode JSON output includes comprehensive timing breakdowns for every command, enabling automated monitoring.

### Secondary

6. **Log rotation and retention**: Log files don't grow unbounded; old logs are automatically cleaned up.
7. **Correlation IDs**: Every sync run gets a unique ID that connects log lines, database records, and robot output.
8. **Rate limit and retry transparency**: Every rate-limited request and retry is visible in logs with full context.
9. **Sync history with metrics**: The `sync_runs` table is enriched with per-stage timing, item counts, and error counts so `lore sync-status` becomes a real dashboard.

### Non-Goals

- External telemetry export (OpenTelemetry, Prometheus) -- out of scope for v1.
- Real-time log streaming / tailing UI.
- Alerting or notification systems.
- Distributed tracing across multiple lore instances.
---

## 3. Research Foundation

### 3.1 The Three Pillars of Observability

Academic and industry consensus (Gholamian & Ward 2021, "A Comprehensive Survey of Logging in Software") identifies three pillars:

1. **Logs** -- Discrete events with context. The foundation.
2. **Metrics** -- Numerical measurements over time (counters, gauges, histograms).
3. **Traces** -- Causally ordered spans representing operations.

For a CLI tool (not a long-running service), the mapping is:

| Pillar | CLI Equivalent |
|---|---|
| Logs | Structured log files per invocation |
| Metrics | Per-stage timing, item counts, error counts stored in DB |
| Traces | Span hierarchy within a single invocation (sync -> ingest issues -> fetch page N -> sync discussions) |
### 3.2 Structured Logging Best Practices

From Duan et al. 2025 ("PDLogger: Automated Logging Framework for Practical Software Development") and industry practice:

- **Always structured**: JSON or key=value, never free-form prose in production logs.
- **Contextual fields propagate**: A `sync_run_id` set at the top level appears in every downstream log line.
- **Levels have semantic meaning**:
  - `ERROR`: Operation failed, requires attention.
  - `WARN`: Degraded behavior (rate limited, retry, skip).
  - `INFO`: Significant state transitions (stage start/complete, items processed).
  - `DEBUG`: Detailed operational data (page boundaries, individual API calls).
  - `TRACE`: Wire-level detail (request/response bodies, SQL queries).
### 3.3 CLI Verbosity Conventions

From the GNU Coding Standards, POSIX conventions, and modern Rust CLI tools (ripgrep, fd, cargo):

| Pattern | Meaning | Precedent |
|---|---|---|
| (default) | INFO for app, WARN for deps | cargo, rustc |
| `-q` / `--quiet` | Suppress non-error output | ripgrep, fd, cargo |
| `-v` | DEBUG for app | ripgrep, fd |
| `-vv` | DEBUG for app + deps | cargo |
| `-vvv` | TRACE for everything | cargo, curl |
| `RUST_LOG=...` | Fine-grained override | Universal in Rust |

The `-v` flag should feel familiar to anyone who has used cargo, curl, or ssh.
### 3.4 Log File Rotation

`tracing-appender` (from the tokio-rs/tracing ecosystem) provides:

- **Daily rotation**: New file per day, named `lore.2026-02-04.log`.
- **Non-blocking writes**: Dedicated writer thread, zero impact on the main async runtime.
- **Configurable retention**: Delete files older than N days.

This is the canonical solution in the Rust tracing ecosystem and requires no custom code.
### 3.5 Performance Metrics for CLI Tools

Inspired by hyperfine's approach to benchmarking and cargo's `--timings` flag:

- Report wall-clock time per stage.
- Report item throughput (items/sec).
- Store historical runs for trend comparison.
- Present timing data in both human-readable and machine-readable formats.

---
## 4. Design

### 4.1 Architecture Overview

```
CLI Invocation
      |
      +---------+----------+
      |                    |
Interactive Mode       Robot Mode
      |                    |
  stderr (human fmt)   stdout (JSON envelope)
      |                    |
  progress bars        { ok, data, meta: {
  colored output           elapsed_ms,
                           stages: [...],
                           run_id
                       }}

Log Subscribers (layered)
      |
      +----+----+--------+
      |         |        |
   stderr     file    (future:
   (fmt)     (JSON)    OTLP)
```
### 4.2 Subscriber Stack

Replace the current single-layer subscriber with a layered registry. Each layer has its own filter:

```
registry()
    .with(stderr_layer.with_filter(stderr_filter))  // Human-readable, SuspendingWriter, -v controlled
    .with(file_layer.with_filter(file_filter))      // JSON, daily rotation, always DEBUG+
```

**stderr layer**: Same `fmt::layer()` as today with `SuspendingWriter`, but level controlled by `-v` flags. When `--log-format json` is passed, this layer switches to `fmt::layer().json()` (same JSON format as the file layer, but still routed through `SuspendingWriter` for progress bar coordination).

**file layer**: Always-on JSON output to `~/.local/share/lore/logs/`, daily rotation via `tracing-appender`. Uses its own `EnvFilter` set to `lore=debug,warn` regardless of `-v` flags, ensuring post-mortem data is always available. The file layer does NOT use `SuspendingWriter` — it writes to a file, not stderr, so progress bar coordination is unnecessary.

**Filter architecture**: Per-layer filtering (not a single shared `EnvFilter`) is required because the file layer must always be at DEBUG+ while stderr follows `-v`. `tracing-subscriber`'s `Layer::with_filter()` method enables this.

**`RUST_LOG` override**: When `RUST_LOG` is set, it overrides BOTH layer filters. This is the expert escape hatch.

**Current subscriber** (`src/main.rs:44-58`):

```rust
tracing_subscriber::registry()
    .with(
        tracing_subscriber::fmt::layer()
            .with_target(false)
            .with_writer(lore::cli::progress::SuspendingWriter),
    )
    .with(
        EnvFilter::from_default_env()
            .add_directive("lore=info".parse().unwrap())
            .add_directive("warn".parse().unwrap()),
    )
    .init();
```

This will be replaced by the dual-layer setup. The `SuspendingWriter` integration and `with_target(false)` on the stderr layer remain unchanged.
### 4.3 Verbosity Levels

#### stderr layer (controlled by `-v` flags)

| Flags | App Level | Dep Level | Behavior |
|---|---|---|---|
| (none) | INFO | WARN | Default. Stage transitions, summaries. |
| `-q` | WARN | ERROR | Errors and warnings only. |
| `-v` | DEBUG | WARN | Detailed app behavior. API pages, skip reasons. |
| `-vv` | DEBUG | INFO | App + dependency detail. HTTP client, SQLite. |
| `-vvv` | TRACE | DEBUG | Everything. Wire-level detail. |
| `RUST_LOG=...` | (overrides all) | (overrides all) | Expert escape hatch. |

Precedence: `RUST_LOG` > `-v` flags > defaults. This matches cargo's behavior.

#### file layer (independent of `-v` flags)

| Condition | App Level | Dep Level |
|---|---|---|
| Always (default) | DEBUG | WARN |
| `RUST_LOG=...` set | (overrides) | (overrides) |

The file layer always captures DEBUG+ for the `lore` crate and WARN+ for dependencies. This ensures post-mortem data is available even when the user ran with default stderr verbosity. `RUST_LOG` overrides both layers when set.

#### New CLI flags

Add to the `Cli` struct (`src/cli/mod.rs`):

```rust
/// Increase log verbosity (-v, -vv, -vvv)
#[arg(short = 'v', long = "verbose", action = clap::ArgAction::Count, global = true)]
pub verbose: u8,

/// Log format for stderr output: text (default) or json
#[arg(long = "log-format", global = true, value_parser = ["text", "json"], default_value = "text")]
pub log_format: String,
```

The `-v` flag uses `clap::ArgAction::Count` to support `-v`, `-vv`, `-vvv` as a single flag with increasing count. The `--log-format` flag controls whether stderr emits human-readable or JSON-formatted log lines.
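The flag-to-filter mapping in the tables above can be sketched as a pure function from flag state to an `EnvFilter` directive string. This is a sketch under assumptions: the function name `stderr_directives` is hypothetical, and the real implementation would hand the returned string to `tracing_subscriber::EnvFilter` (skipping this path entirely when `RUST_LOG` is set).

```rust
/// Map --quiet and the -v count to a directive string for the stderr layer.
/// Hypothetical helper; real code would parse this with EnvFilter::try_new,
/// and RUST_LOG, when set, bypasses this mapping entirely.
fn stderr_directives(quiet: bool, verbose: u8) -> &'static str {
    if quiet {
        // -q wins over -v: warnings for the app, errors for deps.
        return "lore=warn,error";
    }
    match verbose {
        0 => "lore=info,warn",   // default: INFO app, WARN deps
        1 => "lore=debug,warn",  // -v:   DEBUG app
        2 => "lore=debug,info",  // -vv:  DEBUG app + INFO deps
        _ => "lore=trace,debug", // -vvv: TRACE app, DEBUG deps
    }
}

fn main() {
    assert_eq!(stderr_directives(false, 0), "lore=info,warn");
    assert_eq!(stderr_directives(false, 3), "lore=trace,debug");
    assert_eq!(stderr_directives(true, 2), "lore=warn,error");
    println!("ok");
}
```

Keeping this mapping as a pure function makes the "quiet wins" and "-vvv caps at TRACE" rules unit-testable without constructing a subscriber.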
### 4.4 Structured Log File Output

**Location**: `~/.local/share/lore/logs/lore.YYYY-MM-DD.log`

**Format**: One JSON object per line (JSONL), produced by `tracing-subscriber`'s `fmt::layer().json()`:

```json
{"timestamp":"2026-02-04T14:32:01.123Z","level":"INFO","target":"lore::ingestion","fields":{"message":"Discussion sync complete","project":"group/repo","issues_synced":42,"elapsed_ms":1234},"span":{"name":"ingest_issues","run_id":"a1b2c3"}}
```

**Rotation**: Daily via `tracing-appender::rolling::daily()`.

**Retention**: Configurable via the `logging.retention_days` config field, default 30 days. Cleanup runs at startup (scan the directory, delete files older than N days).
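The retention decision can be sketched as a pure function over `(filename, age_in_days)` pairs, which keeps the policy testable without touching the filesystem. This is a sketch: `files_to_delete` is a hypothetical name, and real code would derive each file's age from `std::fs::metadata(..).modified()`.

```rust
/// Decide which rotated log files the startup cleanup should delete.
/// Hypothetical helper; ages would come from std::fs::metadata in practice.
fn files_to_delete(entries: &[(&str, u32)], retention_days: u32) -> Vec<String> {
    entries
        .iter()
        .filter(|(name, age_days)| {
            // Only touch our own rotated logs, never unrelated files.
            name.starts_with("lore.") && name.ends_with(".log") && *age_days > retention_days
        })
        .map(|(name, _)| name.to_string())
        .collect()
}

fn main() {
    let entries = [
        ("lore.2026-01-01.log", 34), // past retention: delete
        ("lore.2026-02-03.log", 1),  // recent: keep
        ("unrelated.txt", 400),      // not ours: keep
    ];
    let doomed = files_to_delete(&entries, 30);
    assert_eq!(doomed, vec!["lore.2026-01-01.log".to_string()]);
    println!("deleting: {:?}", doomed);
}
```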
### 4.5 Tracing Spans

Introduce spans for causal correlation within a single invocation:

```
sync (run_id=uuid)
+-- ingest_issues
|   +-- fetch_pages (project="group/repo")
|   +-- sync_discussions (project="group/repo")
|   +-- fetch_resource_events (project="group/repo")
+-- ingest_mrs
|   +-- fetch_pages (project="group/repo")
|   +-- sync_discussions (project="group/repo")
+-- generate_docs
+-- embed
```

Each span records `elapsed_ms` on close. The `run_id` propagates to all child spans and log events, so `jq 'select(.span.run_id == "a1b2c3")' lore.2026-02-04.log` extracts an entire run.
### 4.6 Performance Metrics

#### 4.6.1 Per-Stage Timing

Every command collects a `Vec<StageTiming>`:

```rust
#[derive(Debug, Clone, Serialize)]
pub struct StageTiming {
    pub name: String,            // "ingest_issues", "fetch_pages", etc.
    #[serde(skip_serializing_if = "Option::is_none")]
    pub project: Option<String>, // Which project, if applicable
    pub elapsed_ms: u64,
    pub items_processed: usize,
    #[serde(skip_serializing_if = "is_zero")]
    pub items_skipped: usize,
    #[serde(skip_serializing_if = "is_zero")]
    pub errors: usize,
    #[serde(skip_serializing_if = "Vec::is_empty")]
    pub sub_stages: Vec<StageTiming>, // Nested child stages
}

// Helper referenced by skip_serializing_if on the counters above.
fn is_zero(n: &usize) -> bool {
    *n == 0
}
```

**Collection mechanism**: Stage timing is materialized from tracing spans, not plumbed manually through function signatures. Phase 2 adds `#[instrument]` spans to each sync stage. Phase 3 adds a custom `tracing-subscriber` layer that records span enter/exit times and structured fields, then extracts the span tree into a `Vec<StageTiming>` when the root span closes.

This means:

- No mutable timing collector threaded through `run_ingest` → `fetch_pages` → `sync_discussions`
- Spans are the single source of truth for timing
- `StageTiming` is a materialized view of the span tree
- The custom layer implements `on_close` to capture `elapsed` and `on_record` to capture structured fields like `items_processed`

**Where to define**: `src/core/metrics.rs` (new file — genuinely new functionality that doesn't fit in any existing file)
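The run-level totals stored later in `sync_runs` (`total_items_processed`, `total_errors`) can be rolled up from this tree. A sketch under assumptions: the struct is a simplified `StageTiming` (serde attributes omitted), `run_totals` is a hypothetical helper, and it sums top-level stages only, on the assumption that each parent already aggregates its children (as in the Section 4.6.2 example, where `ingest_issues` reports the same 150 items as its `fetch_pages` child).

```rust
// Simplified StageTiming for illustration; serde attributes omitted.
#[derive(Debug, Clone)]
struct StageTiming {
    name: String,
    elapsed_ms: u64,
    items_processed: usize,
    errors: usize,
    sub_stages: Vec<StageTiming>,
}

/// Roll top-level stage counters up into run totals. Counts top-level
/// stages only, assuming parents already include their sub_stages' items.
fn run_totals(stages: &[StageTiming]) -> (usize, usize) {
    stages
        .iter()
        .fold((0, 0), |(items, errs), s| (items + s.items_processed, errs + s.errors))
}

fn main() {
    let stages = vec![
        StageTiming {
            name: "ingest_issues".into(),
            elapsed_ms: 12_340,
            items_processed: 150,
            errors: 0,
            sub_stages: vec![StageTiming {
                name: "fetch_pages".into(),
                elapsed_ms: 5_200,
                items_processed: 150, // same items as the parent: not double-counted
                errors: 0,
                sub_stages: vec![],
            }],
        },
        StageTiming {
            name: "ingest_mrs".into(),
            elapsed_ms: 18_900,
            items_processed: 85,
            errors: 1,
            sub_stages: vec![],
        },
    ];
    assert_eq!(run_totals(&stages), (235, 1));
    println!("totals: {:?}", run_totals(&stages));
}
```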
#### 4.6.2 Robot JSON Meta Enhancement

Currently:

```json
{ "ok": true, "data": {...}, "meta": { "elapsed_ms": 1234 } }
```

Proposed:

```json
{
  "ok": true,
  "data": { ... },
  "meta": {
    "run_id": "a1b2c3d4",
    "elapsed_ms": 45230,
    "stages": [
      {
        "name": "ingest_issues",
        "elapsed_ms": 12340,
        "items_processed": 150,
        "items_skipped": 30,
        "errors": 0,
        "sub_stages": [
          { "name": "fetch_pages", "project": "group/repo", "elapsed_ms": 5200, "items_processed": 150 },
          { "name": "sync_discussions", "project": "group/repo", "elapsed_ms": 6800, "items_processed": 42, "items_skipped": 108 }
        ]
      },
      {
        "name": "ingest_mrs",
        "elapsed_ms": 18900,
        "items_processed": 85,
        "items_skipped": 12,
        "errors": 1
      },
      { "name": "generate_docs", "elapsed_ms": 8500, "items_processed": 235 },
      { "name": "embed", "elapsed_ms": 5490, "items_processed": 1024 }
    ]
  }
}
```
#### 4.6.3 Sync History Enrichment

**Prerequisite bug fix**: The `sync_runs` table (migration 001) exists with columns `id`, `started_at`, `heartbeat_at`, `finished_at`, `status`, `command`, `error`, `metrics_json` — but **no code ever writes to it**. The `sync_status.rs` command reads from it but always gets zero rows. This must be fixed before enrichment.

**Step 1: Wire up sync_runs lifecycle** (prerequisite, in Phase 4)

Add INSERT/UPDATE calls to the sync and ingest command handlers:

```sql
-- On sync/ingest start:
INSERT INTO sync_runs (started_at, heartbeat_at, status, command)
VALUES (?now_ms, ?now_ms, 'running', ?command_name)
RETURNING id;

-- On sync/ingest success:
UPDATE sync_runs
SET finished_at = ?now_ms, status = 'succeeded', metrics_json = ?metrics
WHERE id = ?run_id;

-- On sync/ingest failure:
UPDATE sync_runs
SET finished_at = ?now_ms, status = 'failed', error = ?error_msg, metrics_json = ?metrics
WHERE id = ?run_id;
```

**Where**: Add a `SyncRunRecorder` helper in `src/core/db.rs` or `src/core/sync_run.rs` that encapsulates the INSERT/UPDATE lifecycle. Called from `run_sync()` in `src/cli/commands/sync.rs` and `run_ingest()` in `src/cli/commands/ingest.rs`.
**Step 2: Schema migration** (migration 014)

Add dedicated queryable columns alongside the existing `metrics_json`:

```sql
-- Migration 014: sync_runs enrichment for observability
ALTER TABLE sync_runs ADD COLUMN run_id TEXT;
ALTER TABLE sync_runs ADD COLUMN total_items_processed INTEGER DEFAULT 0;
ALTER TABLE sync_runs ADD COLUMN total_errors INTEGER DEFAULT 0;

-- Index for correlation queries
CREATE INDEX idx_sync_runs_run_id ON sync_runs(run_id);
```

The existing `metrics_json` column stores the detailed `Vec<StageTiming>` as a JSON array. No need for a separate `stages_json` column.
**Step 3: Enhanced sync-status display**

`lore sync-status` (`src/cli/commands/sync_status.rs`) currently shows only the last run. Enhance it to show recent runs with metrics:

```
Recent sync runs:
  Run a1b2c3 | 2026-02-04 14:32 | 45.2s | 235 items | 1 error
  Run d4e5f6 | 2026-02-03 14:30 | 38.1s | 220 items | 0 errors
  Run g7h8i9 | 2026-02-02 14:29 | 42.7s | 228 items | 0 errors
```

Robot mode (`lore --robot sync-status`):

```json
{
  "ok": true,
  "data": {
    "runs": [
      {
        "run_id": "a1b2c3d4",
        "started_at": "2026-02-04T14:32:01.123Z",
        "elapsed_ms": 45230,
        "status": "succeeded",
        "command": "sync",
        "total_items_processed": 235,
        "total_errors": 1,
        "stages": [...]
      }
    ],
    "cursors": [...],
    "summary": {...}
  }
}
```

The `stages` array is parsed from `metrics_json` and included in the robot output. Interactive mode shows the summary table above; `lore --robot sync-status --run a1b2c3` shows a single run's full stage breakdown.
#### 4.6.4 Human-Readable Timing

At the end of `lore sync` (interactive mode), print a timing summary:

```
Sync complete in 45.2s
  Ingest issues .... 12.3s (150 items, 42 discussions)
  Ingest MRs ....... 18.9s (85 items, 1 error)
  Generate docs .... 8.5s (235 documents)
  Embed ............ 5.5s (1024 chunks)
```

Gated behind `display.show_text` so it doesn't appear in progress_only or silent modes.
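The dotted alignment in the summary above can be produced by a small formatting helper. A sketch: `format_stage_line` is a hypothetical name, and the 18-character dot column is an assumption chosen to match the mockup, not something the PRD specifies.

```rust
/// Format one timing-summary line, padding the stage label with dots.
/// Hypothetical helper; the column width (18) is an assumed constant.
fn format_stage_line(label: &str, elapsed_s: f64, detail: &str) -> String {
    // Pad so that "label + space + dots" occupies 18 columns.
    let dots = ".".repeat(18usize.saturating_sub(label.len() + 1));
    format!("  {} {} {:.1}s ({})", label, dots, elapsed_s, detail)
}

fn main() {
    println!("{}", format_stage_line("Ingest issues", 12.3, "150 items, 42 discussions"));
    println!("{}", format_stage_line("Embed", 5.5, "1024 chunks"));
    // Labels wider than the column degrade gracefully to zero dots.
    println!("{}", format_stage_line("A very long stage name", 1.0, "n/a"));
}
```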
### 4.7 Rate Limit and Retry Transparency

Currently, rate limits emit a `tracing::warn!`. Enhance to:

- Log at INFO level (not just WARN) with structured fields: `info!(path, attempt, retry_after_secs, "Rate limited, retrying")`.
- Count total rate-limit hits per run and include them in stage timing.
- In `-v` mode, show retry progress on stderr: `  Retrying /api/v4/projects/123/issues (429, waiting 2s)`.
### 4.8 Configuration

Add a new `logging` section to `Config` (`src/core/config.rs`):

```rust
#[derive(Debug, Clone, Deserialize)]
pub struct LoggingConfig {
    /// Directory for log files. Default: ~/.local/share/lore/logs/
    #[serde(default)]
    pub log_dir: Option<String>,

    /// Days to retain log files. Default: 30. Set to 0 to disable file logging.
    #[serde(default = "default_retention_days")]
    pub retention_days: u32,

    /// Enable JSON log files. Default: true.
    #[serde(default = "default_true")]
    pub file_logging: bool,
}

// Required by #[serde(default)] on the Config::logging field below.
impl Default for LoggingConfig {
    fn default() -> Self {
        Self { log_dir: None, retention_days: 30, file_logging: true }
    }
}

fn default_retention_days() -> u32 { 30 }
fn default_true() -> bool { true }
```

Add to the `Config` struct:

```rust
#[serde(default)]
pub logging: LoggingConfig,
```

With `config.json`:

```json
{
  "logging": {
    "log_dir": null,
    "retention_days": 30,
    "file_logging": true
  }
}
```

Defaults are sane, so existing configs continue working with zero changes.

**CLI flags** (added to the `Cli` struct in `src/cli/mod.rs`):

| Flag | Type | Default | Description |
|---|---|---|---|
| `-v` / `--verbose` | count (u8) | 0 | Increase stderr log verbosity. Stacks: `-v`, `-vv`, `-vvv`. |
| `--log-format` | text \| json | text | Stderr log format. `json` emits one JSON object per log line (same schema as the file layer). |

These are global flags (`global = true`), consistent with the existing `--quiet`, `--robot`, etc.
---

## 5. Implementation Plan

### Phase 1: Verbosity Flags + Structured File Logging

**Scope**: CLI flags, dual-layer subscriber, file logging, rotation, retention, `--log-format`.
**Files touched**:

- `Cargo.toml` — add `tracing-appender` dependency
- `src/cli/mod.rs` — add `-v`/`--verbose` (count) and `--log-format` flags to the `Cli` struct
- `src/main.rs` — replace subscriber initialization (lines 44-58) with the dual-layer setup
- `src/core/config.rs` — add `LoggingConfig` struct and `logging` field to `Config`
- `src/core/paths.rs` — add `get_log_dir()` helper (XDG data dir + `/logs/`)
- `src/cli/commands/doctor.rs` — add log file location and disk usage check

**Implementation steps**:

1. Add `-v` / `--verbose` (count, `u8`) and `--log-format` (text|json) flags to the `Cli` struct.
2. Add the `tracing-appender` dependency to `Cargo.toml`.
3. Add `LoggingConfig` to `Config` with `#[serde(default)]`.
4. Add `get_log_dir()` to `src/core/paths.rs` (mirrors the `get_db_path()` pattern).
5. Replace subscriber init in `main.rs`:
   - Build `stderr_filter` from the `-v` count (or `RUST_LOG` if set).
   - Build `file_filter` as `lore=debug,warn` (or `RUST_LOG` if set).
   - stderr layer: `fmt::layer().with_writer(SuspendingWriter)` with `stderr_filter`. When `--log-format json`, chain `.json()`.
   - file layer: `fmt::layer().json().with_writer(tracing_appender::rolling::daily(log_dir, "lore"))` with `file_filter`.
   - Combine via `registry().with(stderr_layer.with_filter(stderr_filter)).with(file_layer.with_filter(file_filter))`.
6. Implement log retention at startup: scan `log_dir`, delete files matching the `lore.*.log` pattern older than `retention_days`. Run before subscriber init so deleted files aren't held open.
7. Add a log file check to `lore doctor`: report the log directory path, number of log files, and total disk usage. In robot mode, add a `logging` field to `DoctorChecks` with `log_dir`, `file_count`, `total_bytes`, `oldest_file`.

**New dependencies**: `tracing-appender` (0.2)

**Interaction with `-q`/`--quiet`**: The existing `--quiet` flag suppresses non-error terminal output via `IngestDisplay::silent()`. It should NOT affect file logging (the file layer is always on). When `-q` and `-v` are both passed, `-q` wins for stderr (set the stderr filter to WARN+). The file layer remains at DEBUG+.

**Tests** (see Section 6.1 for details):

- Unit: `EnvFilter` construction from verbosity count (0→INFO, 1→DEBUG, 2→DEBUG+deps, 3→TRACE)
- Unit: `RUST_LOG` overrides `-v` flags
- Unit: `-q` + `-v` interaction (quiet wins)
- Unit: `LoggingConfig` deserialization with missing/partial/full fields
- Unit: Log retention deletes old files, preserves recent ones
- Integration: Subscriber produces JSON lines to a test file
- Integration: `SuspendingWriter` still works with the dual-layer stack (no garbled output)
### Phase 2: Spans + Correlation IDs

**Scope**: Tracing spans, UUID-based `run_id`, span recording for JSON logs.

**Depends on**: Phase 1 (subscriber must support span recording).

**Files touched**:

- `src/cli/commands/sync.rs` — add root span with `run_id` field to `run_sync()`
- `src/cli/commands/ingest.rs` — add `#[instrument]` spans to `run_ingest()` and its stages
- `src/ingestion/orchestrator.rs` — add spans for `fetch_pages`, `sync_discussions`, `fetch_resource_events`
- `src/documents/regenerator.rs` — add span for the `generate_docs` stage
- `src/embedding/pipeline.rs` — add span for the `embed` stage
- `src/main.rs` — generate `run_id` before calling the command handler, pass it as a field

**Implementation steps**:

1. Generate `run_id` as the first 8 characters of `Uuid::new_v4().to_string()` at command entry in `main.rs`. No new dependency needed — `uuid` v1 with the v4 feature is already in `Cargo.toml`.
2. Create the root span: `let _root = tracing::info_span!("sync", run_id = %run_id).entered();` (or the equivalent for each command).
3. Add `#[instrument(skip_all, fields(stage = "ingest_issues"))]` to ingest stages.
4. Add `#[instrument(skip_all, fields(project = %project_path))]` to per-project functions.
5. Ensure the file layer's JSON formatter includes span context. `tracing-subscriber`'s `fmt::layer().json()` includes the current span chain by default when the registry has span storage enabled.
6. Verify: parse a log file, confirm every line includes `span.run_id`.

**New dependencies**: None (`uuid` already present).

**Tests**:

- Unit: `run_id` is a valid 8-character hex string
- Integration: Run a sync-like operation with spans, parse JSON log output, verify every line contains `run_id` in span context
- Integration: Nested spans produce correct parent-child relationships in JSON output
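The run_id unit test above reduces to a shape check: the first 8 characters of a hyphenated UUIDv4 string are always ASCII hex digits. A sketch, with `is_valid_run_id` as a hypothetical helper name:

```rust
/// Check the short run_id shape from Phase 2, step 1: exactly 8 ASCII hex
/// characters (the first 8 chars of Uuid::new_v4().to_string()).
fn is_valid_run_id(id: &str) -> bool {
    id.len() == 8 && id.chars().all(|c| c.is_ascii_hexdigit())
}

fn main() {
    assert!(is_valid_run_id("a1b2c3d4"));
    assert!(!is_valid_run_id("a1b2c3"));      // too short
    assert!(!is_valid_run_id("a1b2c3d4-e"));  // too long / contains hyphen
    assert!(!is_valid_run_id("a1b2c3dz"));    // non-hex character
    println!("ok");
}
```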
### Phase 3: Performance Metrics Collection

**Scope**: `StageTiming` struct, span-to-metrics extraction, robot JSON enrichment, timing summary.

**Depends on**: Phase 2 (spans must exist to extract timing from).

**Files touched**:

- `src/core/metrics.rs` — new file: `StageTiming` struct, `MetricsLayer` (custom tracing layer), span-to-timing extraction
- `src/cli/commands/sync.rs` — consume `Vec<StageTiming>` from `MetricsLayer`, include in `SyncMeta`
- `src/cli/commands/ingest.rs` — same pattern for standalone ingest
- `src/main.rs` — register `MetricsLayer` in the subscriber stack

**Implementation steps**:

1. Define the `StageTiming` struct with `sub_stages: Vec<StageTiming>` in `src/core/metrics.rs`.
2. Implement `MetricsLayer` as a custom `tracing_subscriber::Layer`:
   - `on_new_span`: Record span ID, name, parent, start time.
   - `on_record`: Capture structured fields (`items_processed`, `items_skipped`, `errors`) recorded via `Span::record()`.
   - `on_close`: Calculate `elapsed_ms`, build the `StageTiming` entry, attach it to its parent.
   - Provide `fn extract_timings(&self, run_id: &str) -> Vec<StageTiming>` to materialize the span tree after the root span closes.
3. Store a `MetricsLayer` reference (behind an `Arc`) so command handlers can call `extract_timings()` after `run_sync()` completes.
4. Extend `SyncMeta` and `SyncJsonOutput` to include `run_id: String` and `stages: Vec<StageTiming>`.
5. Print the human-readable timing summary at the end of an interactive sync (gated behind `IngestDisplay::show_text`).

**Span field recording**: Sync stages must record item counts as span fields for `MetricsLayer` to capture. Note that `tracing` only records fields that were declared when the span was created, so the counters are declared as `Empty` up front:

```rust
let span = tracing::info_span!(
    "ingest_issues",
    items_processed = tracing::field::Empty,
    items_skipped = tracing::field::Empty,
);
let _guard = span.enter();
// ... do work ...
span.record("items_processed", count);
span.record("items_skipped", skipped);
```

**Tests**:

- Unit: `StageTiming` serialization matches expected JSON (including nested `sub_stages`)
- Unit: `MetricsLayer` correctly builds the span tree from synthetic span events
- Unit: `MetricsLayer` handles spans with no children (leaf stages like `embed`)
- Unit: `MetricsLayer` handles concurrent spans (multiple projects in parallel)
- Integration: `lore --robot sync` output includes a `meta.stages` array with correct nesting
- Integration: Interactive sync prints the timing summary table to stderr
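The tree-building part of `MetricsLayer` can be prototyped as a pure function over flat close-order records, which is essentially what the "builds the span tree from synthetic span events" unit test exercises. A sketch under assumptions: `Stage` is a pared-down `StageTiming`, span IDs are plain `u64`s rather than `tracing::span::Id`, and records are assumed to arrive child-first (children always close before their parents).

```rust
use std::collections::HashMap;

// Pared-down StageTiming for illustration.
#[derive(Debug, Clone)]
struct Stage {
    name: String,
    elapsed_ms: u64,
    sub_stages: Vec<Stage>,
}

/// Materialize a nested stage tree from flat (span_id, parent_id, name,
/// elapsed_ms) records in close order. Sketch of what MetricsLayer::on_close
/// accumulates; the real layer keys on tracing span IDs and extensions.
fn build_tree(records: &[(u64, Option<u64>, &str, u64)]) -> Vec<Stage> {
    let mut children: HashMap<u64, Vec<Stage>> = HashMap::new();
    let mut roots = Vec::new();
    // Close order means a span's children are complete before the span itself.
    for &(id, parent, name, elapsed_ms) in records {
        let stage = Stage {
            name: name.to_string(),
            elapsed_ms,
            sub_stages: children.remove(&id).unwrap_or_default(),
        };
        match parent {
            Some(p) => children.entry(p).or_default().push(stage),
            None => roots.push(stage),
        }
    }
    roots
}

fn main() {
    // sync(1) -> ingest_issues(2) -> fetch_pages(3); close order: 3, 2, 1.
    let records = [
        (3, Some(2), "fetch_pages", 5200),
        (2, Some(1), "ingest_issues", 12340),
        (1, None, "sync", 45230),
    ];
    let tree = build_tree(&records);
    assert_eq!(tree.len(), 1);
    assert_eq!(tree[0].name, "sync");
    assert_eq!(tree[0].sub_stages[0].sub_stages[0].name, "fetch_pages");
    println!("{:#?}", tree);
}
```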
### Phase 4: Sync History Enrichment
|
||||
|
||||
**Scope**: Wire up `sync_runs` INSERT/UPDATE lifecycle, schema migration, enhanced sync-status.
|
||||
|
||||
**Depends on**: Phase 3 (needs `Vec<StageTiming>` to store in `metrics_json`).
|
||||
|
||||
**Files touched**:
|
||||
- `migrations/014_sync_runs_enrichment.sql` — new migration: add `run_id`, `total_items_processed`, `total_errors` columns + index
|
||||
- `src/core/sync_run.rs` — new file: `SyncRunRecorder` struct encapsulating INSERT on start, UPDATE on finish
|
||||
- `src/cli/commands/sync.rs` — create `SyncRunRecorder` before pipeline, finalize after
|
||||
- `src/cli/commands/ingest.rs` — same pattern for standalone ingest
|
||||
- `src/cli/commands/sync_status.rs` — enhance to show recent runs with metrics, parse `metrics_json`
|
||||
|
||||

**Implementation steps**:

1. Create migration `014_sync_runs_enrichment.sql`:

   ```sql
   ALTER TABLE sync_runs ADD COLUMN run_id TEXT;
   ALTER TABLE sync_runs ADD COLUMN total_items_processed INTEGER DEFAULT 0;
   ALTER TABLE sync_runs ADD COLUMN total_errors INTEGER DEFAULT 0;
   CREATE INDEX idx_sync_runs_run_id ON sync_runs(run_id);
   ```

   Note: migration number 014 assumes no other migration is added before this phase. If concurrent work claims 014, renumber accordingly.
2. Implement `SyncRunRecorder`:

   ```rust
   pub struct SyncRunRecorder { id: i64, conn: Connection }

   impl SyncRunRecorder {
       pub fn start(conn: Connection, command: &str, run_id: &str) -> Result<Self>;
       pub fn succeed(self, metrics: &[StageTiming], total_items: usize, total_errors: usize) -> Result<()>;
       pub fn fail(self, error: &str, metrics: Option<&[StageTiming]>) -> Result<()>;
   }
   ```
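   The SQL the recorder issues might look like the following sketch — the `command` and `started_at` column names are assumptions about the existing migration-001 schema:

   ```sql
   -- start(): one row per invocation; status tracks the lifecycle
   INSERT INTO sync_runs (run_id, command, status, started_at)
   VALUES (?1, ?2, 'running', CURRENT_TIMESTAMP);

   -- succeed(): finalize with metrics (fail() is analogous, setting
   -- status = 'failed' and the error column instead)
   UPDATE sync_runs
   SET status = 'succeeded',
       finished_at = CURRENT_TIMESTAMP,
       metrics_json = ?1,
       total_items_processed = ?2,
       total_errors = ?3
   WHERE id = ?4;
   ```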
3. In `run_sync()`: call `SyncRunRecorder::start()` before the pipeline, then `.succeed()` or `.fail()` after.
4. In `run_ingest()`: same pattern.
5. Enhance `sync_status.rs`:
   - Query the last N runs (default 10) instead of only the most recent one.
   - Parse the `metrics_json` column to extract the stage breakdown.
   - Show `run_id`, duration, item counts, and error counts in both interactive and robot modes.
   - Add a `--run <run_id>` flag to `sync-status` for a single-run detail view.
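The last-N query behind the enhanced status view could be as simple as the sketch below; column names beyond those added in migration 014 (notably `started_at`/`finished_at`) are assumptions about the existing schema:

```sql
SELECT run_id, started_at, finished_at, status,
       total_items_processed, total_errors, metrics_json
FROM sync_runs
ORDER BY started_at DESC
LIMIT 10;
```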

**Tests**:

- Unit: `SyncRunRecorder::start` inserts a row with status='running'
- Unit: `SyncRunRecorder::succeed` updates the status, sets finished_at, writes metrics_json
- Unit: `SyncRunRecorder::fail` updates the status, sets the error, sets finished_at
- Unit: migration 014 applies cleanly on top of migration 013
- Integration: `lore sync` creates a sync_runs row; `lore sync-status` displays it
- Integration: `lore --robot sync-status` JSON includes a `runs` array with stage breakdowns
- Integration: a failed sync records the error in sync_runs with partial metrics
### Phase 5: Rate Limit + Retry Instrumentation

**Scope**: Enhanced logging in the GitLab client, retry counters in stage timing.

**Depends on**: Phase 2 (spans for context), Phase 3 (`StageTiming` for counters).

**Files touched**:

- `src/gitlab/client.rs` (or wherever the HTTP client with retry logic lives) — add structured fields to retry/rate-limit log events
- `src/core/metrics.rs` — add `rate_limit_hits` and `retries` fields to `StageTiming`

**Implementation steps**:

1. Find the retry/rate-limit handling code (likely in the GitLab HTTP client). Add structured tracing fields:

   ```rust
   info!(
       path = %request_path,
       attempt = attempt_number,
       retry_after_secs = retry_after,
       status_code = 429,
       "Rate limited, retrying"
   );
   ```

2. Add `rate_limit_hits: usize` and `retries: usize` fields to `StageTiming` (with `#[serde(skip_serializing_if = "is_zero")]`).
3. In `MetricsLayer`, count rate-limit and retry events within each span and include the counts in `StageTiming`.
4. In `-v` mode, the existing stderr layer already shows INFO+ events, so retry activity becomes visible automatically. No additional work is needed beyond step 1.
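The per-span counting in step 3 can be sketched without any tracing machinery. In this std-only sketch, span ids are plain `u64` stand-ins for tracing's `span::Id`, and treating every non-429 event as a retry is a simplification:

```rust
use std::collections::HashMap;

/// Per-span retry / rate-limit counters, as MetricsLayer might keep them.
#[derive(Default)]
pub struct EventCounters {
    per_span: HashMap<u64, Counts>,
}

#[derive(Default, Clone, Copy, Debug, PartialEq)]
pub struct Counts {
    pub rate_limit_hits: usize,
    pub retries: usize,
}

impl EventCounters {
    /// Called for each retry-related event observed inside a span.
    pub fn on_event(&mut self, span_id: u64, status_code: Option<u16>) {
        let counts = self.per_span.entry(span_id).or_default();
        if status_code == Some(429) {
            counts.rate_limit_hits += 1;
        } else {
            counts.retries += 1;
        }
    }

    /// Drained when the span closes, to fold into its StageTiming.
    pub fn take(&mut self, span_id: u64) -> Counts {
        self.per_span.remove(&span_id).unwrap_or_default()
    }
}
```

Draining on span close keeps the map from growing across a long sync: each stage's counters live only as long as its span.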

**Tests**:

- Unit: rate-limit log events include all required structured fields
- Unit: `StageTiming` serialization includes `rate_limit_hits` and `retries` when non-zero, omits them when zero
- Integration: simulate a 429 response, verify the log line has `path`, `attempt`, `retry_after_secs` fields
- Integration: after simulated retries, `StageTiming` counts match the expected values

---

## 6. Acceptance Criteria

### 6.1 Phase 1: Verbosity Flags + Structured File Logging

**Functional criteria**:

- [ ] `lore sync` writes JSON log lines to `~/.local/share/lore/logs/lore.YYYY-MM-DD.log` with zero configuration.
- [ ] `lore -v sync` shows DEBUG-level `lore::*` output on stderr; dependency output stays at WARN.
- [ ] `lore -vv sync` shows DEBUG-level `lore::*` + INFO-level dependency output on stderr.
- [ ] `lore -vvv sync` shows TRACE-level output for everything on stderr.
- [ ] `RUST_LOG=lore::gitlab=trace lore sync` overrides `-v` flags for both stderr and file layers.
- [ ] `lore --log-format json sync` emits JSON-formatted log lines on stderr (same schema as the file layer).
- [ ] Log files rotate daily (new file per calendar day).
- [ ] Files matching `lore.*.log` older than `retention_days` are deleted on startup.
- [ ] Existing behavior is unchanged when no new flags are passed (INFO on stderr, human-readable format).
- [ ] `--quiet` suppresses non-error stderr output. `-q` + `-v` together: `-q` wins (stderr at WARN+).
- [ ] `--quiet` does NOT affect file logging (the file layer remains at DEBUG+).
- [ ] `lore doctor` reports: log directory path, number of log files, total disk usage in bytes. Robot mode includes a `logging` field in the checks JSON.
- [ ] The file layer always logs at DEBUG+ for the `lore::*` crate regardless of `-v` flags.

**Test specifications**:

- `test_verbosity_filter_construction`: Given verbosity count 0/1/2/3, assert the resulting `EnvFilter` matches the expected directives table.
- `test_rust_log_overrides_verbose`: Set `RUST_LOG=lore=trace`, pass `-v` (count=1), assert the filter uses TRACE (not DEBUG).
- `test_quiet_overrides_verbose`: Pass `-q` and `-v` together, assert the stderr filter is WARN+.
- `test_logging_config_defaults`: Deserialize an empty `{}` JSON as `LoggingConfig`, assert `retention_days=30`, `file_logging=true`, `log_dir=None`.
- `test_logging_config_partial`: Deserialize `{"retention_days": 7}`, assert the `file_logging=true` default is preserved.
- `test_log_retention_cleanup`: Create a temp dir with files named `lore.2026-01-01.log` through `lore.2026-02-04.log`. Run retention with `retention_days=7`. Assert files older than 7 days are deleted, recent files preserved.
- `test_log_retention_ignores_non_log_files`: Create a temp dir with `lore.2026-01-01.log` and `other.txt`. Run retention. Assert `other.txt` is NOT deleted.
- `test_json_log_output_format`: Capture file-layer output, parse each line as JSON, assert keys: `timestamp`, `level`, `target`, `fields`, `span`.
- `test_suspending_writer_dual_layer`: Run a tracing event with both layers active and a progress bar. Assert no garbled output on stderr (no interleaved progress-bar fragments in log lines).
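The directives table behind `test_verbosity_filter_construction` can be captured as a pure function. The directive strings below are assumptions, chosen only to be consistent with the functional criteria above (default INFO for `lore`, deps at WARN, `-q` wins over `-v`):

```rust
/// Map the -v / -q flags to an EnvFilter directive string for the
/// stderr layer. Sketch only; RUST_LOG, when set, would override this.
pub fn stderr_directives(verbosity: u8, quiet: bool) -> String {
    if quiet {
        return "warn".to_string(); // -q wins, even combined with -v
    }
    match verbosity {
        0 => "warn,lore=info".to_string(),  // default: lore at INFO, deps at WARN
        1 => "warn,lore=debug".to_string(), // -v: lore at DEBUG, deps stay WARN
        2 => "info,lore=debug".to_string(), // -vv: deps at INFO too
        _ => "trace".to_string(),           // -vvv: TRACE everything
    }
}
```

Keeping the mapping in one pure function makes the first three tests in the list above table-driven and trivial.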

### 6.2 Phase 2: Spans + Correlation IDs

**Functional criteria**:

- [ ] Every log line within a sync run includes `run_id` in the JSON span context.
- [ ] `jq 'select(.spans[] | .run_id != null)' lore.2026-02-04.log` extracts all lines from a run.
- [ ] Nested spans produce a chain: log lines inside `fetch_pages` include both the `fetch_pages` span and the parent `ingest_issues` span in their span context.
- [ ] `run_id` is an 8-character hex string (truncated UUIDv4).
- [ ] Spans are visible in `-vv` stderr output as bracketed context.

**Test specifications**:

- `test_run_id_format`: Generate 100 run_ids, assert each is 8 characters, all hex.
- `test_run_id_uniqueness`: Generate 1000 run_ids, assert no duplicates.
- `test_span_context_in_json_logs`: Run a mock sync with spans, capture JSON log output, parse and verify each line has a `spans` array containing `run_id`.
- `test_nested_span_chain`: Create parent span "sync" with child "ingest_issues" with child "fetch_pages". Emit a log event inside "fetch_pages". Assert the JSON log line's span chain includes all three span names.
- `test_span_elapsed_on_close`: Create a span, sleep 10 ms, close it. Verify the close event records `elapsed_ms >= 10`.
### 6.3 Phase 3: Performance Metrics Collection

**Functional criteria**:

- [ ] `lore --robot sync` JSON includes `meta.run_id` (string) and `meta.stages` (array).
- [ ] Each stage in `meta.stages` has: `name`, `elapsed_ms`, `items_processed`.
- [ ] Top-level stages (ingest_issues, ingest_mrs, generate_docs, embed) have `sub_stages` arrays.
- [ ] Sub-stages include a `project` field when applicable.
- [ ] `lore sync` (interactive) prints a timing summary table on stderr, gated behind `IngestDisplay::show_text`.
- [ ] `lore -q sync` does NOT print the timing summary.
- [ ] Zero-value fields (`items_skipped: 0`, `errors: 0`) are omitted from JSON output.

**Test specifications**:

- `test_stage_timing_serialization`: Create a `StageTiming` with sub_stages, serialize to JSON, assert the structure matches the PRD example.
- `test_stage_timing_zero_fields_omitted`: Create a `StageTiming` with `errors: 0`, serialize, assert no `errors` key in the output.
- `test_metrics_layer_single_span`: Create a `MetricsLayer`, enter/exit one span with recorded fields, extract timings, assert one `StageTiming` entry.
- `test_metrics_layer_nested_spans`: Create parent + child spans, extract timings, assert the parent has the child in `sub_stages`.
- `test_metrics_layer_parallel_spans`: Create two sibling spans (simulating two projects), extract timings, assert both appear as sub_stages of the parent.
- `test_sync_meta_includes_stages`: Mock a sync pipeline, verify the robot JSON output parses correctly with `meta.stages`.
- `test_timing_summary_format`: Capture stderr during an interactive sync, verify the timing table format matches the PRD example.
### 6.4 Phase 4: Sync History Enrichment

**Functional criteria**:

- [ ] `lore sync` creates a row in `sync_runs` with status='running' at start, updated to 'succeeded'/'failed' at finish.
- [ ] `lore ingest issues` also creates a `sync_runs` row.
- [ ] `sync_runs.run_id` matches the `run_id` in log files and robot JSON.
- [ ] `sync_runs.metrics_json` contains the serialized `Vec<StageTiming>`.
- [ ] `sync_runs.total_items_processed` and `total_errors` are populated.
- [ ] `lore sync-status` shows the last 10 runs with: run_id, timestamp, duration, item count, error count.
- [ ] `lore --robot sync-status` JSON includes a `runs` array with `stages` parsed from `metrics_json`.
- [ ] Failed syncs record the error message and any partial metrics collected before failure.
- [ ] Migration 014 applies cleanly and is idempotent (safe to re-run).

**Test specifications**:

- `test_sync_run_recorder_start`: Call `start()`, query sync_runs, assert one row with status='running'.
- `test_sync_run_recorder_succeed`: Call `start()` then `succeed()`, assert the row has status='succeeded', finished_at set, metrics_json parseable.
- `test_sync_run_recorder_fail`: Call `start()` then `fail()`, assert the row has status='failed', error set.
- `test_sync_run_recorder_fail_with_partial_metrics`: Call `start()`, collect some metrics, then `fail()`. Assert metrics_json contains the partial data.
- `test_migration_014_applies`: Apply all migrations 001-014 on a fresh DB. Assert `sync_runs` has `run_id`, `total_items_processed`, `total_errors` columns.
- `test_migration_014_idempotent`: Apply migration 014 twice. Assert no error on the second apply.
- `test_sync_status_shows_runs`: Insert 3 sync_runs rows, run `print_sync_status()`, assert the output includes all 3 with correct formatting.
- `test_sync_status_json_includes_stages`: Insert a sync_runs row with metrics_json, run robot-mode sync-status, parse the JSON, assert `runs[0].stages` is an array.
### 6.5 Phase 5: Rate Limit + Retry Instrumentation

**Functional criteria**:

- [ ] Rate-limit events (HTTP 429) log at INFO with structured fields: `path`, `attempt`, `retry_after_secs`, `status_code`.
- [ ] Retry events (non-429 transient errors) log with: `path`, `attempt`, `error`.
- [ ] `StageTiming` includes `rate_limit_hits` and `retries` counts (omitted when zero).
- [ ] `lore -v sync` shows retry activity on stderr (visible because it's INFO+).
- [ ] Rate-limit counts are included in the `metrics_json` stored in `sync_runs`.

**Test specifications**:

- `test_rate_limit_log_fields`: Simulate a 429 response, capture log output, parse the JSON, assert fields: `path`, `attempt`, `retry_after_secs`, `status_code`.
- `test_retry_log_fields`: Simulate a transient error + retry, capture the log, assert fields: `path`, `attempt`, `error`.
- `test_stage_timing_rate_limit_counts`: Simulate 3 rate-limit hits within a span, extract the `StageTiming`, assert `rate_limit_hits == 3`.
- `test_stage_timing_retry_counts`: Simulate 2 retries, extract the `StageTiming`, assert `retries == 2`.
- `test_rate_limit_fields_omitted_when_zero`: Create a `StageTiming` with zero rate limits, serialize, assert no `rate_limit_hits` key.

---

## 7. Resolved Decisions

1. **Log format**: Use `tracing-subscriber`'s built-in JSON formatter (`fmt::layer().json()`). Zero custom code, battle-tested, and ecosystem tools (Grafana Loki, Datadog) already parse this format. The schema difference from our robot JSON envelope is cosmetic and not worth the maintenance burden of a custom formatter.

2. **Span recording**: Always on. lore is I/O-bound (GitLab API + SQLite), so the nanosecond-level overhead of span storage and chain lookup is unmeasurable against our millisecond-scale operations. Conditional recording would add subscriber-construction complexity for zero practical benefit.

3. **Log file location**: `~/.local/share/lore/logs/` (XDG data directory). Logs are NOT reproducible — you can generate new logs, but you cannot regenerate the exact diagnostic output from a past run. They are forensic artifacts that users would notice missing, so they belong in data, not cache.
4. **Retention**: In scope for Phase 1. Startup cleanup: scan the log directory, delete files matching `lore.*.log` older than `retention_days` (default 30). Simple, no background threads, no external dependencies. Runs before subscriber initialization, so the appender never holds a handle to a file being deleted.
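Because ISO dates compare correctly as plain strings, the cleanup needs no date-parsing dependency. A minimal sketch, assuming the cutoff date string is precomputed from `retention_days`:

```rust
use std::fs;
use std::path::Path;

/// Delete files matching `lore.YYYY-MM-DD.log` whose date is older than
/// `cutoff` (an ISO "YYYY-MM-DD" string; lexicographic comparison works
/// for ISO dates). Returns the number of files deleted. Sketch only:
/// real code derives `cutoff` from today's date minus retention_days.
pub fn cleanup_logs(dir: &Path, cutoff: &str) -> std::io::Result<usize> {
    let mut deleted = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let name = entry.file_name().to_string_lossy().into_owned();
        if let Some(date) = name
            .strip_prefix("lore.")
            .and_then(|rest| rest.strip_suffix(".log"))
        {
            // Only well-formed 10-char dates qualify; other files
            // (e.g. other.txt) are left alone.
            if date.len() == 10 && date < cutoff {
                fs::remove_file(entry.path())?;
                deleted += 1;
            }
        }
    }
    Ok(deleted)
}
```

This directly satisfies `test_log_retention_cleanup` and `test_log_retention_ignores_non_log_files` from section 6.1.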

5. **Stage timing granularity**: Per-project with nested sub-stages. When one project has 500 MRs and another has 3, knowing which one consumed the time budget is the difference between "sync was slow" and an actionable diagnosis. The `StageTiming` struct includes an optional `project` field and a `sub_stages: Vec<StageTiming>` field for nesting.

6. **Stage timing collection mechanism**: Materialized from tracing spans, not plumbed manually. A custom `MetricsLayer` in the subscriber stack records span enter/exit/record events and builds the `StageTiming` tree. This avoids threading a mutable collector through every function signature and makes spans the single source of truth for timing data. Phase 2 adds the spans; Phase 3 adds the layer that reads them.

7. **run_id format**: First 8 characters of `Uuid::new_v4().to_string()` (e.g., `"a1b2c3d4"`). The `uuid` crate (v1, v4 feature) is already a dependency, so no new crate is needed. 8 hex characters provide ~4 billion unique values — more than sufficient for local CLI invocations.
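The truncation itself is one line; this sketch takes the full UUID string as input so it stays dependency-free (real code would pass in `Uuid::new_v4().to_string()`):

```rust
/// First 8 characters of a hyphenated UUIDv4 string, e.g. "a1b2c3d4".
/// The first group of a canonical UUID is exactly 8 hex characters,
/// so simple truncation never captures a hyphen.
pub fn short_run_id(full_uuid: &str) -> String {
    full_uuid.chars().take(8).collect()
}
```

`test_run_id_format` in section 6.2 then reduces to checking length and hex-ness of the result.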

8. **File log level**: Always DEBUG+ for the `lore::*` crate and WARN+ for dependencies, regardless of `-v` flags. This ensures post-mortem data is always richer than what was shown on stderr. `RUST_LOG` overrides both layers when set.

9. **sync_runs lifecycle**: The table exists (migration 001) but nothing writes to it. Phase 4 wires up the INSERT (on start) / UPDATE (on finish) lifecycle AND adds the enrichment columns in a single migration. The existing `metrics_json` column stores the detailed `Vec<StageTiming>` array — no need for a separate `stages_json` column.

10. **JSON stderr via --log-format**: A `--log-format text|json` global flag controls the stderr log format. The default is `text` (human-readable). When `json`, stderr uses the same JSON formatter as the file layer, routed through `SuspendingWriter` for progress-bar coordination. This enables `lore sync 2>&1 | jq` workflows without reading log files.

---

## 8. Phase Dependency Graph

```
Phase 1 (Subscriber + Flags)
          |
          v
Phase 2 (Spans + run_id)
          |
    +-----+--------+
    |              |
    v              v
Phase 3          Phase 5
(Metrics)        (Rate Limit Logging)
    |              |
    v              |
Phase 4            |
(Sync History) <---+
```

**Parallelization opportunities**:

- Phase 1 must complete before anything else.
- Phase 2 must complete before Phase 3 or Phase 5.
- Phase 3 and Phase 5 can run in parallel (Phase 5 only needs spans from Phase 2, not the MetricsLayer from Phase 3).
- Phase 4 depends on Phase 3 (it needs `Vec<StageTiming>` to store). Phase 5's `rate_limit_hits`/`retries` fields on `StageTiming` can be added to Phase 4's stored data after Phase 5 completes, or Phase 4 can store them as zero initially.

**Agent assignment suggestion**:

- Agent A: Phase 1 → Phase 2 (sequential, foundational infrastructure)
- Agent B: Phase 3 (after Phase 2 completes)
- Agent C: Phase 5 (after Phase 2 completes, parallel with Phase 3)
- Agent B or D: Phase 4 (after Phase 3 completes)

---

## 9. References

- Gholamian, S. & Ward, P. (2021). "A Comprehensive Survey of Logging in Software." arXiv:2110.12489.
- Duan, S. et al. (2025). "PDLogger: Automated Logging Framework for Practical Software Development." arXiv:2507.19951.
- tokio-rs/tracing ecosystem: `tracing`, `tracing-subscriber`, `tracing-appender`.
- GNU Coding Standards: verbosity and diagnostic output conventions.
- Rust CLI Working Group: recommendations for error reporting and verbosity.