PRD: Observability Infrastructure for lore CLI

Status: Draft
Author: Taylor + Claude
Date: 2026-02-04


1. Problem Statement

lore currently has minimal observability. Logging is ephemeral (stderr only), there are no persistent log files, no performance metrics, no structured JSON log output, no verbosity controls beyond RUST_LOG, and no way to diagnose issues after the fact. When a sync fails at 3 AM in a cron job, or an embedding run takes 10x longer than usual, there is zero forensic data available.

Current State

Capability | Status
Log destination | stderr only, ephemeral
Log persistence | None
Structured output | Human-readable fmt only
Verbosity control | RUST_LOG env var (no CLI flag)
Performance metrics | Ad-hoc Instant::now() in 2 commands
Timing in robot JSON | elapsed_ms in search and sync meta only
Spans / correlation | None
Log rotation | None
Per-stage timing | None
Rate limit / retry visibility | tracing::warn! only
Error aggregation | None
Historical comparison | None

What's Already in Place (to build on)

  • tracing (0.1) + tracing-subscriber (0.3) with env-filter feature
  • Registry-based subscriber initialized in src/main.rs:44-58 with a single fmt::layer() using SuspendingWriter
  • SuspendingWriter (src/cli/progress.rs:25-73) that coordinates log output with indicatif MultiProgress — buffers each log line, calls MULTI.suspend() on drop to clear progress bars before writing to stderr
  • IngestDisplay struct (src/cli/commands/ingest.rs:65-104) controlling UI verbosity with three modes: interactive() / silent() / progress_only()
  • Robot mode JSON envelope: { "ok": true, "data": {...}, "meta": {...} } — used consistently in sync, search, sync-status, and doctor commands
  • XDG-compliant data directory at ~/.local/share/lore/
  • sync_runs table (migration 001) with schema: id, started_at, heartbeat_at, finished_at, status, command, error, metrics_json — exists but is never written to (no INSERT anywhere in the codebase; sync_status.rs reads from it but always gets zero rows)
  • uuid crate (v1, v4 feature) already a dependency
  • Structured fields used in tracing calls (e.g., info!(owner = %self.owner, ...))
  • EnvFilter currently hardcoded: lore=info + warn default directives
  • Global CLI flags in src/cli/mod.rs:9-43: --config, --robot, -J, --color, --quiet (all global = true)

2. Goals

Primary

  1. Post-mortem debugging: Any failed or slow run can be diagnosed after the fact from persistent, structured log files.
  2. Performance visibility: Every sync/ingest/embed/search operation reports granular stage-level timing, both to the terminal and to persistent storage.
  3. Ergonomic verbosity: Users and agents control log verbosity through CLI flags (-v, -vv, -vvv) without needing to know RUST_LOG syntax.
  4. Machine-parseable logs: A JSON log mode for piping into log aggregators (jq, Datadog, Loki, etc.).
  5. Agent-friendly metrics: Robot mode JSON output includes comprehensive timing breakdowns for every command, enabling automated monitoring.

Secondary

  1. Log rotation and retention: Log files don't grow unbounded; old logs are automatically cleaned up.
  2. Correlation IDs: Every sync run gets a unique ID that connects log lines, database records, and robot output.
  3. Rate limit and retry transparency: Every rate-limited request and retry is visible in logs with full context.
  4. Sync history with metrics: The sync_runs table is enriched with per-stage timing, item counts, and error counts so lore sync-status becomes a real dashboard.

Non-Goals

  • External telemetry export (OpenTelemetry, Prometheus) -- out of scope for v1.
  • Real-time log streaming / tailing UI.
  • Alerting or notification systems.
  • Distributed tracing across multiple lore instances.

3. Research Foundation

3.1 The Three Pillars of Observability

Academic and industry consensus (Gholamian & Ward 2021, "A Comprehensive Survey of Logging in Software") identifies three pillars:

  1. Logs -- Discrete events with context. The foundation.
  2. Metrics -- Numerical measurements over time (counters, gauges, histograms).
  3. Traces -- Causally ordered spans representing operations.

For a CLI tool (not a long-running service), the mapping is:

Pillar | CLI Equivalent
Logs | Structured log files per invocation
Metrics | Per-stage timing, item counts, error counts stored in DB
Traces | Span hierarchy within a single invocation (sync -> ingest issues -> fetch page N -> sync discussions)

3.2 Structured Logging Best Practices

From Duan et al. 2025 ("PDLogger: Automated Logging Framework for Practical Software Development") and industry practice:

  • Always structured: JSON or key=value, never free-form prose in production logs.
  • Contextual fields propagate: A sync_run_id set at the top level appears in every downstream log line.
  • Levels have semantic meaning:
    • ERROR: Operation failed, requires attention.
    • WARN: Degraded behavior (rate limited, retry, skip).
    • INFO: Significant state transitions (stage start/complete, items processed).
    • DEBUG: Detailed operational data (page boundaries, individual API calls).
    • TRACE: Wire-level detail (request/response bodies, SQL queries).

3.3 CLI Verbosity Conventions

From the GNU Coding Standards, POSIX conventions, and modern Rust CLI tools (ripgrep, fd, cargo):

Pattern | Meaning | Precedent
(default) | INFO for app, WARN for deps | cargo, rustc
-q / --quiet | Suppress non-error output | ripgrep, fd, cargo
-v | DEBUG for app | ripgrep, fd
-vv | DEBUG for app + deps | cargo
-vvv | TRACE for everything | cargo, curl
RUST_LOG=... | Fine-grained override | Universal in Rust

The -v flag should feel familiar to anyone who has used cargo, curl, or ssh.

3.4 Log File Rotation

tracing-appender (from the tokio-rs/tracing ecosystem) provides:

  • Daily rotation: New file per day, named lore.2026-02-04.log.
  • Non-blocking writes: Dedicated writer thread, zero impact on main async runtime.
  • Configurable retention: Delete files older than N days.

This is the canonical solution in the Rust tracing ecosystem and requires no custom code.

3.5 Performance Metrics for CLI Tools

Inspired by hyperfine's approach to benchmarking and cargo's --timings flag:

  • Report wall-clock time per stage.
  • Report item throughput (items/sec).
  • Store historical runs for trend comparison.
  • Present timing data in both human-readable and machine-readable formats.

4. Design

4.1 Architecture Overview

                          CLI Invocation
                               |
                    +----------+----------+
                    |                     |
              Interactive Mode       Robot Mode
                    |                     |
         +---stderr (human fmt)    stdout (JSON envelope)
         |         |                     |
         |   progress bars          { ok, data, meta: {
         |   colored output           elapsed_ms,
         |                            stages: [...],
         |                            run_id
         |                          }}
         |
    Log Subscribers (layered)
         |
    +----+----+--------+
    |         |        |
  stderr    file    (future:
  (fmt)    (JSON)   OTLP)

4.2 Subscriber Stack

Replace the current single-layer subscriber with a layered registry. Each layer has its own filter:

registry()
  .with(stderr_layer.with_filter(stderr_filter))   // Human-readable, SuspendingWriter, -v controlled
  .with(file_layer.with_filter(file_filter))        // JSON, daily rotation, always DEBUG+

stderr layer: Same fmt::layer() as today with SuspendingWriter, but level controlled by -v flags. When --log-format json is passed, this layer switches to fmt::layer().json() (same JSON format as file layer, but still routed through SuspendingWriter for progress bar coordination).

file layer: Always-on JSON output to ~/.local/share/lore/logs/, daily rotation via tracing-appender. Uses its own EnvFilter set to lore=debug,warn regardless of -v flags, ensuring post-mortem data is always available. The file layer does NOT use SuspendingWriter — it writes to a file, not stderr, so progress bar coordination is unnecessary.

Filter architecture: Per-layer filtering (not a single shared EnvFilter) is required because the file layer must always be at DEBUG+ while stderr follows -v. tracing-subscriber's Layer::with_filter() method enables this.

RUST_LOG override: When RUST_LOG is set, it overrides BOTH layer filters. This is the expert escape hatch.

Current subscriber (src/main.rs:44-58):

tracing_subscriber::registry()
    .with(
        tracing_subscriber::fmt::layer()
            .with_target(false)
            .with_writer(lore::cli::progress::SuspendingWriter),
    )
    .with(
        EnvFilter::from_default_env()
            .add_directive("lore=info".parse().unwrap())
            .add_directive("warn".parse().unwrap()),
    )
    .init();

This will be replaced by the dual-layer setup. The SuspendingWriter integration and with_target(false) on the stderr layer remain unchanged.

4.3 Verbosity Levels

stderr layer (controlled by -v flags)

Flags | App Level | Dep Level | Behavior
(none) | INFO | WARN | Default. Stage transitions, summaries.
-q | WARN | ERROR | Errors and warnings only.
-v | DEBUG | WARN | Detailed app behavior. API pages, skip reasons.
-vv | DEBUG | INFO | App + dependency detail. HTTP client, SQLite.
-vvv | TRACE | DEBUG | Everything. Wire-level detail.
RUST_LOG=... | (overrides all) | (overrides all) | Expert escape hatch.

Precedence: RUST_LOG > -v flags > defaults. This matches cargo's behavior.
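The flag table and precedence rule can be sketched as a pure mapping. This is a sketch only — the stderr_directives name and exact directive strings are assumptions; the real code would feed the result into tracing_subscriber's EnvFilter:

```rust
/// Map RUST_LOG / -q / -v to the stderr layer's filter directive string.
/// Precedence: RUST_LOG > -q > -v count > defaults.
fn stderr_directives(rust_log: Option<&str>, verbosity: u8, quiet: bool) -> String {
    if let Some(spec) = rust_log {
        return spec.to_string(); // expert escape hatch overrides everything
    }
    if quiet {
        return "lore=warn,error".to_string(); // -q wins over any -v count
    }
    match verbosity {
        0 => "lore=info,warn",  // default: INFO app, WARN deps
        1 => "lore=debug,warn", // -v
        2 => "lore=debug,info", // -vv
        _ => "lore=trace,debug", // -vvv and beyond
    }
    .to_string()
}

fn main() {
    assert_eq!(stderr_directives(None, 0, false), "lore=info,warn");
    assert_eq!(stderr_directives(None, 1, false), "lore=debug,warn");
    assert_eq!(stderr_directives(None, 3, true), "lore=warn,error");
    assert_eq!(stderr_directives(Some("lore::gitlab=trace"), 2, false), "lore::gitlab=trace");
    println!("ok");
}
```

The same shape, with a fixed "lore=debug,warn" default branch, would build the file layer's filter.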

file layer (independent of -v flags)

Condition | App Level | Dep Level
Always (default) | DEBUG | WARN
RUST_LOG=... set | (overrides) | (overrides)

The file layer always captures DEBUG+ for the lore crate and WARN+ for dependencies. This ensures post-mortem data is available even when the user ran with default stderr verbosity. RUST_LOG overrides both layers when set.

New CLI flags

Add to the Cli struct (src/cli/mod.rs):

/// Increase log verbosity (-v, -vv, -vvv)
#[arg(short = 'v', long = "verbose", action = clap::ArgAction::Count, global = true)]
pub verbose: u8,

/// Log format for stderr output: text (default) or json
#[arg(long = "log-format", global = true, value_parser = ["text", "json"], default_value = "text")]
pub log_format: String,

The -v flag uses clap::ArgAction::Count to support -v, -vv, -vvv as a single flag with increasing count. The --log-format flag controls whether stderr emits human-readable or JSON-formatted log lines.

4.4 Structured Log File Output

Location: ~/.local/share/lore/logs/lore.YYYY-MM-DD.log

Format: One JSON object per line (JSONL), produced by tracing-subscriber's fmt::layer().json():

{"timestamp":"2026-02-04T14:32:01.123Z","level":"INFO","target":"lore::ingestion","fields":{"message":"Discussion sync complete","project":"group/repo","issues_synced":42,"elapsed_ms":1234},"span":{"name":"ingest_issues","run_id":"a1b2c3"}}

Rotation: Daily via tracing-appender::rolling::daily().

Retention: Configurable, default 30 days. A logs.retention_days config field. Cleanup runs at startup (check directory, delete files older than N days).
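Retention can stay dependency-free because the rotated filenames embed the date in sortable YYYY-MM-DD form, so lexicographic comparison is chronological. A minimal sketch, assuming a hypothetical clean_old_logs helper and a precomputed cutoff date string (the real code would derive the cutoff from today minus retention_days):

```rust
use std::fs;
use std::path::Path;

/// Delete rotated files named lore.YYYY-MM-DD.log whose date is before
/// `cutoff`. Files not matching the pattern (e.g. other.txt) are left alone.
fn clean_old_logs(dir: &Path, cutoff: &str) -> std::io::Result<usize> {
    let mut deleted = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let name = entry.file_name();
        let name = name.to_string_lossy();
        if let Some(date) = name
            .strip_prefix("lore.")
            .and_then(|rest| rest.strip_suffix(".log"))
        {
            // "2026-01-01" < "2026-01-28" lexicographically and by date.
            if date < cutoff {
                fs::remove_file(entry.path())?;
                deleted += 1;
            }
        }
    }
    Ok(deleted)
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("lore-retention-demo");
    let _ = fs::remove_dir_all(&dir); // start from a clean slate
    fs::create_dir_all(&dir)?;
    for d in ["2026-01-01", "2026-01-20", "2026-02-04"] {
        fs::write(dir.join(format!("lore.{d}.log")), "{}")?;
    }
    fs::write(dir.join("other.txt"), "keep me")?;
    let deleted = clean_old_logs(&dir, "2026-01-28")?;
    assert_eq!(deleted, 2);
    assert!(dir.join("lore.2026-02-04.log").exists());
    assert!(dir.join("other.txt").exists());
    fs::remove_dir_all(&dir)?;
    println!("ok");
    Ok(())
}
```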

4.5 Tracing Spans

Introduce spans for causal correlation within a single invocation:

sync (run_id=uuid)
  +-- ingest_issues
  |     +-- fetch_pages (project="group/repo")
  |     +-- sync_discussions (project="group/repo")
  |     +-- fetch_resource_events (project="group/repo")
  +-- ingest_mrs
  |     +-- fetch_pages (project="group/repo")
  |     +-- sync_discussions (project="group/repo")
  +-- generate_docs
  +-- embed

Each span records elapsed_ms on close. The run_id propagates to all child spans and log events, enabling jq 'select(.spans[]?.run_id == "a1b2c3")' lore.2026-02-04.log to extract an entire run (the root span carrying run_id appears in each line's span chain).

4.6 Performance Metrics

4.6.1 Per-Stage Timing

Every command collects a Vec<StageTiming>:

#[derive(Debug, Clone, Serialize)]
pub struct StageTiming {
    pub name: String,                       // "ingest_issues", "fetch_pages", etc.
    #[serde(skip_serializing_if = "Option::is_none")]
    pub project: Option<String>,            // Which project, if applicable
    pub elapsed_ms: u64,
    pub items_processed: usize,
    #[serde(skip_serializing_if = "is_zero")]
    pub items_skipped: usize,
    #[serde(skip_serializing_if = "is_zero")]
    pub errors: usize,
    #[serde(skip_serializing_if = "Vec::is_empty")]
    pub sub_stages: Vec<StageTiming>,       // Nested child stages
}

Collection mechanism: Stage timing is materialized from tracing spans, not plumbed manually through function signatures. Phase 2 adds #[instrument] spans to each sync stage. Phase 3 adds a custom tracing-subscriber layer that records span enter/exit times and structured fields, then extracts the span tree into Vec<StageTiming> when the root span closes.

This means:

  • No mutable timing collector threaded through run_ingest -> fetch_pages -> sync_discussions
  • Spans are the single source of truth for timing
  • StageTiming is a materialized view of the span tree
  • The custom layer implements on_close to capture elapsed and on_record to capture structured fields like items_processed

Where to define: src/core/metrics.rs (new file — genuinely new functionality that doesn't fit in any existing file)
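The extraction described above can be sketched without the tracing machinery: model each closed span as a flat record (children always close before their parents) and fold the records into a nested tree. The SpanRecord shape and the materialize name are assumptions for illustration, not the real MetricsLayer:

```rust
use std::collections::HashMap;

/// Simplified stand-in for what a custom Layer's on_close would record,
/// emitted in close order (leaf spans first).
struct SpanRecord {
    id: u64,
    parent: Option<u64>,
    name: &'static str,
    elapsed_ms: u64,
    items_processed: usize,
}

#[derive(Debug)]
struct StageTiming {
    name: String,
    elapsed_ms: u64,
    items_processed: usize,
    sub_stages: Vec<StageTiming>,
}

/// Materialize the span tree into Vec<StageTiming>.
fn materialize(records: Vec<SpanRecord>) -> Vec<StageTiming> {
    // Finished children, keyed by parent span id; None = root level.
    let mut children: HashMap<Option<u64>, Vec<StageTiming>> = HashMap::new();
    for r in records {
        // r's children closed earlier, so they are already collected.
        let subs = children.remove(&Some(r.id)).unwrap_or_default();
        children.entry(r.parent).or_default().push(StageTiming {
            name: r.name.to_string(),
            elapsed_ms: r.elapsed_ms,
            items_processed: r.items_processed,
            sub_stages: subs,
        });
    }
    children.remove(&None).unwrap_or_default()
}

fn main() {
    // sync (root) -> ingest_issues -> fetch_pages, closing leaf-first.
    let records = vec![
        SpanRecord { id: 3, parent: Some(2), name: "fetch_pages", elapsed_ms: 5200, items_processed: 150 },
        SpanRecord { id: 2, parent: Some(1), name: "ingest_issues", elapsed_ms: 12340, items_processed: 150 },
        SpanRecord { id: 1, parent: None, name: "sync", elapsed_ms: 45230, items_processed: 235 },
    ];
    let stages = materialize(records);
    assert_eq!(stages.len(), 1);
    assert_eq!(stages[0].name, "sync");
    assert_eq!(stages[0].sub_stages[0].name, "ingest_issues");
    assert_eq!(stages[0].sub_stages[0].sub_stages[0].elapsed_ms, 5200);
    println!("ok");
}
```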

4.6.2 Robot JSON Meta Enhancement

Currently:

{ "ok": true, "data": {...}, "meta": { "elapsed_ms": 1234 } }

Proposed:

{
  "ok": true,
  "data": { ... },
  "meta": {
    "run_id": "a1b2c3d4",
    "elapsed_ms": 45230,
    "stages": [
      {
        "name": "ingest_issues",
        "elapsed_ms": 12340,
        "items_processed": 150,
        "items_skipped": 30,
        "errors": 0,
        "sub_stages": [
          { "name": "fetch_pages", "project": "group/repo", "elapsed_ms": 5200, "items_processed": 150 },
          { "name": "sync_discussions", "project": "group/repo", "elapsed_ms": 6800, "items_processed": 42, "items_skipped": 108 }
        ]
      },
      {
        "name": "ingest_mrs",
        "elapsed_ms": 18900,
        "items_processed": 85,
        "items_skipped": 12,
        "errors": 1
      },
      { "name": "generate_docs", "elapsed_ms": 8500, "items_processed": 235 },
      { "name": "embed", "elapsed_ms": 5490, "items_processed": 1024 }
    ]
  }
}

4.6.3 Sync History Enrichment

Prerequisite bug fix: The sync_runs table (migration 001) exists with columns id, started_at, heartbeat_at, finished_at, status, command, error, metrics_json — but no code ever writes to it. The sync_status.rs command reads from it but always gets zero rows. This must be fixed before enrichment.

Step 1: Wire up sync_runs lifecycle (prerequisite, in Phase 4)

Add INSERT/UPDATE calls to the sync and ingest command handlers:

// On sync/ingest start:
INSERT INTO sync_runs (started_at, heartbeat_at, status, command)
VALUES (?now_ms, ?now_ms, 'running', ?command_name)
RETURNING id;

// On sync/ingest success:
UPDATE sync_runs
SET finished_at = ?now_ms, status = 'succeeded', metrics_json = ?metrics
WHERE id = ?run_id;

// On sync/ingest failure:
UPDATE sync_runs
SET finished_at = ?now_ms, status = 'failed', error = ?error_msg, metrics_json = ?metrics
WHERE id = ?run_id;

Where: Add a SyncRunRecorder helper in src/core/db.rs or src/core/sync_run.rs that encapsulates the INSERT/UPDATE lifecycle. Called from run_sync() in src/cli/commands/sync.rs and run_ingest() in src/cli/commands/ingest.rs.

Step 2: Schema migration (migration 014)

Add dedicated queryable columns alongside the existing metrics_json:

-- Migration 014: sync_runs enrichment for observability
ALTER TABLE sync_runs ADD COLUMN run_id TEXT;
ALTER TABLE sync_runs ADD COLUMN total_items_processed INTEGER DEFAULT 0;
ALTER TABLE sync_runs ADD COLUMN total_errors INTEGER DEFAULT 0;

-- Index for correlation queries
CREATE INDEX idx_sync_runs_run_id ON sync_runs(run_id);

The existing metrics_json column stores the detailed Vec<StageTiming> as a JSON array. No need for a separate stages_json column.

Step 3: Enhanced sync-status display

lore sync-status (src/cli/commands/sync_status.rs) currently shows only the last run. Enhance to show recent runs with metrics:

Recent sync runs:
  Run a1b2c3 | 2026-02-04 14:32 | 45.2s | 235 items | 1 error
  Run d4e5f6 | 2026-02-03 14:30 | 38.1s | 220 items | 0 errors
  Run g7h8i9 | 2026-02-02 14:29 | 42.7s | 228 items | 0 errors

Robot mode (lore --robot sync-status):

{
  "ok": true,
  "data": {
    "runs": [
      {
        "run_id": "a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d",
        "started_at": "2026-02-04T14:32:01.123Z",
        "elapsed_ms": 45230,
        "status": "succeeded",
        "command": "sync",
        "total_items_processed": 235,
        "total_errors": 1,
        "stages": [...]
      }
    ],
    "cursors": [...],
    "summary": {...}
  }
}

The stages array is parsed from metrics_json and included in the robot output. Interactive mode shows the summary table above; lore --robot sync-status --run a1b2c3 shows a single run's full stage breakdown.

4.6.4 Human-Readable Timing

At the end of lore sync (interactive mode), print a timing summary:

Sync complete in 45.2s
  Ingest issues .... 12.3s (150 items, 42 discussions)
  Ingest MRs ....... 18.9s (85 items, 1 error)
  Generate docs .... 8.5s  (235 documents)
  Embed ............ 5.5s  (1024 chunks)

Gated behind display.show_text so it doesn't appear in progress_only or silent modes.
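One way to render the dot-padded rows above, assuming a hypothetical timing_line helper (the column width of 17 is chosen to match the sample, not mandated anywhere):

```rust
/// Render one line of the sync timing summary with dot-padded alignment,
/// e.g. "  Ingest issues .... 12.3s (150 items, 42 discussions)".
fn timing_line(label: &str, elapsed_ms: u64, detail: &str) -> String {
    // Pad with dots out to a fixed column; always show at least two dots.
    let dots = ".".repeat(17usize.saturating_sub(label.len()).max(2));
    format!("  {label} {dots} {:.1}s ({detail})", elapsed_ms as f64 / 1000.0)
}

fn main() {
    let line = timing_line("Ingest issues", 12340, "150 items, 42 discussions");
    assert_eq!(line, "  Ingest issues .... 12.3s (150 items, 42 discussions)");
    println!("{line}");
}
```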

4.7 Rate Limit and Retry Transparency

Currently, rate limits emit a tracing::warn!. Enhance to:

  • Log at INFO level (not just WARN) with structured fields: info!(path, attempt, retry_after_secs, "Rate limited, retrying").
  • Count total rate-limit hits per run and include in stage timing.
  • In -v mode, show retry progress on stderr: Retrying /api/v4/projects/123/issues (429, waiting 2s).

4.8 Configuration

Add a new logging section to Config (src/core/config.rs):

#[derive(Debug, Clone, Deserialize)]
pub struct LoggingConfig {
    /// Directory for log files. Default: ~/.local/share/lore/logs/
    #[serde(default)]
    pub log_dir: Option<String>,

    /// Days to retain log files. Default: 30. Set to 0 to disable file logging.
    #[serde(default = "default_retention_days")]
    pub retention_days: u32,

    /// Enable JSON log files. Default: true.
    #[serde(default = "default_true")]
    pub file_logging: bool,
}

fn default_retention_days() -> u32 { 30 }
fn default_true() -> bool { true }

Add to the Config struct:

#[serde(default)]
pub logging: LoggingConfig,

With config.json:

{
  "logging": {
    "log_dir": null,
    "retention_days": 30,
    "file_logging": true
  }
}

Defaults are sane so existing configs continue working with zero changes.

CLI flags (added to Cli struct in src/cli/mod.rs):

Flag | Type | Default | Description
-v / --verbose | count (u8) | 0 | Increase stderr log verbosity. Stacks: -v, -vv, -vvv.
--log-format | text or json | text | Stderr log format. json emits one JSON object per log line (same schema as file layer).

These are global flags (global = true) consistent with the existing --quiet, --robot, etc.


5. Implementation Plan

Phase 1: Verbosity Flags + Structured File Logging

Scope: CLI flags, dual-layer subscriber, file logging, rotation, retention, --log-format.

Files touched:

  • Cargo.toml — add tracing-appender dependency
  • src/cli/mod.rs — add -v/--verbose (count) and --log-format flags to Cli struct
  • src/main.rs — replace subscriber initialization (lines 44-58) with dual-layer setup
  • src/core/config.rs — add LoggingConfig struct and logging field to Config
  • src/core/paths.rs — add get_log_dir() helper (XDG data dir + /logs/)
  • src/cli/commands/doctor.rs — add log file location and disk usage check

Implementation steps:

  1. Add -v / --verbose (count, u8) and --log-format (text|json) flags to Cli struct.
  2. Add tracing-appender dependency to Cargo.toml.
  3. Add LoggingConfig to Config with #[serde(default)].
  4. Add get_log_dir() to src/core/paths.rs (mirrors get_db_path() pattern).
  5. Replace subscriber init in main.rs:
    • Build stderr_filter from -v count (or RUST_LOG if set).
    • Build file_filter as lore=debug,warn (or RUST_LOG if set).
    • stderr layer: fmt::layer().with_writer(SuspendingWriter) with stderr_filter. When --log-format json, chain .json().
    • file layer: fmt::layer().json().with_writer(tracing_appender::rolling::daily(log_dir, "lore")) with file_filter.
    • Combine via registry().with(stderr_layer.with_filter(stderr_filter)).with(file_layer.with_filter(file_filter)).
  6. Implement log retention at startup: scan log_dir, delete files matching lore.*.log pattern older than retention_days. Run before subscriber init so deleted files aren't held open.
  7. Add log file check to lore doctor: report log directory path, number of log files, total disk usage. In robot mode, add a logging field to DoctorChecks with log_dir, file_count, total_bytes, oldest_file.

New dependencies: tracing-appender (0.2)

Interaction with -q/--quiet: The existing --quiet flag suppresses non-error terminal output via IngestDisplay::silent(). It should NOT affect file logging (file layer is always on). When -q and -v are both passed, -q wins for stderr (set stderr filter to WARN+). File layer remains at DEBUG+.

Tests (see Section 6.1 for details):

  • Unit: EnvFilter construction from verbosity count (0→INFO, 1→DEBUG, 2→DEBUG+deps, 3→TRACE)
  • Unit: RUST_LOG overrides -v flags
  • Unit: -q + -v interaction (quiet wins)
  • Unit: LoggingConfig deserialization with missing/partial/full fields
  • Unit: Log retention deletes old files, preserves recent ones
  • Integration: Subscriber produces JSON lines to a test file
  • Integration: SuspendingWriter still works with dual-layer stack (no garbled output)

Phase 2: Spans + Correlation IDs

Scope: Tracing spans, UUID-based run_id, span recording for JSON logs.

Depends on: Phase 1 (subscriber must support span recording).

Files touched:

  • src/cli/commands/sync.rs — add root span with run_id field to run_sync()
  • src/cli/commands/ingest.rs — add #[instrument] spans to run_ingest() and its stages
  • src/ingestion/orchestrator.rs — add spans for fetch_pages, sync_discussions, fetch_resource_events
  • src/documents/regenerator.rs — add span for generate_docs stage
  • src/embedding/pipeline.rs — add span for embed stage
  • src/main.rs — generate run_id before calling command handler, pass as field

Implementation steps:

  1. Generate run_id by taking the first 8 characters of Uuid::new_v4().to_string() at command entry in main.rs. No new dependency needed — uuid v1 with the v4 feature is already in Cargo.toml.
  2. Create root span: let _root = tracing::info_span!("sync", run_id = %run_id).entered(); (or equivalent for each command).
  3. Add #[instrument(skip_all, fields(stage = "ingest_issues"))] to ingest stages.
  4. Add #[instrument(skip_all, fields(project = %project_path))] to per-project functions.
  5. Ensure the file layer's JSON formatter includes span context. tracing-subscriber's fmt::layer().json() includes the current span chain by default when the registry has span storage enabled.
  6. Verify: parse a log file, confirm every line includes span.run_id.

New dependencies: None (uuid already present).
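The format that test_run_id_format expects can be captured in a small validator (is_valid_run_id is a hypothetical name; the real id is the truncated Uuid::new_v4() string, which is lowercase hex in its first 8 characters):

```rust
/// Check that a run_id looks like the first 8 chars of a hyphenated UUIDv4:
/// exactly 8 lowercase hex digits.
fn is_valid_run_id(id: &str) -> bool {
    id.len() == 8
        && id.chars().all(|c| c.is_ascii_hexdigit() && !c.is_ascii_uppercase())
}

fn main() {
    // First 8 chars of a sample UUID string (hyphen only appears at index 8).
    let full = "a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d";
    assert!(is_valid_run_id(&full[..8]));
    assert!(!is_valid_run_id("a1b2c3"));   // too short
    assert!(!is_valid_run_id("a1b2c3dZ")); // non-hex character
    println!("ok");
}
```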

Tests:

  • Unit: run_id is a valid 8-character hex string
  • Integration: Run a sync-like operation with spans, parse JSON log output, verify every line contains run_id in span context
  • Integration: Nested spans produce correct parent-child relationships in JSON output

Phase 3: Performance Metrics Collection

Scope: StageTiming struct, span-to-metrics extraction, robot JSON enrichment, timing summary.

Depends on: Phase 2 (spans must exist to extract timing from).

Files touched:

  • src/core/metrics.rs — new file: StageTiming struct, MetricsLayer (custom tracing layer), span-to-timing extraction
  • src/cli/commands/sync.rs — consume Vec<StageTiming> from MetricsLayer, include in SyncMeta
  • src/cli/commands/ingest.rs — same pattern for standalone ingest
  • src/main.rs — register MetricsLayer in the subscriber stack

Implementation steps:

  1. Define StageTiming struct with sub_stages: Vec<StageTiming> in src/core/metrics.rs.
  2. Implement MetricsLayer as a custom tracing_subscriber::Layer:
    • on_new_span: Record span ID, name, parent, start time.
    • on_record: Capture structured fields (items_processed, items_skipped, errors) recorded via Span::record().
    • on_close: Calculate elapsed_ms, build StageTiming entry, attach to parent.
    • Provide fn extract_timings(&self, run_id: &str) -> Vec<StageTiming> to materialize the span tree after the root span closes.
  3. Store MetricsLayer reference (behind Arc) so command handlers can call extract_timings() after run_sync() completes.
  4. Extend SyncMeta and SyncJsonOutput to include run_id: String and stages: Vec<StageTiming>.
  5. Print human-readable timing summary at end of interactive sync (gated behind IngestDisplay::show_text).

Span field recording: Sync stages must record item counts as span fields for MetricsLayer to capture:

let span = tracing::info_span!("ingest_issues");
let _guard = span.enter();
// ... do work ...
span.record("items_processed", count);
span.record("items_skipped", skipped);

Tests:

  • Unit: StageTiming serialization matches expected JSON (including nested sub_stages)
  • Unit: MetricsLayer correctly builds span tree from synthetic span events
  • Unit: MetricsLayer handles spans with no children (leaf stages like embed)
  • Unit: MetricsLayer handles concurrent spans (multiple projects in parallel)
  • Integration: lore --robot sync output includes meta.stages array with correct nesting
  • Integration: Interactive sync prints timing summary table to stderr

Phase 4: Sync History Enrichment

Scope: Wire up sync_runs INSERT/UPDATE lifecycle, schema migration, enhanced sync-status.

Depends on: Phase 3 (needs Vec<StageTiming> to store in metrics_json).

Files touched:

  • migrations/014_sync_runs_enrichment.sql — new migration: add run_id, total_items_processed, total_errors columns + index
  • src/core/sync_run.rs — new file: SyncRunRecorder struct encapsulating INSERT on start, UPDATE on finish
  • src/cli/commands/sync.rs — create SyncRunRecorder before pipeline, finalize after
  • src/cli/commands/ingest.rs — same pattern for standalone ingest
  • src/cli/commands/sync_status.rs — enhance to show recent runs with metrics, parse metrics_json

Implementation steps:

  1. Create migration 014_sync_runs_enrichment.sql:
    ALTER TABLE sync_runs ADD COLUMN run_id TEXT;
    ALTER TABLE sync_runs ADD COLUMN total_items_processed INTEGER DEFAULT 0;
    ALTER TABLE sync_runs ADD COLUMN total_errors INTEGER DEFAULT 0;
    CREATE INDEX idx_sync_runs_run_id ON sync_runs(run_id);
    
    Note: Migration number 014 assumes no other migration is added before this phase. If concurrent work adds migration 014, renumber accordingly.
  2. Implement SyncRunRecorder:
    pub struct SyncRunRecorder { id: i64, conn: Connection }
    impl SyncRunRecorder {
        pub fn start(conn: &Connection, command: &str, run_id: &str) -> Result<Self>;
        pub fn succeed(self, metrics: &[StageTiming], total_items: usize, total_errors: usize) -> Result<()>;
        pub fn fail(self, error: &str, metrics: Option<&[StageTiming]>) -> Result<()>;
    }
    
  3. In run_sync(): create SyncRunRecorder::start() before pipeline, call .succeed() or .fail() after.
  4. In run_ingest(): same pattern.
  5. Enhance sync_status.rs:
    • Query last N runs (default 10) instead of just the last 1.
    • Parse metrics_json column to extract stage breakdown.
    • Show run_id, duration, item counts, error counts in both interactive and robot modes.
    • Add --run <run_id> flag to sync-status for single-run detail view.

Tests:

  • Unit: SyncRunRecorder::start inserts a row with status='running'
  • Unit: SyncRunRecorder::succeed updates status, sets finished_at, writes metrics_json
  • Unit: SyncRunRecorder::fail updates status, sets error, sets finished_at
  • Unit: Migration 014 applies cleanly on top of migration 013
  • Integration: lore sync creates a sync_runs row; lore sync-status displays it
  • Integration: lore --robot sync-status JSON includes runs array with stage breakdowns
  • Integration: Failed sync records error in sync_runs with partial metrics

Phase 5: Rate Limit + Retry Instrumentation

Scope: Enhanced logging in GitLab client, retry counters in stage timing.

Depends on: Phase 2 (spans for context), Phase 3 (StageTiming for counters).

Files touched:

  • src/gitlab/client.rs (or wherever the HTTP client with retry logic lives) — add structured fields to retry/rate-limit log events
  • src/core/metrics.rs — add rate_limit_hits and retries fields to StageTiming

Implementation steps:

  1. Find the retry/rate-limit handling code (likely in the GitLab HTTP client). Add structured tracing fields:
    info!(
        path = %request_path,
        attempt = attempt_number,
        retry_after_secs = retry_after,
        status_code = 429,
        "Rate limited, retrying"
    );
    
  2. Add rate_limit_hits: usize and retries: usize fields to StageTiming (with #[serde(skip_serializing_if = "is_zero")]).
  3. In MetricsLayer, count rate-limit and retry events within each span and include in StageTiming.
  4. In -v mode, the existing stderr layer already shows INFO+ events, so retry activity becomes visible automatically. No additional work needed beyond step 1.

Tests:

  • Unit: Rate-limit log events include all required structured fields
  • Unit: StageTiming serialization includes rate_limit_hits and retries when non-zero, omits when zero
  • Integration: Simulate 429 response, verify log line has path, attempt, retry_after_secs fields
  • Integration: After simulated retries, StageTiming counts match expected values

6. Acceptance Criteria

6.1 Phase 1: Verbosity Flags + Structured File Logging

Functional criteria:

  • lore sync writes JSON log lines to ~/.local/share/lore/logs/lore.YYYY-MM-DD.log with zero configuration.
  • lore -v sync shows DEBUG-level lore::* output on stderr; dependency output stays at WARN.
  • lore -vv sync shows DEBUG-level lore::* + INFO-level dependency output on stderr.
  • lore -vvv sync shows TRACE-level output for everything on stderr.
  • RUST_LOG=lore::gitlab=trace lore sync overrides -v flags for both stderr and file layers.
  • lore --log-format json sync emits JSON-formatted log lines on stderr (same schema as file layer).
  • Log files rotate daily (new file per calendar day).
  • Files matching lore.*.log older than retention_days are deleted on startup.
  • Existing behavior is unchanged when no new flags are passed (INFO on stderr, human-readable format).
  • --quiet suppresses non-error stderr output. -q + -v together: -q wins (stderr at WARN+).
  • --quiet does NOT affect file logging (file layer remains at DEBUG+).
  • lore doctor reports: log directory path, number of log files, total disk usage in bytes. Robot mode includes a logging field in the checks JSON.
  • File layer always logs at DEBUG+ for lore::* crate regardless of -v flags.

Test specifications:

  • test_verbosity_filter_construction: Given verbosity count 0/1/2/3, assert the resulting EnvFilter matches the expected directives table.
  • test_rust_log_overrides_verbose: Set RUST_LOG=lore=trace, pass -v (count=1), assert the filter uses TRACE (not DEBUG).
  • test_quiet_overrides_verbose: Pass -q and -v together, assert stderr filter is WARN+.
  • test_logging_config_defaults: Deserialize an empty {} JSON as LoggingConfig, assert retention_days=30, file_logging=true, log_dir=None.
  • test_logging_config_partial: Deserialize {"retention_days": 7}, assert file_logging=true default preserved.
  • test_log_retention_cleanup: Create temp dir with files named lore.2026-01-01.log through lore.2026-02-04.log. Run retention with retention_days=7. Assert files older than 7 days are deleted, recent files preserved.
  • test_log_retention_ignores_non_log_files: Create temp dir with lore.2026-01-01.log and other.txt. Run retention. Assert other.txt is NOT deleted.
  • test_json_log_output_format: Capture file layer output, parse each line as JSON, assert keys: timestamp, level, target, fields, span.
  • test_suspending_writer_dual_layer: Run a tracing event with both layers active and a progress bar. Assert no garbled output on stderr (no interleaved progress bar fragments in log lines).
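The directives table that test_verbosity_filter_construction checks against could be sketched as follows. This is illustrative only: the helper name and exact directive strings are assumptions, not the shipped implementation, and the real construction is bypassed entirely when RUST_LOG is set.

```rust
/// Map `-q` / repeated `-v` flags to an EnvFilter directive string
/// (sketch; the actual filter is built in subscriber setup).
fn stderr_directives(quiet: bool, verbose: u8) -> &'static str {
    if quiet {
        // -q wins over -v: warnings and errors only on stderr.
        return "warn";
    }
    match verbose {
        0 => "warn,lore=info",  // default: INFO for lore, WARN for deps
        1 => "warn,lore=debug", // -v: DEBUG for lore, deps stay at WARN
        2 => "info,lore=debug", // -vv: DEBUG for lore + INFO for deps
        _ => "trace",           // -vvv and beyond: TRACE for everything
    }
}
```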

6.2 Phase 2: Spans + Correlation IDs

Functional criteria:

  • Every log line within a sync run includes run_id in the JSON span context.
  • jq 'select(any(.spans[]?; .run_id == "a1b2c3d4"))' lore.2026-02-04.log extracts all lines belonging to run a1b2c3d4.
  • Nested spans produce a chain: log lines inside fetch_pages include both the fetch_pages span and the parent ingest_issues span in their span context.
  • run_id is an 8-character hex string (truncated UUIDv4).
  • Spans are visible in -vv stderr output as bracketed context.
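To illustrate the span chain these criteria describe, a log line emitted inside fetch_pages might look like the following (field values invented; the shape follows tracing-subscriber's JSON formatter, with spans ordered root to leaf):

```json
{
  "timestamp": "2026-02-04T08:12:09.120Z",
  "level": "DEBUG",
  "target": "lore::gitlab",
  "fields": { "message": "fetched page", "page": 3 },
  "span": { "name": "fetch_pages" },
  "spans": [
    { "name": "sync", "run_id": "a1b2c3d4" },
    { "name": "ingest_issues" },
    { "name": "fetch_pages" }
  ]
}
```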

Test specifications:

  • test_run_id_format: Generate 100 run_ids, assert each is 8 chars, all hex characters.
  • test_run_id_uniqueness: Generate 1000 run_ids, assert no duplicates.
  • test_span_context_in_json_logs: Run a mock sync with spans, capture JSON log output, parse and verify each line has spans array containing run_id.
  • test_nested_span_chain: Create parent span "sync" with child "ingest_issues" with child "fetch_pages". Emit a log event inside "fetch_pages". Assert the JSON log line's span chain includes all three span names.
  • test_span_elapsed_on_close: Create a span, sleep 10ms, close it. Verify the close event records elapsed_ms >= 10.
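The shape that test_run_id_format asserts can be captured in a small predicate (an illustrative helper, not the shipped code):

```rust
/// True when `s` looks like a run_id: exactly 8 ASCII hex characters,
/// i.e. the first group of a hyphenated UUIDv4 string.
fn is_valid_run_id(s: &str) -> bool {
    s.len() == 8 && s.bytes().all(|b| b.is_ascii_hexdigit())
}
```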

6.3 Phase 3: Performance Metrics Collection

Functional criteria:

  • lore --robot sync JSON includes meta.run_id (string) and meta.stages (array).
  • Each stage in meta.stages has: name, elapsed_ms, items_processed.
  • Top-level stages (ingest_issues, ingest_mrs, generate_docs, embed) have sub_stages arrays.
  • Sub-stages include project field when applicable.
  • lore sync (interactive) prints a timing summary table on stderr, gated behind IngestDisplay::show_text.
  • lore -q sync does NOT print the timing summary.
  • Zero-value fields (items_skipped: 0, errors: 0) are omitted from JSON output.
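A sketch of the robot-mode envelope these criteria imply (numbers and project names invented; note the absence of zero-valued fields like errors):

```json
{
  "meta": {
    "run_id": "a1b2c3d4",
    "elapsed_ms": 48210,
    "stages": [
      {
        "name": "ingest_issues",
        "elapsed_ms": 18400,
        "items_processed": 512,
        "sub_stages": [
          {
            "name": "fetch_pages",
            "project": "group/app",
            "elapsed_ms": 15100,
            "items_processed": 500
          }
        ]
      },
      { "name": "embed", "elapsed_ms": 9300, "items_processed": 512 }
    ]
  }
}
```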

Test specifications:

  • test_stage_timing_serialization: Create a StageTiming with sub_stages, serialize to JSON, assert structure matches PRD example.
  • test_stage_timing_zero_fields_omitted: Create StageTiming with errors: 0, serialize, assert no errors key in output.
  • test_metrics_layer_single_span: Create MetricsLayer, enter/exit one span with recorded fields, extract timings, assert one StageTiming entry.
  • test_metrics_layer_nested_spans: Create parent + child spans, extract timings, assert parent has child in sub_stages.
  • test_metrics_layer_parallel_spans: Create two sibling spans (simulating two projects), extract timings, assert both appear as sub_stages of parent.
  • test_sync_meta_includes_stages: Mock a sync pipeline, verify robot JSON output parses correctly with meta.stages.
  • test_timing_summary_format: Capture stderr during interactive sync, verify timing table format matches PRD example.

6.4 Phase 4: Sync History Enrichment

Functional criteria:

  • lore sync creates a row in sync_runs with status='running' at start, updated to 'succeeded'/'failed' at finish.
  • lore ingest issues also creates a sync_runs row.
  • sync_runs.run_id matches the run_id in log files and robot JSON.
  • sync_runs.metrics_json contains the serialized Vec<StageTiming>.
  • sync_runs.total_items_processed and total_errors are populated.
  • lore sync-status shows the last 10 runs with: run_id, timestamp, duration, item count, error count.
  • lore --robot sync-status JSON includes runs array with stages parsed from metrics_json.
  • Failed syncs record the error message and any partial metrics collected before failure.
  • Migration 014 applies cleanly and is idempotent (safe to re-run).

Test specifications:

  • test_sync_run_recorder_start: Call start(), query sync_runs, assert one row with status='running'.
  • test_sync_run_recorder_succeed: Call start() then succeed(), assert row has status='succeeded', finished_at set, metrics_json parseable.
  • test_sync_run_recorder_fail: Call start() then fail(), assert row has status='failed', error set.
  • test_sync_run_recorder_fail_with_partial_metrics: Call start(), collect some metrics, then fail(). Assert metrics_json contains partial data.
  • test_migration_014_applies: Apply all migrations 001-014 on a fresh DB. Assert sync_runs has run_id, total_items_processed, total_errors columns.
  • test_migration_014_idempotent: Apply migration 014 twice. Assert no error on second apply.
  • test_sync_status_shows_runs: Insert 3 sync_runs rows, run print_sync_status(), assert output includes all 3 with correct formatting.
  • test_sync_status_json_includes_stages: Insert a sync_runs row with metrics_json, run robot-mode sync-status, parse JSON, assert runs[0].stages is an array.
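A sketch of what migration 014 might contain, with column names taken from the criteria above. The exact DDL is up to the implementation; idempotency is assumed to come from the migration runner's applied-version tracking, since SQLite's ALTER TABLE ... ADD COLUMN is not itself re-runnable.

```sql
-- Migration 014 (sketch): enrich sync_runs for run tracking
ALTER TABLE sync_runs ADD COLUMN run_id TEXT;
ALTER TABLE sync_runs ADD COLUMN total_items_processed INTEGER NOT NULL DEFAULT 0;
ALTER TABLE sync_runs ADD COLUMN total_errors INTEGER NOT NULL DEFAULT 0;
CREATE INDEX IF NOT EXISTS idx_sync_runs_run_id ON sync_runs(run_id);
```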

6.5 Phase 5: Rate Limit + Retry Instrumentation

Functional criteria:

  • Rate-limit events (HTTP 429) log at INFO with structured fields: path, attempt, retry_after_secs, status_code.
  • Retry events (non-429 transient errors) log with: path, attempt, error.
  • StageTiming includes rate_limit_hits and retries counts (omitted when zero).
  • lore -v sync shows retry activity on stderr (visible because it's INFO+).
  • Rate limit counts are included in metrics_json stored in sync_runs.
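For illustration, a rate-limit event meeting these criteria might serialize as the following log line (path, timing values, and message text invented):

```json
{
  "timestamp": "2026-02-04T03:12:09.120Z",
  "level": "INFO",
  "target": "lore::gitlab",
  "fields": {
    "message": "rate limited, backing off",
    "path": "/projects/42/merge_requests",
    "attempt": 2,
    "retry_after_secs": 30,
    "status_code": 429
  },
  "spans": [ { "name": "sync", "run_id": "a1b2c3d4" } ]
}
```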

Test specifications:

  • test_rate_limit_log_fields: Simulate a 429 response, capture log output, parse JSON, assert fields: path, attempt, retry_after_secs, status_code.
  • test_retry_log_fields: Simulate a transient error + retry, capture log, assert fields: path, attempt, error.
  • test_stage_timing_rate_limit_counts: Simulate 3 rate-limit hits within a span, extract StageTiming, assert rate_limit_hits == 3.
  • test_stage_timing_retry_counts: Simulate 2 retries, extract StageTiming, assert retries == 2.
  • test_rate_limit_fields_omitted_when_zero: Create StageTiming with zero rate limits, serialize, assert no rate_limit_hits key.

7. Resolved Decisions

  1. Log format: Use tracing-subscriber's built-in JSON formatter (fmt::layer().json()). Zero custom code, battle-tested, and ecosystem tools (Grafana Loki, Datadog) already parse this format. The schema difference from our robot JSON envelope is cosmetic and not worth the maintenance burden of a custom formatter.

  2. Span recording: Always-on. lore is I/O-bound (GitLab API + SQLite), so the nanosecond-level overhead of span storage and chain lookup is unmeasurable against our millisecond-scale operations. Conditional recording would add subscriber construction complexity for zero practical benefit.

  3. Log file location: ~/.local/share/lore/logs/ (XDG data directory). Logs are NOT reproducible — you can generate new logs, but you cannot regenerate the exact diagnostic output from a past run. They are forensic artifacts that users would notice missing, so they belong in data, not cache.

  4. Retention: In scope for Phase 1. Startup cleanup: scan log directory, delete files matching lore.*.log older than retention_days (default 30). Simple, no background threads, no external dependencies. Runs before subscriber initialization so deleted file handles aren't held.

  5. Stage timing granularity: Per-project with nested sub-stages. When one project has 500 MRs and another has 3, knowing which one consumed the time budget is the difference between "sync was slow" and actionable diagnosis. The StageTiming struct includes an optional project field and a sub_stages: Vec<StageTiming> field for nesting.

  6. Stage timing collection mechanism: Materialized from tracing spans, not plumbed manually. A custom MetricsLayer in the subscriber stack records span enter/exit/record events and builds the StageTiming tree. This avoids threading a mutable collector through every function signature and makes spans the single source of truth for timing data. Phase 2 adds spans; Phase 3 adds the layer that reads them.

  7. run_id format: First 8 characters of Uuid::new_v4().to_string() (e.g., "a1b2c3d4"). The uuid crate (v1, v4 feature) is already a dependency. No new crate needed. 8 characters provide ~4 billion unique values — more than sufficient for local CLI invocations.

  8. File log level: Always DEBUG+ for lore::* crate, WARN+ for dependencies, regardless of -v flags. This ensures post-mortem data is always richer than what was shown on stderr. RUST_LOG overrides both layers when set.

  9. sync_runs lifecycle: The table exists (migration 001) but nothing writes to it. Phase 4 wires up the INSERT (on start) / UPDATE (on finish) lifecycle AND adds enrichment columns in a single migration. The existing metrics_json column stores the detailed Vec<StageTiming> array — no need for a separate stages_json column.

  10. JSON stderr via --log-format: A --log-format text|json global flag controls stderr log format. Default is text (human-readable). When json, stderr uses the same JSON formatter as the file layer, routed through SuspendingWriter for progress bar coordination. This enables lore sync 2>&1 | jq workflows without reading log files.
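Decision 4's startup sweep can be sketched in std-only Rust. The helper name and cutoff handling are assumptions: real code would derive the cutoff date from retention_days, while here it is passed in as a YYYY-MM-DD string, which sorts lexicographically and so compares correctly as a plain &str.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Delete files matching `lore.YYYY-MM-DD.log` whose date portion is
/// older than `cutoff` (inclusive keep). Non-matching files such as
/// `other.txt` are left untouched. Returns the deleted file names.
fn clean_old_logs(dir: &Path, cutoff: &str) -> io::Result<Vec<String>> {
    let mut deleted = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let name = entry.file_name().to_string_lossy().into_owned();
        // Only consider names shaped like `lore.<date>.log`.
        if let Some(date) = name
            .strip_prefix("lore.")
            .and_then(|rest| rest.strip_suffix(".log"))
        {
            if date.len() == 10 && date < cutoff {
                fs::remove_file(entry.path())?;
                deleted.push(name);
            }
        }
    }
    deleted.sort();
    Ok(deleted)
}
```

Running this before subscriber initialization, as decision 4 specifies, means the appender never holds a handle to a file being deleted.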


8. Phase Dependency Graph

Phase 1 (Subscriber + Flags)
    |
    v
Phase 2 (Spans + run_id)
    |
    +------+------+
    |             |
    v             v
Phase 3         Phase 5
(Metrics)       (Rate Limit Logging)
    |             |
    v             |
Phase 4           |
(Sync History) <--+

Parallelization opportunities:

  • Phase 1 must complete before anything else.
  • Phase 2 must complete before Phase 3 or Phase 5.
  • Phase 3 and Phase 5 can run in parallel (Phase 5 only needs spans from Phase 2, not MetricsLayer from Phase 3).
  • Phase 4 depends on Phase 3 (needs Vec<StageTiming> to store). Phase 5's rate_limit_hits/retries fields on StageTiming can be added to Phase 4's stored data after Phase 5 completes, or Phase 4 can store them as zero initially.

Agent assignment suggestion:

  • Agent A: Phase 1 → Phase 2 (sequential, foundational infrastructure)
  • Agent B: Phase 3 (after Phase 2 completes)
  • Agent C: Phase 5 (after Phase 2 completes, parallel with Phase 3)
  • Agent B or D: Phase 4 (after Phase 3 completes)

9. References

  • Gholamian, S. & Ward, P. (2021). "A Comprehensive Survey of Logging in Software." arXiv:2110.12489.
  • Duan, S. et al. (2025). "PDLogger: Automated Logging Framework for Practical Software Development." arXiv:2507.19951.
  • tokio-rs/tracing ecosystem: tracing, tracing-subscriber, tracing-appender.
  • GNU Coding Standards: Verbosity and diagnostic output conventions.
  • Rust CLI Working Group: Recommendations for error reporting and verbosity.