PRD: Observability Infrastructure for lore CLI
Status: Draft | Author: Taylor + Claude | Date: 2026-02-04
1. Problem Statement
lore currently has minimal observability. Logging is ephemeral (stderr only), there are no persistent log files, no performance metrics, no structured JSON log output, no verbosity controls beyond RUST_LOG, and no way to diagnose issues after the fact. When a sync fails at 3 AM in a cron job, or an embedding run takes 10x longer than usual, there is zero forensic data available.
Current State
| Capability | Status |
|---|---|
| Log destination | stderr only, ephemeral |
| Log persistence | None |
| Structured output | Human-readable fmt only |
| Verbosity control | RUST_LOG env var (no CLI flag) |
| Performance metrics | Ad-hoc Instant::now() in 2 commands |
| Timing in robot JSON | elapsed_ms in search and sync meta only |
| Spans / correlation | None |
| Log rotation | None |
| Per-stage timing | None |
| Rate limit / retry visibility | tracing::warn! only |
| Error aggregation | None |
| Historical comparison | None |
What's Already in Place (to build on)
- `tracing` (0.1) + `tracing-subscriber` (0.3) with the `env-filter` feature
- Registry-based subscriber initialized in `src/main.rs:44-58` with a single `fmt::layer()` using `SuspendingWriter`
- `SuspendingWriter` (`src/cli/progress.rs:25-73`) that coordinates log output with indicatif `MultiProgress` — buffers each log line, calls `MULTI.suspend()` on drop to clear progress bars before writing to stderr
- `IngestDisplay` struct (`src/cli/commands/ingest.rs:65-104`) controlling UI verbosity with three modes: `interactive()` / `silent()` / `progress_only()`
- Robot mode JSON envelope: `{ "ok": true, "data": {...}, "meta": {...} }` — used consistently in sync, search, sync-status, and doctor commands
- XDG-compliant data directory at `~/.local/share/lore/`
- `sync_runs` table (migration 001) with schema: `id`, `started_at`, `heartbeat_at`, `finished_at`, `status`, `command`, `error`, `metrics_json` — exists but is never written to (no INSERT anywhere in the codebase; `sync_status.rs` reads from it but always gets zero rows)
- `uuid` crate (v1, v4 feature) already a dependency
- Structured fields used in tracing calls (e.g., `info!(owner = %self.owner, ...)`)
- `EnvFilter` currently hardcoded: `lore=info` + `warn` default directives
- Global CLI flags in `src/cli/mod.rs:9-43`: `--config`, `--robot`, `-J`, `--color`, `--quiet` (all `global = true`)
2. Goals
Primary
- Post-mortem debugging: Any failed or slow run can be diagnosed after the fact from persistent, structured log files.
- Performance visibility: Every sync/ingest/embed/search operation reports granular stage-level timing, both to the terminal and to persistent storage.
- Ergonomic verbosity: Users and agents control log verbosity through CLI flags (`-v`, `-vv`, `-vvv`) without needing to know `RUST_LOG` syntax.
- Machine-parseable logs: A JSON log mode for piping into log aggregators (jq, Datadog, Loki, etc.).
- Agent-friendly metrics: Robot mode JSON output includes comprehensive timing breakdowns for every command, enabling automated monitoring.
Secondary
- Log rotation and retention: Log files don't grow unbounded; old logs are automatically cleaned up.
- Correlation IDs: Every sync run gets a unique ID that connects log lines, database records, and robot output.
- Rate limit and retry transparency: Every rate-limited request and retry is visible in logs with full context.
- Sync history with metrics: The `sync_runs` table is enriched with per-stage timing, item counts, and error counts so `lore sync-status` becomes a real dashboard.
Non-Goals
- External telemetry export (OpenTelemetry, Prometheus) -- out of scope for v1.
- Real-time log streaming / tailing UI.
- Alerting or notification systems.
- Distributed tracing across multiple lore instances.
3. Research Foundation
3.1 The Three Pillars of Observability
Academic and industry consensus (Gholamian & Ward 2021, "A Comprehensive Survey of Logging in Software") identifies three pillars:
- Logs -- Discrete events with context. The foundation.
- Metrics -- Numerical measurements over time (counters, gauges, histograms).
- Traces -- Causally ordered spans representing operations.
For a CLI tool (not a long-running service), the mapping is:
| Pillar | CLI Equivalent |
|---|---|
| Logs | Structured log files per invocation |
| Metrics | Per-stage timing, item counts, error counts stored in DB |
| Traces | Span hierarchy within a single invocation (sync -> ingest issues -> fetch page N -> sync discussions) |
3.2 Structured Logging Best Practices
From Duan et al. 2025 ("PDLogger: Automated Logging Framework for Practical Software Development") and industry practice:
- Always structured: JSON or key=value, never free-form prose in production logs.
- Contextual fields propagate: A sync_run_id set at the top level appears in every downstream log line.
- Levels have semantic meaning:
  - `ERROR`: Operation failed, requires attention.
  - `WARN`: Degraded behavior (rate limited, retry, skip).
  - `INFO`: Significant state transitions (stage start/complete, items processed).
  - `DEBUG`: Detailed operational data (page boundaries, individual API calls).
  - `TRACE`: Wire-level detail (request/response bodies, SQL queries).
3.3 CLI Verbosity Conventions
From the GNU Coding Standards, POSIX conventions, and modern Rust CLI tools (ripgrep, fd, cargo):
| Pattern | Meaning | Precedent |
|---|---|---|
| (default) | INFO for app, WARN for deps | cargo, rustc |
| `-q` / `--quiet` | Suppress non-error output | ripgrep, fd, cargo |
| `-v` | DEBUG for app | ripgrep, fd |
| `-vv` | DEBUG for app + deps | cargo |
| `-vvv` | TRACE for everything | cargo, curl |
| `RUST_LOG=...` | Fine-grained override | Universal in Rust |
The -v flag should feel familiar to anyone who has used cargo, curl, or ssh.
3.4 Log File Rotation
tracing-appender (from the tokio-rs/tracing ecosystem) provides:
- Daily rotation: New file per day, named `lore.2026-02-04.log`.
- Non-blocking writes: Dedicated writer thread, zero impact on main async runtime.
- Configurable retention: Delete files older than N days.
This is the canonical solution in the Rust tracing ecosystem and requires no custom code.
3.5 Performance Metrics for CLI Tools
Inspired by hyperfine's approach to benchmarking and cargo's --timings flag:
- Report wall-clock time per stage.
- Report item throughput (items/sec).
- Store historical runs for trend comparison.
- Present timing data in both human-readable and machine-readable formats.
4. Design
4.1 Architecture Overview
```
CLI Invocation
      |
      +---------------------+
      |                     |
Interactive Mode        Robot Mode
      |                     |
  stderr (human fmt)    stdout (JSON envelope)
  progress bars         { ok, data, meta: {
  colored output            elapsed_ms,
                            stages: [...],
                            run_id
                        }}
      |
Log Subscribers (layered)
      |
  +---+------+--------+
  |          |        |
stderr     file    (future:
(fmt)     (JSON)    OTLP)
```
4.2 Subscriber Stack
Replace the current single-layer subscriber with a layered registry. Each layer has its own filter:
```rust
registry()
    .with(stderr_layer.with_filter(stderr_filter)) // Human-readable, SuspendingWriter, -v controlled
    .with(file_layer.with_filter(file_filter))     // JSON, daily rotation, always DEBUG+
```
stderr layer: Same fmt::layer() as today with SuspendingWriter, but level controlled by -v flags. When --log-format json is passed, this layer switches to fmt::layer().json() (same JSON format as file layer, but still routed through SuspendingWriter for progress bar coordination).
file layer: Always-on JSON output to ~/.local/share/lore/logs/, daily rotation via tracing-appender. Uses its own EnvFilter set to lore=debug,warn regardless of -v flags, ensuring post-mortem data is always available. The file layer does NOT use SuspendingWriter — it writes to a file, not stderr, so progress bar coordination is unnecessary.
Filter architecture: Per-layer filtering (not a single shared EnvFilter) is required because the file layer must always be at DEBUG+ while stderr follows -v. tracing-subscriber's Layer::with_filter() method enables this.
RUST_LOG override: When RUST_LOG is set, it overrides BOTH layer filters. This is the expert escape hatch.
Current subscriber (src/main.rs:44-58):
```rust
tracing_subscriber::registry()
    .with(
        tracing_subscriber::fmt::layer()
            .with_target(false)
            .with_writer(lore::cli::progress::SuspendingWriter),
    )
    .with(
        EnvFilter::from_default_env()
            .add_directive("lore=info".parse().unwrap())
            .add_directive("warn".parse().unwrap()),
    )
    .init();
```
This will be replaced by the dual-layer setup. The SuspendingWriter integration and with_target(false) on the stderr layer remain unchanged.
4.3 Verbosity Levels
stderr layer (controlled by -v flags)
| Flags | App Level | Dep Level | Behavior |
|---|---|---|---|
| (none) | INFO | WARN | Default. Stage transitions, summaries. |
| `-q` | WARN | ERROR | Errors and warnings only. |
| `-v` | DEBUG | WARN | Detailed app behavior. API pages, skip reasons. |
| `-vv` | DEBUG | INFO | App + dependency detail. HTTP client, SQLite. |
| `-vvv` | TRACE | DEBUG | Everything. Wire-level detail. |
| `RUST_LOG=...` | (overrides all) | (overrides all) | Expert escape hatch. |
Precedence: RUST_LOG > -v flags > defaults. This matches cargo's behavior.
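The precedence chain can be sketched as a pure function over the flag values. A minimal sketch — the function name and the exact directive strings are illustrative, not the shipped implementation:

```rust
/// Map an optional RUST_LOG value, the -q flag, and the -v count to the
/// stderr filter directive string. Precedence: RUST_LOG > -q > -v > defaults.
/// (Illustrative sketch; the directive strings are assumptions.)
fn resolve_stderr_filter(rust_log: Option<&str>, quiet: bool, verbose: u8) -> String {
    if let Some(directives) = rust_log {
        return directives.to_string(); // expert escape hatch overrides everything
    }
    if quiet {
        return "lore=warn,error".to_string(); // -q wins over any -v count
    }
    match verbose {
        0 => "lore=info,warn".to_string(),  // default
        1 => "lore=debug,warn".to_string(), // -v
        2 => "lore=debug,info".to_string(), // -vv
        _ => "trace".to_string(),           // -vvv and beyond
    }
}
```

The resulting string would be fed to `EnvFilter` when building each layer.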
file layer (independent of -v flags)
| Condition | App Level | Dep Level |
|---|---|---|
| Always (default) | DEBUG | WARN |
| `RUST_LOG=...` set | (overrides) | (overrides) |
The file layer always captures DEBUG+ for the lore crate and WARN+ for dependencies. This ensures post-mortem data is available even when the user ran with default stderr verbosity. RUST_LOG overrides both layers when set.
New CLI flags
Add to the Cli struct (src/cli/mod.rs):
```rust
/// Increase log verbosity (-v, -vv, -vvv)
#[arg(short = 'v', long = "verbose", action = clap::ArgAction::Count, global = true)]
pub verbose: u8,

/// Log format for stderr output: text (default) or json
#[arg(long = "log-format", global = true, value_parser = ["text", "json"], default_value = "text")]
pub log_format: String,
```
The -v flag uses clap::ArgAction::Count to support -v, -vv, -vvv as a single flag with increasing count. The --log-format flag controls whether stderr emits human-readable or JSON-formatted log lines.
4.4 Structured Log File Output
Location: ~/.local/share/lore/logs/lore.YYYY-MM-DD.log
Format: One JSON object per line (JSONL), produced by tracing-subscriber's fmt::layer().json():
```json
{"timestamp":"2026-02-04T14:32:01.123Z","level":"INFO","target":"lore::ingestion","fields":{"message":"Discussion sync complete","project":"group/repo","issues_synced":42,"elapsed_ms":1234},"span":{"name":"ingest_issues","run_id":"a1b2c3"}}
```
Rotation: Daily via tracing-appender::rolling::daily().
Retention: Configurable, default 30 days. A logs.retention_days config field. Cleanup runs at startup (check directory, delete files older than N days).
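Because the file names embed an ISO date, the retention pass never needs to parse dates: `lore.YYYY-MM-DD.log` names sort lexicographically by day. A stdlib-only sketch of the selection logic (the helper name is hypothetical; the real pass would walk `log_dir` with `std::fs` and delete the matches):

```rust
/// Decide which log files a startup retention pass would delete.
/// ISO dates sort lexicographically, so comparing file names against a
/// synthesized cutoff name is enough — no date parsing required.
fn files_to_delete<'a>(names: &[&'a str], cutoff_date: &str) -> Vec<&'a str> {
    let cutoff_name = format!("lore.{}.log", cutoff_date);
    names
        .iter()
        .copied()
        .filter(|n| {
            n.starts_with("lore.")           // only our log files …
                && n.ends_with(".log")       // … never unrelated files
                && *n < cutoff_name.as_str() // strictly older than the cutoff day
        })
        .collect()
}
```

Non-matching files (e.g. `other.txt`) are never touched, matching the `test_log_retention_ignores_non_log_files` spec in Section 6.1.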
4.5 Tracing Spans
Introduce spans for causal correlation within a single invocation:
```
sync (run_id=uuid)
+-- ingest_issues
|   +-- fetch_pages (project="group/repo")
|   +-- sync_discussions (project="group/repo")
|   +-- fetch_resource_events (project="group/repo")
+-- ingest_mrs
|   +-- fetch_pages (project="group/repo")
|   +-- sync_discussions (project="group/repo")
+-- generate_docs
+-- embed
```
Each span records elapsed_ms on close. The run_id propagates to all child spans and log events, enabling jq '.span.run_id == "a1b2c3"' lore.2026-02-04.log to extract an entire run.
4.6 Performance Metrics
4.6.1 Per-Stage Timing
Every command collects a Vec<StageTiming>:
```rust
#[derive(Debug, Clone, Serialize)]
pub struct StageTiming {
    pub name: String,                 // "ingest_issues", "fetch_pages", etc.
    #[serde(skip_serializing_if = "Option::is_none")]
    pub project: Option<String>,      // Which project, if applicable
    pub elapsed_ms: u64,
    pub items_processed: usize,
    #[serde(skip_serializing_if = "is_zero")]
    pub items_skipped: usize,
    #[serde(skip_serializing_if = "is_zero")]
    pub errors: usize,
    #[serde(skip_serializing_if = "Vec::is_empty")]
    pub sub_stages: Vec<StageTiming>, // Nested child stages
}
```
Collection mechanism: Stage timing is materialized from tracing spans, not plumbed manually through function signatures. Phase 2 adds #[instrument] spans to each sync stage. Phase 3 adds a custom tracing-subscriber layer that records span enter/exit times and structured fields, then extracts the span tree into Vec<StageTiming> when the root span closes.
This means:
- No mutable timing collector threaded through `run_ingest` → `fetch_pages` → `sync_discussions`
- Spans are the single source of truth for timing
- `StageTiming` is a materialized view of the span tree
- The custom layer implements `on_close` to capture `elapsed` and `on_record` to capture structured fields like `items_processed`
Where to define: src/core/metrics.rs (new file — genuinely new functionality that doesn't fit in any existing file)
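Two small pieces the struct above assumes: the `is_zero` predicate named in its `skip_serializing_if` attributes, and a depth-first walk over the stage tree of the kind the timing summary needs. A dependency-free sketch (serde derives and some fields omitted for brevity; `flatten` is illustrative, not part of the spec):

```rust
/// Predicate for serde's skip_serializing_if on zero-valued counters.
fn is_zero(n: &usize) -> bool {
    *n == 0
}

/// Simplified StageTiming — serde attributes and optional fields omitted.
struct StageTiming {
    name: String,
    elapsed_ms: u64,
    items_processed: usize,
    sub_stages: Vec<StageTiming>,
}

/// Depth-first flatten of a stage tree into (indented label, elapsed_ms)
/// rows — the shape a human-readable timing summary consumes.
fn flatten(stages: &[StageTiming], depth: usize, out: &mut Vec<(String, u64)>) {
    for s in stages {
        out.push((format!("{}{}", "  ".repeat(depth), s.name), s.elapsed_ms));
        flatten(&s.sub_stages, depth + 1, out);
    }
}
```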
4.6.2 Robot JSON Meta Enhancement
Currently:
```json
{ "ok": true, "data": {...}, "meta": { "elapsed_ms": 1234 } }
```
Proposed:
```json
{
  "ok": true,
  "data": { ... },
  "meta": {
    "run_id": "a1b2c3d4",
    "elapsed_ms": 45230,
    "stages": [
      {
        "name": "ingest_issues",
        "elapsed_ms": 12340,
        "items_processed": 150,
        "items_skipped": 30,
        "errors": 0,
        "sub_stages": [
          { "name": "fetch_pages", "project": "group/repo", "elapsed_ms": 5200, "items_processed": 150 },
          { "name": "sync_discussions", "project": "group/repo", "elapsed_ms": 6800, "items_processed": 42, "items_skipped": 108 }
        ]
      },
      {
        "name": "ingest_mrs",
        "elapsed_ms": 18900,
        "items_processed": 85,
        "items_skipped": 12,
        "errors": 1
      },
      { "name": "generate_docs", "elapsed_ms": 8500, "items_processed": 235 },
      { "name": "embed", "elapsed_ms": 5490, "items_processed": 1024 }
    ]
  }
}
```
4.6.3 Sync History Enrichment
Prerequisite bug fix: The sync_runs table (migration 001) exists with columns id, started_at, heartbeat_at, finished_at, status, command, error, metrics_json — but no code ever writes to it. The sync_status.rs command reads from it but always gets zero rows. This must be fixed before enrichment.
Step 1: Wire up sync_runs lifecycle (prerequisite, in Phase 4)
Add INSERT/UPDATE calls to the sync and ingest command handlers:
```sql
-- On sync/ingest start:
INSERT INTO sync_runs (started_at, heartbeat_at, status, command)
VALUES (?now_ms, ?now_ms, 'running', ?command_name)
RETURNING id;

-- On sync/ingest success:
UPDATE sync_runs
SET finished_at = ?now_ms, status = 'succeeded', metrics_json = ?metrics
WHERE id = ?run_id;

-- On sync/ingest failure:
UPDATE sync_runs
SET finished_at = ?now_ms, status = 'failed', error = ?error_msg, metrics_json = ?metrics
WHERE id = ?run_id;
```
Where: Add a SyncRunRecorder helper in src/core/db.rs or src/core/sync_run.rs that encapsulates the INSERT/UPDATE lifecycle. Called from run_sync() in src/cli/commands/sync.rs and run_ingest() in src/cli/commands/ingest.rs.
Step 2: Schema migration (migration 014)
Add dedicated queryable columns alongside the existing metrics_json:
```sql
-- Migration 014: sync_runs enrichment for observability
ALTER TABLE sync_runs ADD COLUMN run_id TEXT;
ALTER TABLE sync_runs ADD COLUMN total_items_processed INTEGER DEFAULT 0;
ALTER TABLE sync_runs ADD COLUMN total_errors INTEGER DEFAULT 0;

-- Index for correlation queries
CREATE INDEX idx_sync_runs_run_id ON sync_runs(run_id);
```
The existing metrics_json column stores the detailed Vec<StageTiming> as a JSON array. No need for a separate stages_json column.
Step 3: Enhanced sync-status display
lore sync-status (src/cli/commands/sync_status.rs) currently shows only the last run. Enhance to show recent runs with metrics:
```
Recent sync runs:
  Run a1b2c3 | 2026-02-04 14:32 | 45.2s | 235 items | 1 error
  Run d4e5f6 | 2026-02-03 14:30 | 38.1s | 220 items | 0 errors
  Run g7h8i9 | 2026-02-02 14:29 | 42.7s | 228 items | 0 errors
```
Robot mode (lore --robot sync-status):
```json
{
  "ok": true,
  "data": {
    "runs": [
      {
        "run_id": "a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d",
        "started_at": "2026-02-04T14:32:01.123Z",
        "elapsed_ms": 45230,
        "status": "succeeded",
        "command": "sync",
        "total_items_processed": 235,
        "total_errors": 1,
        "stages": [...]
      }
    ],
    "cursors": [...],
    "summary": {...}
  }
}
```
The stages array is parsed from metrics_json and included in the robot output. Interactive mode shows the summary table above; lore --robot sync-status --run a1b2c3 shows a single run's full stage breakdown.
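The interactive history rows reduce to simple string formatting over the queried columns. A minimal sketch (the helper name and exact column layout are assumptions; the real command reads these values from `sync_runs`):

```rust
/// Format one sync-status history row, e.g.
/// "  Run a1b2c3 | 2026-02-04 14:32 | 45.2s | 235 items | 1 error"
fn status_row(run_id: &str, started: &str, elapsed_ms: u64, items: u32, errors: u32) -> String {
    // Pluralize "error" the way the PRD mock-up does ("1 error" / "0 errors").
    let noun = if errors == 1 { "error" } else { "errors" };
    format!(
        "  Run {} | {} | {:.1}s | {} items | {} {}",
        run_id,
        started,
        elapsed_ms as f64 / 1000.0, // ms -> seconds, one decimal place
        items,
        errors,
        noun
    )
}
```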
4.6.4 Human-Readable Timing
At the end of lore sync (interactive mode), print a timing summary:
```
Sync complete in 45.2s
  Ingest issues .... 12.3s (150 items, 42 discussions)
  Ingest MRs ....... 18.9s (85 items, 1 error)
  Generate docs .... 8.5s (235 documents)
  Embed ............ 5.5s (1024 chunks)
```
Gated behind display.show_text so it doesn't appear in progress_only or silent modes.
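The dotted alignment reduces to padding each label to a fixed column width. A stdlib-only sketch (the helper name and the 18-character width are inferred from the mock-up above, not from the code):

```rust
/// Format one row of the timing summary, dot-padded so the durations align,
/// e.g. "  Embed ............ 5.5s (1024 chunks)".
fn timing_row(label: &str, elapsed_ms: u64, detail: &str) -> String {
    const COL_WIDTH: usize = 18; // label + space + dots occupy this many chars
    let dots = ".".repeat(COL_WIDTH.saturating_sub(label.len() + 1));
    format!(
        "  {} {} {:.1}s ({})",
        label,
        dots,
        elapsed_ms as f64 / 1000.0, // ms -> seconds, one decimal
        detail
    )
}
```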
4.7 Rate Limit and Retry Transparency
Currently, rate limits emit a tracing::warn!. Enhance to:
- Log at INFO level (not just WARN) with structured fields: `info!(path, attempt, retry_after_secs, "Rate limited, retrying")`.
- Count total rate-limit hits per run and include in stage timing.
- In `-v` mode, show retry progress on stderr: `Retrying /api/v4/projects/123/issues (429, waiting 2s)`.
4.8 Configuration
Add a new logging section to Config (src/core/config.rs):
```rust
#[derive(Debug, Clone, Deserialize)]
pub struct LoggingConfig {
    /// Directory for log files. Default: ~/.local/share/lore/logs/
    #[serde(default)]
    pub log_dir: Option<String>,

    /// Days to retain log files. Default: 30. Set to 0 to disable file logging.
    #[serde(default = "default_retention_days")]
    pub retention_days: u32,

    /// Enable JSON log files. Default: true.
    #[serde(default = "default_true")]
    pub file_logging: bool,
}

fn default_retention_days() -> u32 { 30 }
fn default_true() -> bool { true }
```
Add to the Config struct:
```rust
#[serde(default)]
pub logging: LoggingConfig,
```
With config.json:
```json
{
  "logging": {
    "log_dir": null,
    "retention_days": 30,
    "file_logging": true
  }
}
```
Defaults are sane so existing configs continue working with zero changes.
CLI flags (added to Cli struct in src/cli/mod.rs):
| Flag | Type | Default | Description |
|---|---|---|---|
| `-v` / `--verbose` | count (u8) | 0 | Increase stderr log verbosity. Stacks: `-v`, `-vv`, `-vvv`. |
| `--log-format` | `text` \| `json` | `text` | Stderr log format. `json` emits one JSON object per log line (same schema as file layer). |
These are global flags (global = true) consistent with the existing --quiet, --robot, etc.
5. Implementation Plan
Phase 1: Verbosity Flags + Structured File Logging
Scope: CLI flags, dual-layer subscriber, file logging, rotation, retention, --log-format.
Files touched:
- `Cargo.toml` — add `tracing-appender` dependency
- `src/cli/mod.rs` — add `-v`/`--verbose` (count) and `--log-format` flags to `Cli` struct
- `src/main.rs` — replace subscriber initialization (lines 44-58) with dual-layer setup
- `src/core/config.rs` — add `LoggingConfig` struct and `logging` field to `Config`
- `src/core/paths.rs` — add `get_log_dir()` helper (XDG data dir + `/logs/`)
- `src/cli/commands/doctor.rs` — add log file location and disk usage check
Implementation steps:
1. Add `-v`/`--verbose` (count, `u8`) and `--log-format` (text|json) flags to the `Cli` struct.
2. Add `tracing-appender` dependency to `Cargo.toml`.
3. Add `LoggingConfig` to `Config` with `#[serde(default)]`.
4. Add `get_log_dir()` to `src/core/paths.rs` (mirrors the `get_db_path()` pattern).
5. Replace subscriber init in `main.rs`:
   - Build `stderr_filter` from the `-v` count (or `RUST_LOG` if set).
   - Build `file_filter` as `lore=debug,warn` (or `RUST_LOG` if set).
   - stderr layer: `fmt::layer().with_writer(SuspendingWriter)` with `stderr_filter`. When `--log-format json`, chain `.json()`.
   - file layer: `fmt::layer().json().with_writer(tracing_appender::rolling::daily(log_dir, "lore"))` with `file_filter`.
   - Combine via `registry().with(stderr_layer.with_filter(stderr_filter)).with(file_layer.with_filter(file_filter))`.
6. Implement log retention at startup: scan `log_dir`, delete files matching the `lore.*.log` pattern older than `retention_days`. Run before subscriber init so deleted files aren't held open.
7. Add log file check to `lore doctor`: report log directory path, number of log files, total disk usage. In robot mode, add a `logging` field to `DoctorChecks` with `log_dir`, `file_count`, `total_bytes`, `oldest_file`.
New dependencies: tracing-appender (0.2)
Interaction with -q/--quiet: The existing --quiet flag suppresses non-error terminal output via IngestDisplay::silent(). It should NOT affect file logging (file layer is always on). When -q and -v are both passed, -q wins for stderr (set stderr filter to WARN+). File layer remains at DEBUG+.
Tests (see Section 6.1 for details):
- Unit: `EnvFilter` construction from verbosity count (0→INFO, 1→DEBUG, 2→DEBUG+deps, 3→TRACE)
- Unit: `RUST_LOG` overrides `-v` flags
- Unit: `-q` + `-v` interaction (quiet wins)
- Unit: `LoggingConfig` deserialization with missing/partial/full fields
- Unit: Log retention deletes old files, preserves recent ones
- Integration: Subscriber produces JSON lines to a test file
- Integration: `SuspendingWriter` still works with dual-layer stack (no garbled output)
Phase 2: Spans + Correlation IDs
Scope: Tracing spans, UUID-based run_id, span recording for JSON logs.
Depends on: Phase 1 (subscriber must support span recording).
Files touched:
- `src/cli/commands/sync.rs` — add root span with `run_id` field to `run_sync()`
- `src/cli/commands/ingest.rs` — add `#[instrument]` spans to `run_ingest()` and its stages
- `src/ingestion/orchestrator.rs` — add spans for `fetch_pages`, `sync_discussions`, `fetch_resource_events`
- `src/documents/regenerator.rs` — add span for `generate_docs` stage
- `src/embedding/pipeline.rs` — add span for `embed` stage
- `src/main.rs` — generate `run_id` before calling command handler, pass as field
Implementation steps:
1. Generate `run_id` using `Uuid::new_v4().to_string()[..8]` (first 8 chars of a UUIDv4) at command entry in `main.rs`. No new dependency needed — `uuid` v1 with the v4 feature is already in `Cargo.toml`.
2. Create root span: `let _root = tracing::info_span!("sync", run_id = %run_id).entered();` (or equivalent for each command).
3. Add `#[instrument(skip_all, fields(stage = "ingest_issues"))]` to ingest stages.
4. Add `#[instrument(skip_all, fields(project = %project_path))]` to per-project functions.
5. Ensure the file layer's JSON formatter includes span context. `tracing-subscriber`'s `fmt::layer().json()` includes the current span chain by default when the registry has span storage enabled.
6. Verify: parse a log file, confirm every line includes `span.run_id`.
New dependencies: None (uuid already present).
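Step 1's truncation, and the 8-hex-char property Phase 2's tests assert, can be sketched without the `uuid` crate — here a literal UUIDv4 string stands in for `Uuid::new_v4().to_string()`, and both helper names are hypothetical:

```rust
/// Truncate a hyphenated UUID string to the 8-character run_id.
/// (`uuid_str` stands in for Uuid::new_v4().to_string().)
fn short_run_id(uuid_str: &str) -> String {
    uuid_str.chars().take(8).collect()
}

/// The property the Phase 2 tests check: exactly 8 chars, all lowercase hex.
/// (The uuid crate renders UUIDs as lowercase hex, so the first 8 chars of a
/// v4 UUID always satisfy this.)
fn is_valid_run_id(id: &str) -> bool {
    id.len() == 8
        && id
            .chars()
            .all(|c| c.is_ascii_hexdigit() && !c.is_ascii_uppercase())
}
```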
Tests:
- Unit: `run_id` is a valid 8-character hex string
- Integration: Run a sync-like operation with spans, parse JSON log output, verify every line contains `run_id` in span context
- Integration: Nested spans produce correct parent-child relationships in JSON output
Phase 3: Performance Metrics Collection
Scope: StageTiming struct, span-to-metrics extraction, robot JSON enrichment, timing summary.
Depends on: Phase 2 (spans must exist to extract timing from).
Files touched:
- `src/core/metrics.rs` — new file: `StageTiming` struct, `MetricsLayer` (custom tracing layer), span-to-timing extraction
- `src/cli/commands/sync.rs` — consume `Vec<StageTiming>` from `MetricsLayer`, include in `SyncMeta`
- `src/cli/commands/ingest.rs` — same pattern for standalone ingest
- `src/main.rs` — register `MetricsLayer` in the subscriber stack
Implementation steps:
1. Define the `StageTiming` struct with `sub_stages: Vec<StageTiming>` in `src/core/metrics.rs`.
2. Implement `MetricsLayer` as a custom `tracing_subscriber::Layer`:
   - `on_new_span`: Record span ID, name, parent, start time.
   - `on_record`: Capture structured fields (`items_processed`, `items_skipped`, `errors`) recorded via `Span::record()`.
   - `on_close`: Calculate `elapsed_ms`, build the `StageTiming` entry, attach to parent.
   - Provide `fn extract_timings(&self, run_id: &str) -> Vec<StageTiming>` to materialize the span tree after the root span closes.
3. Store a `MetricsLayer` reference (behind `Arc`) so command handlers can call `extract_timings()` after `run_sync()` completes.
4. Extend `SyncMeta` and `SyncJsonOutput` to include `run_id: String` and `stages: Vec<StageTiming>`.
5. Print the human-readable timing summary at the end of interactive sync (gated behind `IngestDisplay::show_text`).
Span field recording: Sync stages must record item counts as span fields for MetricsLayer to capture:
```rust
// Fields must be declared at span creation (as Empty) for record() to take
// effect — tracing silently drops record() calls for undeclared fields.
let span = tracing::info_span!(
    "ingest_issues",
    items_processed = tracing::field::Empty,
    items_skipped = tracing::field::Empty,
);
let _guard = span.enter();
// ... do work ...
span.record("items_processed", count);
span.record("items_skipped", skipped);
```
Tests:
- Unit: `StageTiming` serialization matches expected JSON (including nested `sub_stages`)
- Unit: `MetricsLayer` correctly builds span tree from synthetic span events
- Unit: `MetricsLayer` handles spans with no children (leaf stages like `embed`)
- Unit: `MetricsLayer` handles concurrent spans (multiple projects in parallel)
- Integration: `lore --robot sync` output includes `meta.stages` array with correct nesting
- Integration: Interactive sync prints timing summary table to stderr
Phase 4: Sync History Enrichment
Scope: Wire up sync_runs INSERT/UPDATE lifecycle, schema migration, enhanced sync-status.
Depends on: Phase 3 (needs Vec<StageTiming> to store in metrics_json).
Files touched:
- `migrations/014_sync_runs_enrichment.sql` — new migration: add `run_id`, `total_items_processed`, `total_errors` columns + index
- `src/core/sync_run.rs` — new file: `SyncRunRecorder` struct encapsulating INSERT on start, UPDATE on finish
- `src/cli/commands/sync.rs` — create `SyncRunRecorder` before pipeline, finalize after
- `src/cli/commands/ingest.rs` — same pattern for standalone ingest
- `src/cli/commands/sync_status.rs` — enhance to show recent runs with metrics, parse `metrics_json`
Implementation steps:
1. Create migration `014_sync_runs_enrichment.sql`:

   ```sql
   ALTER TABLE sync_runs ADD COLUMN run_id TEXT;
   ALTER TABLE sync_runs ADD COLUMN total_items_processed INTEGER DEFAULT 0;
   ALTER TABLE sync_runs ADD COLUMN total_errors INTEGER DEFAULT 0;
   CREATE INDEX idx_sync_runs_run_id ON sync_runs(run_id);
   ```

   Note: Migration number 014 assumes no other migration is added before this phase. If concurrent work adds migration 014, renumber accordingly.

2. Implement `SyncRunRecorder`:

   ```rust
   pub struct SyncRunRecorder { id: i64, conn: Connection }

   impl SyncRunRecorder {
       pub fn start(conn: &Connection, command: &str, run_id: &str) -> Result<Self>;
       pub fn succeed(self, metrics: &[StageTiming], total_items: usize, total_errors: usize) -> Result<()>;
       pub fn fail(self, error: &str, metrics: Option<&[StageTiming]>) -> Result<()>;
   }
   ```

3. In `run_sync()`: create `SyncRunRecorder::start()` before the pipeline, call `.succeed()` or `.fail()` after.
4. In `run_ingest()`: same pattern.
5. Enhance `sync_status.rs`:
   - Query the last N runs (default 10) instead of just the last 1.
   - Parse the `metrics_json` column to extract the stage breakdown.
   - Show `run_id`, duration, item counts, error counts in both interactive and robot modes.
   - Add a `--run <run_id>` flag to `sync-status` for a single-run detail view.
Tests:
- Unit: `SyncRunRecorder::start` inserts a row with status='running'
- Unit: `SyncRunRecorder::succeed` updates status, sets finished_at, writes metrics_json
- Unit: `SyncRunRecorder::fail` updates status, sets error, sets finished_at
- Unit: Migration 014 applies cleanly on top of migration 013
- Integration: `lore sync` creates a sync_runs row; `lore sync-status` displays it
- Integration: `lore --robot sync-status` JSON includes `runs` array with stage breakdowns
- Integration: Failed sync records error in sync_runs with partial metrics
Phase 5: Rate Limit + Retry Instrumentation
Scope: Enhanced logging in GitLab client, retry counters in stage timing.
Depends on: Phase 2 (spans for context), Phase 3 (StageTiming for counters).
Files touched:
- `src/gitlab/client.rs` (or wherever the HTTP client with retry logic lives) — add structured fields to retry/rate-limit log events
- `src/core/metrics.rs` — add `rate_limit_hits` and `retries` fields to `StageTiming`
Implementation steps:
1. Find the retry/rate-limit handling code (likely in the GitLab HTTP client). Add structured tracing fields:

   ```rust
   info!(
       path = %request_path,
       attempt = attempt_number,
       retry_after_secs = retry_after,
       status_code = 429,
       "Rate limited, retrying"
   );
   ```

2. Add `rate_limit_hits: usize` and `retries: usize` fields to `StageTiming` (with `#[serde(skip_serializing_if = "is_zero")]`).
3. In `MetricsLayer`, count rate-limit and retry events within each span and include them in `StageTiming`.
4. In `-v` mode, the existing stderr layer already shows INFO+ events, so retry activity becomes visible automatically. No additional work needed beyond step 1.
Tests:
- Unit: Rate-limit log events include all required structured fields
- Unit: `StageTiming` serialization includes `rate_limit_hits` and `retries` when non-zero, omits them when zero
- Integration: Simulate a 429 response, verify the log line has `path`, `attempt`, `retry_after_secs` fields
- Integration: After simulated retries, `StageTiming` counts match expected values
6. Acceptance Criteria
6.1 Phase 1: Verbosity Flags + Structured File Logging
Functional criteria:
- `lore sync` writes JSON log lines to `~/.local/share/lore/logs/lore.YYYY-MM-DD.log` with zero configuration.
- `lore -v sync` shows DEBUG-level `lore::*` output on stderr; dependency output stays at WARN.
- `lore -vv sync` shows DEBUG-level `lore::*` + INFO-level dependency output on stderr.
- `lore -vvv sync` shows TRACE-level output for everything on stderr.
- `RUST_LOG=lore::gitlab=trace lore sync` overrides `-v` flags for both stderr and file layers.
- `lore --log-format json sync` emits JSON-formatted log lines on stderr (same schema as file layer).
- Log files rotate daily (new file per calendar day).
- Files matching `lore.*.log` older than `retention_days` are deleted on startup.
- Existing behavior is unchanged when no new flags are passed (INFO on stderr, human-readable format).
- `--quiet` suppresses non-error stderr output.
- `-q` + `-v` together: `-q` wins (stderr at WARN+).
- `--quiet` does NOT affect file logging (file layer remains at DEBUG+).
- `lore doctor` reports: log directory path, number of log files, total disk usage in bytes. Robot mode includes a `logging` field in the checks JSON.
- File layer always logs at DEBUG+ for the `lore::*` crate regardless of `-v` flags.
Test specifications:
- `test_verbosity_filter_construction`: Given verbosity count 0/1/2/3, assert the resulting `EnvFilter` matches the expected directives table.
- `test_rust_log_overrides_verbose`: Set `RUST_LOG=lore=trace`, pass `-v` (count=1), assert the filter uses TRACE (not DEBUG).
- `test_quiet_overrides_verbose`: Pass `-q` and `-v` together, assert the stderr filter is WARN+.
- `test_logging_config_defaults`: Deserialize an empty `{}` JSON as `LoggingConfig`, assert `retention_days=30`, `file_logging=true`, `log_dir=None`.
- `test_logging_config_partial`: Deserialize `{"retention_days": 7}`, assert the `file_logging=true` default is preserved.
- `test_log_retention_cleanup`: Create a temp dir with files named `lore.2026-01-01.log` through `lore.2026-02-04.log`. Run retention with `retention_days=7`. Assert files older than 7 days are deleted, recent files preserved.
- `test_log_retention_ignores_non_log_files`: Create a temp dir with `lore.2026-01-01.log` and `other.txt`. Run retention. Assert `other.txt` is NOT deleted.
- `test_json_log_output_format`: Capture file layer output, parse each line as JSON, assert keys: `timestamp`, `level`, `target`, `fields`, `span`.
- `test_suspending_writer_dual_layer`: Run a tracing event with both layers active and a progress bar. Assert no garbled output on stderr (no interleaved progress bar fragments in log lines).
6.2 Phase 2: Spans + Correlation IDs
Functional criteria:
- Every log line within a sync run includes `run_id` in the JSON span context.
- `jq 'select(.spans[] | .run_id != null)' lore.2026-02-04.log` extracts all lines from a run.
- Nested spans produce a chain: log lines inside `fetch_pages` include both the `fetch_pages` span and the parent `ingest_issues` span in their span context.
- `run_id` is an 8-character hex string (truncated UUIDv4).
- Spans are visible in `-vv` stderr output as bracketed context.
Test specifications:
- `test_run_id_format`: Generate 100 run_ids, assert each is 8 chars, all hex characters.
- `test_run_id_uniqueness`: Generate 1000 run_ids, assert no duplicates.
- `test_span_context_in_json_logs`: Run a mock sync with spans, capture JSON log output, parse and verify each line has a `spans` array containing `run_id`.
- `test_nested_span_chain`: Create parent span "sync" with child "ingest_issues" with child "fetch_pages". Emit a log event inside "fetch_pages". Assert the JSON log line's span chain includes all three span names.
- `test_span_elapsed_on_close`: Create a span, sleep 10ms, close it. Verify the close event records `elapsed_ms >= 10`.
6.3 Phase 3: Performance Metrics Collection
Functional criteria:
- `lore --robot sync` JSON includes `meta.run_id` (string) and `meta.stages` (array).
- Each stage in `meta.stages` has: `name`, `elapsed_ms`, `items_processed`.
- Top-level stages (ingest_issues, ingest_mrs, generate_docs, embed) have `sub_stages` arrays.
- Sub-stages include a `project` field when applicable.
- `lore sync` (interactive) prints a timing summary table on stderr, gated behind `IngestDisplay::show_text`.
- `lore -q sync` does NOT print the timing summary.
- Zero-value fields (`items_skipped: 0`, `errors: 0`) are omitted from JSON output.
Test specifications:
- `test_stage_timing_serialization`: Create a `StageTiming` with sub_stages, serialize to JSON, assert the structure matches the PRD example.
- `test_stage_timing_zero_fields_omitted`: Create a `StageTiming` with `errors: 0`, serialize, assert no `errors` key in the output.
- `test_metrics_layer_single_span`: Create a `MetricsLayer`, enter/exit one span with recorded fields, extract timings, assert one `StageTiming` entry.
- `test_metrics_layer_nested_spans`: Create parent + child spans, extract timings, assert the parent has the child in `sub_stages`.
- `test_metrics_layer_parallel_spans`: Create two sibling spans (simulating two projects), extract timings, assert both appear as sub_stages of the parent.
- `test_sync_meta_includes_stages`: Mock a sync pipeline, verify the robot JSON output parses correctly with `meta.stages`.
- `test_timing_summary_format`: Capture stderr during an interactive sync, verify the timing table format matches the PRD example.
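Putting the Phase 3 criteria together, a `--robot sync` meta block might look like this (field names come from the criteria; the values and the project path are illustrative, not from a real run):

```json
{
  "meta": {
    "run_id": "a1b2c3d4",
    "elapsed_ms": 2200,
    "stages": [
      {
        "name": "ingest_issues",
        "elapsed_ms": 1423,
        "items_processed": 51,
        "sub_stages": [
          {
            "name": "fetch_pages",
            "project": "group/app",
            "elapsed_ms": 1102,
            "items_processed": 48
          }
        ]
      }
    ]
  }
}
```

Note that zero-value fields such as `errors` are absent entirely, per the omission rule above.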
6.4 Phase 4: Sync History Enrichment
Functional criteria:
- `lore sync` creates a row in `sync_runs` with status='running' at start, updated to 'succeeded'/'failed' at finish.
- `lore ingest issues` also creates a `sync_runs` row.
- `sync_runs.run_id` matches the `run_id` in log files and robot JSON.
- `sync_runs.metrics_json` contains the serialized `Vec<StageTiming>`.
- `sync_runs.total_items_processed` and `total_errors` are populated.
- `lore sync-status` shows the last 10 runs with: run_id, timestamp, duration, item count, error count.
- `lore --robot sync-status` JSON includes a `runs` array with `stages` parsed from `metrics_json`.
- Failed syncs record the error message and any partial metrics collected before failure.
- Migration 014 applies cleanly and is idempotent (safe to re-run).
Test specifications:
- `test_sync_run_recorder_start`: Call `start()`, query sync_runs, assert one row with status='running'.
- `test_sync_run_recorder_succeed`: Call `start()` then `succeed()`, assert the row has status='succeeded', finished_at set, metrics_json parseable.
- `test_sync_run_recorder_fail`: Call `start()` then `fail()`, assert the row has status='failed', error set.
- `test_sync_run_recorder_fail_with_partial_metrics`: Call `start()`, collect some metrics, then `fail()`. Assert metrics_json contains the partial data.
- `test_migration_014_applies`: Apply all migrations 001-014 on a fresh DB. Assert `sync_runs` has `run_id`, `total_items_processed`, `total_errors` columns.
- `test_migration_014_idempotent`: Apply migration 014 twice. Assert no error on the second apply.
- `test_sync_status_shows_runs`: Insert 3 sync_runs rows, run `print_sync_status()`, assert the output includes all 3 with correct formatting.
- `test_sync_status_json_includes_stages`: Insert a sync_runs row with metrics_json, run robot-mode sync-status, parse the JSON, assert `runs[0].stages` is an array.
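A sketch of what migration 014 could contain, given the columns the criteria above name (the column types are assumptions; idempotency would typically come from the migration runner's applied-versions tracking rather than from the DDL itself, since SQLite's `ALTER TABLE ... ADD COLUMN` has no `IF NOT EXISTS` form):

```sql
-- Hypothetical migration 014: enrich sync_runs for run tracking.
ALTER TABLE sync_runs ADD COLUMN run_id TEXT;
ALTER TABLE sync_runs ADD COLUMN total_items_processed INTEGER NOT NULL DEFAULT 0;
ALTER TABLE sync_runs ADD COLUMN total_errors INTEGER NOT NULL DEFAULT 0;
```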
6.5 Phase 5: Rate Limit + Retry Instrumentation
Functional criteria:
- Rate-limit events (HTTP 429) log at INFO with structured fields: `path`, `attempt`, `retry_after_secs`, `status_code`.
- Retry events (non-429 transient errors) log with: `path`, `attempt`, `error`.
- `StageTiming` includes `rate_limit_hits` and `retries` counts (omitted when zero).
- `lore -v sync` shows retry activity on stderr (visible because it's INFO+).
- Rate-limit counts are included in the `metrics_json` stored in `sync_runs`.
Test specifications:
- `test_rate_limit_log_fields`: Simulate a 429 response, capture log output, parse the JSON, assert fields: `path`, `attempt`, `retry_after_secs`, `status_code`.
- `test_retry_log_fields`: Simulate a transient error + retry, capture the log, assert fields: `path`, `attempt`, `error`.
- `test_stage_timing_rate_limit_counts`: Simulate 3 rate-limit hits within a span, extract the `StageTiming`, assert `rate_limit_hits == 3`.
- `test_stage_timing_retry_counts`: Simulate 2 retries, extract the `StageTiming`, assert `retries == 2`.
- `test_rate_limit_fields_omitted_when_zero`: Create a `StageTiming` with zero rate limits, serialize, assert no `rate_limit_hits` key.
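Shaped like the Phase 1 file-layer JSON (keys `timestamp`/`level`/`target`/`fields`/`span` from §6.1), a single rate-limit event would then look roughly like this — the message text, path, and values are illustrative:

```json
{
  "timestamp": "2026-02-04T03:12:45.120Z",
  "level": "INFO",
  "target": "lore::gitlab",
  "fields": {
    "message": "rate limited, backing off",
    "path": "/api/v4/projects/42/issues",
    "attempt": 2,
    "retry_after_secs": 30,
    "status_code": 429
  },
  "span": { "name": "fetch_pages", "run_id": "a1b2c3d4" }
}
```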
7. Resolved Decisions
- Log format: Use `tracing-subscriber`'s built-in JSON formatter (`fmt::layer().json()`). Zero custom code, battle-tested, and ecosystem tools (Grafana Loki, Datadog) already parse this format. The schema difference from our robot JSON envelope is cosmetic and not worth the maintenance burden of a custom formatter.
- Span recording: Always-on. lore is I/O-bound (GitLab API + SQLite), so the nanosecond-level overhead of span storage and chain lookup is unmeasurable against our millisecond-scale operations. Conditional recording would add subscriber-construction complexity for zero practical benefit.
- Log file location: `~/.local/share/lore/logs/` (XDG data directory). Logs are NOT reproducible — you can generate new logs, but you cannot regenerate the exact diagnostic output from a past run. They are forensic artifacts that users would notice missing, so they belong in data, not cache.
- Retention: In scope for Phase 1. Startup cleanup: scan the log directory, delete files matching `lore.*.log` older than `retention_days` (default 30). Simple, no background threads, no external dependencies. Runs before subscriber initialization so deleted file handles aren't held.
- Stage timing granularity: Per-project with nested sub-stages. When one project has 500 MRs and another has 3, knowing which one consumed the time budget is the difference between "sync was slow" and an actionable diagnosis. The `StageTiming` struct includes an optional `project` field and a `sub_stages: Vec<StageTiming>` field for nesting.
- Stage timing collection mechanism: Materialized from tracing spans, not plumbed manually. A custom `MetricsLayer` in the subscriber stack records span enter/exit/record events and builds the `StageTiming` tree. This avoids threading a mutable collector through every function signature and makes spans the single source of truth for timing data. Phase 2 adds spans; Phase 3 adds the layer that reads them.
- run_id format: First 8 characters of `Uuid::new_v4().to_string()` (e.g., `"a1b2c3d4"`). The `uuid` crate (v1, v4 feature) is already a dependency; no new crate needed. 8 characters provide ~4 billion unique values — more than sufficient for local CLI invocations.
- File log level: Always DEBUG+ for the `lore::*` crate, WARN+ for dependencies, regardless of `-v` flags. This ensures post-mortem data is always richer than what was shown on stderr. `RUST_LOG` overrides both layers when set.
- sync_runs lifecycle: The table exists (migration 001) but nothing writes to it. Phase 4 wires up the INSERT (on start) / UPDATE (on finish) lifecycle AND adds the enrichment columns in a single migration. The existing `metrics_json` column stores the detailed `Vec<StageTiming>` array — no need for a separate `stages_json` column.
- JSON stderr via `--log-format`: A `--log-format text|json` global flag controls the stderr log format. Default is `text` (human-readable). When `json`, stderr uses the same JSON formatter as the file layer, routed through `SuspendingWriter` for progress-bar coordination. This enables `lore sync 2>&1 | jq` workflows without reading log files.
8. Phase Dependency Graph
```
Phase 1 (Subscriber + Flags)
              |
              v
Phase 2 (Spans + run_id)
              |
      +-------+-------+
      |               |
      v               v
   Phase 3         Phase 5
  (Metrics)   (Rate Limit Logging)
      |               |
      v               |
   Phase 4            |
(Sync History) <------+
```
Parallelization opportunities:
- Phase 1 must complete before anything else.
- Phase 2 must complete before Phase 3 or Phase 5.
- Phase 3 and Phase 5 can run in parallel (Phase 5 only needs spans from Phase 2, not MetricsLayer from Phase 3).
- Phase 4 depends on Phase 3 (needs `Vec<StageTiming>` to store). Phase 5's `rate_limit_hits`/`retries` fields on `StageTiming` can be added to Phase 4's stored data after Phase 5 completes, or Phase 4 can store them as zero initially.
Agent assignment suggestion:
- Agent A: Phase 1 → Phase 2 (sequential, foundational infrastructure)
- Agent B: Phase 3 (after Phase 2 completes)
- Agent C: Phase 5 (after Phase 2 completes, parallel with Phase 3)
- Agent B or D: Phase 4 (after Phase 3 completes)
9. References
- Gholamian, S. & Ward, P. (2021). "A Comprehensive Survey of Logging in Software." arXiv:2110.12489.
- Duan, S. et al. (2025). "PDLogger: Automated Logging Framework for Practical Software Development." arXiv:2507.19951.
- tokio-rs/tracing ecosystem: `tracing`, `tracing-subscriber`, `tracing-appender`.
- GNU Coding Standards: verbosity and diagnostic output conventions.
- Rust CLI Working Group: recommendations for error reporting and verbosity.