docs: Add observability PRD and sync pipeline explorer visualization

- prd-observability.md: Product requirements document for the sync pipeline
  observability system, covering structured logging, metrics collection,
  sync run tracking, and robot-mode performance output
- gitlore-sync-explorer.html: Self-contained interactive HTML visualization
  for exploring sync pipeline stage timings and data flow

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Author: teernisse
Date: 2026-02-04 10:01:28 -05:00
Parent: 86a51cddef
Commit: 38da7ca47b
2 changed files with 1710 additions and 0 deletions

# PRD: Observability Infrastructure for lore CLI
**Status:** Draft
**Author:** Taylor + Claude
**Date:** 2026-02-04
---
## 1. Problem Statement
lore currently has minimal observability. Logging is ephemeral (stderr only), there are no persistent log files, no performance metrics, no structured JSON log output, no verbosity controls beyond `RUST_LOG`, and no way to diagnose issues after the fact. When a sync fails at 3 AM in a cron job, or an embedding run takes 10x longer than usual, there is zero forensic data available.
### Current State
| Capability | Status |
|---|---|
| Log destination | stderr only, ephemeral |
| Log persistence | None |
| Structured output | Human-readable fmt only |
| Verbosity control | `RUST_LOG` env var (no CLI flag) |
| Performance metrics | Ad-hoc `Instant::now()` in 2 commands |
| Timing in robot JSON | `elapsed_ms` in search and sync `meta` only |
| Spans / correlation | None |
| Log rotation | None |
| Per-stage timing | None |
| Rate limit / retry visibility | `tracing::warn!` only |
| Error aggregation | None |
| Historical comparison | None |
### What's Already in Place (to build on)
- `tracing` (0.1) + `tracing-subscriber` (0.3) with `env-filter` feature
- Registry-based subscriber initialized in `src/main.rs:44-58` with a single `fmt::layer()` using `SuspendingWriter`
- `SuspendingWriter` (`src/cli/progress.rs:25-73`) that coordinates log output with indicatif `MultiProgress` — buffers each log line, calls `MULTI.suspend()` on drop to clear progress bars before writing to stderr
- `IngestDisplay` struct (`src/cli/commands/ingest.rs:65-104`) controlling UI verbosity with three modes: `interactive()` / `silent()` / `progress_only()`
- Robot mode JSON envelope: `{ "ok": true, "data": {...}, "meta": {...} }` — used consistently in sync, search, sync-status, and doctor commands
- XDG-compliant data directory at `~/.local/share/lore/`
- `sync_runs` table (migration 001) with schema: `id`, `started_at`, `heartbeat_at`, `finished_at`, `status`, `command`, `error`, `metrics_json` — **exists but is never written to** (no INSERT anywhere in the codebase; `sync_status.rs` reads from it but always gets zero rows)
- `uuid` crate (v1, v4 feature) already a dependency
- Structured fields used in tracing calls (e.g., `info!(owner = %self.owner, ...)`)
- `EnvFilter` currently hardcoded: `lore=info` + `warn` default directives
- Global CLI flags in `src/cli/mod.rs:9-43`: `--config`, `--robot`, `-J`, `--color`, `--quiet` (all `global = true`)
---
## 2. Goals
### Primary
1. **Post-mortem debugging**: Any failed or slow run can be diagnosed after the fact from persistent, structured log files.
2. **Performance visibility**: Every sync/ingest/embed/search operation reports granular stage-level timing, both to the terminal and to persistent storage.
3. **Ergonomic verbosity**: Users and agents control log verbosity through CLI flags (`-v`, `-vv`, `-vvv`) without needing to know `RUST_LOG` syntax.
4. **Machine-parseable logs**: A JSON log mode for piping into log aggregators (jq, Datadog, Loki, etc.).
5. **Agent-friendly metrics**: Robot mode JSON output includes comprehensive timing breakdowns for every command, enabling automated monitoring.
### Secondary
6. **Log rotation and retention**: Log files don't grow unbounded; old logs are automatically cleaned up.
7. **Correlation IDs**: Every sync run gets a unique ID that connects log lines, database records, and robot output.
8. **Rate limit and retry transparency**: Every rate-limited request and retry is visible in logs with full context.
9. **Sync history with metrics**: The `sync_runs` table is enriched with per-stage timing, item counts, and error counts so `lore sync-status` becomes a real dashboard.
### Non-Goals
- External telemetry export (OpenTelemetry, Prometheus) -- out of scope for v1.
- Real-time log streaming / tailing UI.
- Alerting or notification systems.
- Distributed tracing across multiple lore instances.
---
## 3. Research Foundation
### 3.1 The Three Pillars of Observability
Academic and industry consensus (Gholamian & Ward 2021, "A Comprehensive Survey of Logging in Software") identifies three pillars:
1. **Logs** -- Discrete events with context. The foundation.
2. **Metrics** -- Numerical measurements over time (counters, gauges, histograms).
3. **Traces** -- Causally ordered spans representing operations.
For a CLI tool (not a long-running service), the mapping is:
| Pillar | CLI Equivalent |
|---|---|
| Logs | Structured log files per invocation |
| Metrics | Per-stage timing, item counts, error counts stored in DB |
| Traces | Span hierarchy within a single invocation (sync -> ingest issues -> fetch page N -> sync discussions) |
### 3.2 Structured Logging Best Practices
From Duan et al. 2025 ("PDLogger: Automated Logging Framework for Practical Software Development") and industry practice:
- **Always structured**: JSON or key=value, never free-form prose in production logs.
- **Contextual fields propagate**: A sync_run_id set at the top level appears in every downstream log line.
- **Levels have semantic meaning**:
- `ERROR`: Operation failed, requires attention.
- `WARN`: Degraded behavior (rate limited, retry, skip).
- `INFO`: Significant state transitions (stage start/complete, items processed).
- `DEBUG`: Detailed operational data (page boundaries, individual API calls).
- `TRACE`: Wire-level detail (request/response bodies, SQL queries).
### 3.3 CLI Verbosity Conventions
From the GNU Coding Standards, POSIX conventions, and modern Rust CLI tools (ripgrep, fd, cargo):
| Pattern | Meaning | Precedent |
|---|---|---|
| (default) | INFO for app, WARN for deps | cargo, rustc |
| `-q` / `--quiet` | Suppress non-error output | ripgrep, fd, cargo |
| `-v` | DEBUG for app | ripgrep, fd |
| `-vv` | DEBUG for app + deps | cargo |
| `-vvv` | TRACE for everything | cargo, curl |
| `RUST_LOG=...` | Fine-grained override | Universal in Rust |
The `-v` flag should feel familiar to anyone who has used cargo, curl, or ssh.
### 3.4 Log File Rotation
`tracing-appender` (from the tokio-rs/tracing ecosystem) provides:
- **Daily rotation**: New file per day, named `lore.2026-02-04.log`.
- **Non-blocking writes**: Dedicated writer thread, zero impact on main async runtime.
- **Configurable retention**: Delete files older than N days.
This is the canonical solution in the Rust tracing ecosystem and requires no custom code.
### 3.5 Performance Metrics for CLI Tools
Inspired by hyperfine's approach to benchmarking and cargo's `--timings` flag:
- Report wall-clock time per stage.
- Report item throughput (items/sec).
- Store historical runs for trend comparison.
- Present timing data in both human-readable and machine-readable formats.
---
## 4. Design
### 4.1 Architecture Overview
```
                 CLI Invocation
                       |
            +----------+----------+
            |                     |
    Interactive Mode         Robot Mode
            |                     |
   +-- stderr (human fmt)    stdout (JSON envelope)
   |        |                     |
   |   progress bars         { ok, data, meta: {
   |   colored output            elapsed_ms,
   |                             stages: [...],
   |                             run_id
   |                         }}
   |
   Log Subscribers (layered)
            |
       +----+----+--------+
       |         |        |
    stderr     file    (future:
    (fmt)     (JSON)    OTLP)
```
### 4.2 Subscriber Stack
Replace the current single-layer subscriber with a layered registry. Each layer has its own filter:
```
registry()
    .with(stderr_layer.with_filter(stderr_filter))  // Human-readable, SuspendingWriter, -v controlled
    .with(file_layer.with_filter(file_filter))      // JSON, daily rotation, always DEBUG+
```
**stderr layer**: Same `fmt::layer()` as today with `SuspendingWriter`, but level controlled by `-v` flags. When `--log-format json` is passed, this layer switches to `fmt::layer().json()` (same JSON format as file layer, but still routed through `SuspendingWriter` for progress bar coordination).
**file layer**: Always-on JSON output to `~/.local/share/lore/logs/`, daily rotation via `tracing-appender`. Uses its own `EnvFilter` set to `lore=debug,warn` regardless of `-v` flags, ensuring post-mortem data is always available. The file layer does NOT use `SuspendingWriter` — it writes to a file, not stderr, so progress bar coordination is unnecessary.
**Filter architecture**: Per-layer filtering (not a single shared `EnvFilter`) is required because the file layer must always be at DEBUG+ while stderr follows `-v`. `tracing-subscriber`'s `Layer::with_filter()` method enables this.
**`RUST_LOG` override**: When `RUST_LOG` is set, it overrides BOTH layer filters. This is the expert escape hatch.
**Current subscriber** (`src/main.rs:44-58`):
```rust
tracing_subscriber::registry()
    .with(
        tracing_subscriber::fmt::layer()
            .with_target(false)
            .with_writer(lore::cli::progress::SuspendingWriter),
    )
    .with(
        EnvFilter::from_default_env()
            .add_directive("lore=info".parse().unwrap())
            .add_directive("warn".parse().unwrap()),
    )
    .init();
```
This will be replaced by the dual-layer setup. The `SuspendingWriter` integration and `with_target(false)` on the stderr layer remain unchanged.
### 4.3 Verbosity Levels
#### stderr layer (controlled by `-v` flags)
| Flags | App Level | Dep Level | Behavior |
|---|---|---|---|
| (none) | INFO | WARN | Default. Stage transitions, summaries. |
| `-q` | WARN | ERROR | Errors and warnings only. |
| `-v` | DEBUG | WARN | Detailed app behavior. API pages, skip reasons. |
| `-vv` | DEBUG | INFO | App + dependency detail. HTTP client, SQLite. |
| `-vvv` | TRACE | DEBUG | Everything. Wire-level detail. |
| `RUST_LOG=...` | (overrides all) | (overrides all) | Expert escape hatch. |
Precedence: `RUST_LOG` > `-v` flags > defaults. This matches cargo's behavior.
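As a sketch of how the `-v` count and `--quiet` flag might translate into filter directives (hypothetical helper name; the real code would feed the resulting string into `EnvFilter::try_new` and let `RUST_LOG` take precedence when set):

```rust
/// Map the -v count and --quiet flag to EnvFilter-style directive strings,
/// following the verbosity table above. Hypothetical helper; the exact
/// directive strings are illustrative.
fn stderr_directives(verbose: u8, quiet: bool) -> String {
    if quiet {
        // -q wins over -v: app at WARN, dependencies at ERROR
        return "lore=warn,error".to_string();
    }
    match verbose {
        0 => "lore=info,warn".to_string(),  // default
        1 => "lore=debug,warn".to_string(), // -v
        2 => "lore=debug,info".to_string(), // -vv
        _ => "lore=trace,debug".to_string(),// -vvv and beyond
    }
}
```

The file layer would bypass this helper entirely and always use `lore=debug,warn`.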
#### file layer (independent of `-v` flags)
| Condition | App Level | Dep Level |
|---|---|---|
| Always (default) | DEBUG | WARN |
| `RUST_LOG=...` set | (overrides) | (overrides) |
The file layer always captures DEBUG+ for the `lore` crate and WARN+ for dependencies. This ensures post-mortem data is available even when the user ran with default stderr verbosity. `RUST_LOG` overrides both layers when set.
#### New CLI flags
Add to the `Cli` struct (`src/cli/mod.rs`):
```rust
/// Increase log verbosity (-v, -vv, -vvv)
#[arg(short = 'v', long = "verbose", action = clap::ArgAction::Count, global = true)]
pub verbose: u8,
/// Log format for stderr output: text (default) or json
#[arg(long = "log-format", global = true, value_parser = ["text", "json"], default_value = "text")]
pub log_format: String,
```
The `-v` flag uses `clap::ArgAction::Count` to support `-v`, `-vv`, `-vvv` as a single flag with increasing count. The `--log-format` flag controls whether stderr emits human-readable or JSON-formatted log lines.
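For illustration of the `ArgAction::Count` semantics only (clap implements this itself; the helper below is a hypothetical stand-in, not proposed code), each `-v`, including the stacked `-vv` and `-vvv` forms, bumps the count by one:

```rust
/// Hypothetical stand-in showing the counting behavior clap provides via
/// ArgAction::Count: every 'v' in a short-flag cluster adds one.
fn count_verbose(args: &[&str]) -> u8 {
    args.iter()
        .map(|a| match *a {
            "--verbose" => 1,
            s if s.starts_with('-') && !s.starts_with("--") => {
                // A short-flag cluster like -vv contributes one per 'v'.
                s.chars().filter(|&c| c == 'v').count() as u8
            }
            _ => 0,
        })
        .sum()
}
```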
### 4.4 Structured Log File Output
**Location**: `~/.local/share/lore/logs/lore.YYYY-MM-DD.log`
**Format**: One JSON object per line (JSONL), produced by `tracing-subscriber`'s `fmt::layer().json()`:
```json
{"timestamp":"2026-02-04T14:32:01.123Z","level":"INFO","target":"lore::ingestion","fields":{"message":"Discussion sync complete","project":"group/repo","issues_synced":42,"elapsed_ms":1234},"span":{"name":"ingest_issues","run_id":"a1b2c3"}}
```
**Rotation**: Daily via `tracing-appender::rolling::daily()`.
**Retention**: Configurable via `logging.retention_days`, default 30 days. Cleanup runs at startup (scan the log directory, delete files older than N days).
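The retention sweep can avoid a date-parsing dependency entirely: ISO `YYYY-MM-DD` dates compare chronologically as plain strings. A minimal sketch, assuming the `lore.YYYY-MM-DD.log` naming above and a precomputed cutoff date (helper name is illustrative):

```rust
/// Decide whether a log file is past retention, given a cutoff date in
/// YYYY-MM-DD form. Lexicographic comparison of ISO-8601 dates is
/// chronological, so no date-parsing crate is needed.
fn is_expired(filename: &str, cutoff: &str) -> bool {
    // Only consider files matching the lore.YYYY-MM-DD.log pattern;
    // anything else (e.g. other.txt) is left alone.
    let date = match filename
        .strip_prefix("lore.")
        .and_then(|rest| rest.strip_suffix(".log"))
    {
        Some(d) if d.len() == 10 => d,
        _ => return false,
    };
    date < cutoff
}
```

The startup code would compute the cutoff from the current date minus `retention_days`, then delete every file for which `is_expired` returns true.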
### 4.5 Tracing Spans
Introduce spans for causal correlation within a single invocation:
```
sync (run_id=uuid)
+-- ingest_issues
|   +-- fetch_pages (project="group/repo")
|   +-- sync_discussions (project="group/repo")
|   +-- fetch_resource_events (project="group/repo")
+-- ingest_mrs
|   +-- fetch_pages (project="group/repo")
|   +-- sync_discussions (project="group/repo")
+-- generate_docs
+-- embed
```
Each span records `elapsed_ms` on close. The `run_id` propagates to all child spans and log events, so `jq 'select(any(.spans[]?; .run_id == "a1b2c3"))' lore.2026-02-04.log` extracts every line belonging to a run.
### 4.6 Performance Metrics
#### 4.6.1 Per-Stage Timing
Every command collects a `Vec<StageTiming>`:
```rust
#[derive(Debug, Clone, Serialize)]
pub struct StageTiming {
    pub name: String,                 // "ingest_issues", "fetch_pages", etc.
    #[serde(skip_serializing_if = "Option::is_none")]
    pub project: Option<String>,      // Which project, if applicable
    pub elapsed_ms: u64,
    pub items_processed: usize,
    #[serde(skip_serializing_if = "is_zero")]
    pub items_skipped: usize,
    #[serde(skip_serializing_if = "is_zero")]
    pub errors: usize,
    #[serde(skip_serializing_if = "Vec::is_empty")]
    pub sub_stages: Vec<StageTiming>, // Nested child stages
}
```
**Collection mechanism**: Stage timing is materialized from tracing spans, not plumbed manually through function signatures. Phase 2 adds `#[instrument]` spans to each sync stage. Phase 3 adds a custom `tracing-subscriber` layer that records span enter/exit times and structured fields, then extracts the span tree into `Vec<StageTiming>` when the root span closes.
This means:
- No mutable timing collector threaded through `run_ingest` -> `fetch_pages` -> `sync_discussions`
- Spans are the single source of truth for timing
- `StageTiming` is a materialized view of the span tree
- The custom layer implements `on_close` to capture `elapsed` and `on_record` to capture structured fields like `items_processed`
**Where to define**: `src/core/metrics.rs` (new file — genuinely new functionality that doesn't fit in any existing file)
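A minimal sketch of the materialization step, with `ClosedSpan` standing in for whatever record the custom layer keeps per closed span (names and fields here are illustrative, not the proposed API, and only a subset of the `StageTiming` fields is modeled):

```rust
/// Simplified record of a closed span, as the custom MetricsLayer might
/// keep it after on_close fires.
struct ClosedSpan {
    name: &'static str,
    parent: Option<usize>, // index of the parent record; None = top level
    elapsed_ms: u64,
    items_processed: usize,
}

/// Materialized view of the span tree (subset of StageTiming fields).
#[derive(Debug)]
struct StageTiming {
    name: String,
    elapsed_ms: u64,
    items_processed: usize,
    sub_stages: Vec<StageTiming>,
}

/// Build the nested timing tree from the flat list of closed spans by
/// recursively collecting each record's children.
fn build(records: &[ClosedSpan], parent: Option<usize>) -> Vec<StageTiming> {
    records
        .iter()
        .enumerate()
        .filter(|(_, r)| r.parent == parent)
        .map(|(i, r)| StageTiming {
            name: r.name.to_string(),
            elapsed_ms: r.elapsed_ms,
            items_processed: r.items_processed,
            sub_stages: build(records, Some(i)),
        })
        .collect()
}
```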
#### 4.6.2 Robot JSON Meta Enhancement
Currently:
```json
{ "ok": true, "data": {...}, "meta": { "elapsed_ms": 1234 } }
```
Proposed:
```json
{
  "ok": true,
  "data": { ... },
  "meta": {
    "run_id": "a1b2c3d4",
    "elapsed_ms": 45230,
    "stages": [
      {
        "name": "ingest_issues",
        "elapsed_ms": 12340,
        "items_processed": 150,
        "items_skipped": 30,
        "errors": 0,
        "sub_stages": [
          { "name": "fetch_pages", "project": "group/repo", "elapsed_ms": 5200, "items_processed": 150 },
          { "name": "sync_discussions", "project": "group/repo", "elapsed_ms": 6800, "items_processed": 42, "items_skipped": 108 }
        ]
      },
      {
        "name": "ingest_mrs",
        "elapsed_ms": 18900,
        "items_processed": 85,
        "items_skipped": 12,
        "errors": 1
      },
      { "name": "generate_docs", "elapsed_ms": 8500, "items_processed": 235 },
      { "name": "embed", "elapsed_ms": 5490, "items_processed": 1024 }
    ]
  }
}
```
#### 4.6.3 Sync History Enrichment
**Prerequisite bug fix**: The `sync_runs` table (migration 001) exists with columns `id`, `started_at`, `heartbeat_at`, `finished_at`, `status`, `command`, `error`, `metrics_json` — but **no code ever writes to it**. The `sync_status.rs` command reads from it but always gets zero rows. This must be fixed before enrichment.
**Step 1: Wire up sync_runs lifecycle** (prerequisite, in Phase 4)
Add INSERT/UPDATE calls to the sync and ingest command handlers:
```rust
// On sync/ingest start:
INSERT INTO sync_runs (started_at, heartbeat_at, status, command)
VALUES (?now_ms, ?now_ms, 'running', ?command_name)
RETURNING id;
// On sync/ingest success:
UPDATE sync_runs
SET finished_at = ?now_ms, status = 'succeeded', metrics_json = ?metrics
WHERE id = ?run_id;
// On sync/ingest failure:
UPDATE sync_runs
SET finished_at = ?now_ms, status = 'failed', error = ?error_msg, metrics_json = ?metrics
WHERE id = ?run_id;
```
**Where**: Add a `SyncRunRecorder` helper in `src/core/db.rs` or `src/core/sync_run.rs` that encapsulates the INSERT/UPDATE lifecycle. Called from `run_sync()` in `src/cli/commands/sync.rs` and `run_ingest()` in `src/cli/commands/ingest.rs`.
**Step 2: Schema migration** (migration 014)
Add dedicated queryable columns alongside the existing `metrics_json`:
```sql
-- Migration 014: sync_runs enrichment for observability
ALTER TABLE sync_runs ADD COLUMN run_id TEXT;
ALTER TABLE sync_runs ADD COLUMN total_items_processed INTEGER DEFAULT 0;
ALTER TABLE sync_runs ADD COLUMN total_errors INTEGER DEFAULT 0;
-- Index for correlation queries
CREATE INDEX idx_sync_runs_run_id ON sync_runs(run_id);
```
The existing `metrics_json` column stores the detailed `Vec<StageTiming>` as a JSON array. No need for a separate `stages_json` column.
**Step 3: Enhanced sync-status display**
`lore sync-status` (`src/cli/commands/sync_status.rs`) currently shows only the last run. Enhance to show recent runs with metrics:
```
Recent sync runs:
Run a1b2c3 | 2026-02-04 14:32 | 45.2s | 235 items | 1 error
Run d4e5f6 | 2026-02-03 14:30 | 38.1s | 220 items | 0 errors
Run g7h8i9 | 2026-02-02 14:29 | 42.7s | 228 items | 0 errors
```
Robot mode (`lore --robot sync-status`):
```json
{
  "ok": true,
  "data": {
    "runs": [
      {
        "run_id": "a1b2c3d4",
        "started_at": "2026-02-04T14:32:01.123Z",
        "elapsed_ms": 45230,
        "status": "succeeded",
        "command": "sync",
        "total_items_processed": 235,
        "total_errors": 1,
        "stages": [...]
      }
    ],
    "cursors": [...],
    "summary": {...}
  }
}
```
The `stages` array is parsed from `metrics_json` and included in the robot output. Interactive mode shows the summary table above; `lore --robot sync-status --run a1b2c3` shows a single run's full stage breakdown.
#### 4.6.4 Human-Readable Timing
At the end of `lore sync` (interactive mode), print a timing summary:
```
Sync complete in 45.2s
  Ingest issues .... 12.3s (150 items, 42 discussions)
  Ingest MRs ....... 18.9s (85 items, 1 error)
  Generate docs .... 8.5s (235 documents)
  Embed ............ 5.5s (1024 chunks)
```
Gated behind `display.show_text` so it doesn't appear in progress_only or silent modes.
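The dot-padded layout can be produced by a small formatting helper along these lines (hypothetical helper; the pad width of 18 is chosen to reproduce the example above):

```rust
/// Format one line of the timing summary, padding the label with dots so
/// the durations line up, e.g. "  Ingest issues .... 12.3s (150 items)".
fn timing_line(label: &str, secs: f64, detail: &str) -> String {
    // Pad to a fixed column; saturating_sub avoids underflow on long labels.
    let dots = ".".repeat(18usize.saturating_sub(label.len() + 1));
    format!("  {} {} {:.1}s ({})", label, dots, secs, detail)
}
```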
### 4.7 Rate Limit and Retry Transparency
Currently, rate limits emit a `tracing::warn!`. Enhance to:
- Log at INFO level (not just WARN) with structured fields: `info!(path, attempt, retry_after_secs, "Rate limited, retrying")`.
- Count total rate-limit hits per run and include in stage timing.
- In `-v` mode, show retry progress on stderr: ` Retrying /api/v4/projects/123/issues (429, waiting 2s)`.
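A sketch of the per-stage counters the metrics layer might keep for this (hypothetical struct; only the two counters proposed for stage timing are modeled):

```rust
/// Minimal per-stage retry counters, as the metrics layer might
/// accumulate them while watching retry events.
#[derive(Default, Debug)]
struct RetryStats {
    rate_limit_hits: usize,
    retries: usize,
}

impl RetryStats {
    /// Record one retry attempt; 429 responses also count as rate-limit hits.
    fn record_attempt(&mut self, status_code: u16) {
        self.retries += 1;
        if status_code == 429 {
            self.rate_limit_hits += 1;
        }
    }
}
```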
### 4.8 Configuration
Add a new `logging` section to `Config` (`src/core/config.rs`):
```rust
#[derive(Debug, Clone, Deserialize)]
pub struct LoggingConfig {
    /// Directory for log files. Default: ~/.local/share/lore/logs/
    #[serde(default)]
    pub log_dir: Option<String>,
    /// Days to retain log files. Default: 30. Set to 0 to disable file logging.
    #[serde(default = "default_retention_days")]
    pub retention_days: u32,
    /// Enable JSON log files. Default: true.
    #[serde(default = "default_true")]
    pub file_logging: bool,
}

fn default_retention_days() -> u32 { 30 }
fn default_true() -> bool { true }
```
Add to the `Config` struct:
```rust
#[serde(default)]
pub logging: LoggingConfig,
```
With `config.json`:
```json
{
  "logging": {
    "log_dir": null,
    "retention_days": 30,
    "file_logging": true
  }
}
```
Defaults are sane so existing configs continue working with zero changes.
**CLI flags** (added to `Cli` struct in `src/cli/mod.rs`):
| Flag | Type | Default | Description |
|---|---|---|---|
| `-v` / `--verbose` | count (u8) | 0 | Increase stderr log verbosity. Stacks: `-v`, `-vv`, `-vvv`. |
| `--log-format` | text \| json | text | Stderr log format. `json` emits one JSON object per log line (same schema as file layer). |
These are global flags (`global = true`) consistent with the existing `--quiet`, `--robot`, etc.
---
## 5. Implementation Plan
### Phase 1: Verbosity Flags + Structured File Logging
**Scope**: CLI flags, dual-layer subscriber, file logging, rotation, retention, `--log-format`.
**Files touched**:
- `Cargo.toml` — add `tracing-appender` dependency
- `src/cli/mod.rs` — add `-v`/`--verbose` (count) and `--log-format` flags to `Cli` struct
- `src/main.rs` — replace subscriber initialization (lines 44-58) with dual-layer setup
- `src/core/config.rs` — add `LoggingConfig` struct and `logging` field to `Config`
- `src/core/paths.rs` — add `get_log_dir()` helper (XDG data dir + `/logs/`)
- `src/cli/commands/doctor.rs` — add log file location and disk usage check
**Implementation steps**:
1. Add `-v` / `--verbose` (count, `u8`) and `--log-format` (text|json) flags to `Cli` struct.
2. Add `tracing-appender` dependency to `Cargo.toml`.
3. Add `LoggingConfig` to `Config` with `#[serde(default)]`.
4. Add `get_log_dir()` to `src/core/paths.rs` (mirrors `get_db_path()` pattern).
5. Replace subscriber init in `main.rs`:
- Build `stderr_filter` from `-v` count (or `RUST_LOG` if set).
- Build `file_filter` as `lore=debug,warn` (or `RUST_LOG` if set).
- stderr layer: `fmt::layer().with_writer(SuspendingWriter)` with `stderr_filter`. When `--log-format json`, chain `.json()`.
- file layer: `fmt::layer().json().with_writer(tracing_appender::rolling::daily(log_dir, "lore"))` with `file_filter`.
- Combine via `registry().with(stderr_layer.with_filter(stderr_filter)).with(file_layer.with_filter(file_filter))`.
6. Implement log retention at startup: scan `log_dir`, delete files matching `lore.*.log` pattern older than `retention_days`. Run before subscriber init so deleted files aren't held open.
7. Add log file check to `lore doctor`: report log directory path, number of log files, total disk usage. In robot mode, add a `logging` field to `DoctorChecks` with `log_dir`, `file_count`, `total_bytes`, `oldest_file`.
**New dependencies**: `tracing-appender` (0.2)
**Interaction with `-q`/`--quiet`**: The existing `--quiet` flag suppresses non-error terminal output via `IngestDisplay::silent()`. It should NOT affect file logging (file layer is always on). When `-q` and `-v` are both passed, `-q` wins for stderr (set stderr filter to WARN+). File layer remains at DEBUG+.
**Tests** (see Section 6.1 for details):
- Unit: `EnvFilter` construction from verbosity count (0→INFO, 1→DEBUG, 2→DEBUG+deps, 3→TRACE)
- Unit: `RUST_LOG` overrides `-v` flags
- Unit: `-q` + `-v` interaction (quiet wins)
- Unit: `LoggingConfig` deserialization with missing/partial/full fields
- Unit: Log retention deletes old files, preserves recent ones
- Integration: Subscriber produces JSON lines to a test file
- Integration: `SuspendingWriter` still works with dual-layer stack (no garbled output)
### Phase 2: Spans + Correlation IDs
**Scope**: Tracing spans, UUID-based `run_id`, span recording for JSON logs.
**Depends on**: Phase 1 (subscriber must support span recording).
**Files touched**:
- `src/cli/commands/sync.rs` — add root span with `run_id` field to `run_sync()`
- `src/cli/commands/ingest.rs` — add `#[instrument]` spans to `run_ingest()` and its stages
- `src/ingestion/orchestrator.rs` — add spans for `fetch_pages`, `sync_discussions`, `fetch_resource_events`
- `src/documents/regenerator.rs` — add span for `generate_docs` stage
- `src/embedding/pipeline.rs` — add span for `embed` stage
- `src/main.rs` — generate `run_id` before calling command handler, pass as field
**Implementation steps**:
1. Generate `run_id` by taking the first 8 characters of a UUIDv4 string (e.g. `let run_id = Uuid::new_v4().to_string()[..8].to_owned();`) at command entry in `main.rs`. No new dependency needed — `uuid` v1 with v4 feature is already in `Cargo.toml`.
2. Create root span: `let _root = tracing::info_span!("sync", run_id = %run_id).entered();` (or equivalent for each command).
3. Add `#[instrument(skip_all, fields(stage = "ingest_issues"))]` to ingest stages.
4. Add `#[instrument(skip_all, fields(project = %project_path))]` to per-project functions.
5. Ensure the file layer's JSON formatter includes span context. `tracing-subscriber`'s `fmt::layer().json()` includes the current span chain by default when the registry has span storage enabled.
6. Verify: parse a log file, confirm every line includes `span.run_id`.
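The truncation itself is trivial; a hedged sketch (the real code would obtain `full_uuid` from `Uuid::new_v4().to_string()`; the helper name is illustrative):

```rust
/// Truncate a full hyphenated UUIDv4 string to the short 8-character
/// run_id. The first 8 characters of a UUID (before the first hyphen)
/// are always hex digits.
fn short_run_id(full_uuid: &str) -> String {
    full_uuid.chars().take(8).collect()
}
```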
**New dependencies**: None (`uuid` already present).
**Tests**:
- Unit: `run_id` is a valid 8-character hex string
- Integration: Run a sync-like operation with spans, parse JSON log output, verify every line contains `run_id` in span context
- Integration: Nested spans produce correct parent-child relationships in JSON output
### Phase 3: Performance Metrics Collection
**Scope**: `StageTiming` struct, span-to-metrics extraction, robot JSON enrichment, timing summary.
**Depends on**: Phase 2 (spans must exist to extract timing from).
**Files touched**:
- `src/core/metrics.rs` — new file: `StageTiming` struct, `MetricsLayer` (custom tracing layer), span-to-timing extraction
- `src/cli/commands/sync.rs` — consume `Vec<StageTiming>` from `MetricsLayer`, include in `SyncMeta`
- `src/cli/commands/ingest.rs` — same pattern for standalone ingest
- `src/main.rs` — register `MetricsLayer` in the subscriber stack
**Implementation steps**:
1. Define `StageTiming` struct with `sub_stages: Vec<StageTiming>` in `src/core/metrics.rs`.
2. Implement `MetricsLayer` as a custom `tracing_subscriber::Layer`:
- `on_new_span`: Record span ID, name, parent, start time.
- `on_record`: Capture structured fields (`items_processed`, `items_skipped`, `errors`) recorded via `Span::record()`.
- `on_close`: Calculate `elapsed_ms`, build `StageTiming` entry, attach to parent.
- Provide `fn extract_timings(&self, run_id: &str) -> Vec<StageTiming>` to materialize the span tree after the root span closes.
3. Store `MetricsLayer` reference (behind `Arc`) so command handlers can call `extract_timings()` after `run_sync()` completes.
4. Extend `SyncMeta` and `SyncJsonOutput` to include `run_id: String` and `stages: Vec<StageTiming>`.
5. Print human-readable timing summary at end of interactive sync (gated behind `IngestDisplay::show_text`).
**Span field recording**: Sync stages must record item counts as span fields for `MetricsLayer` to capture:
```rust
use tracing::field::Empty;

// Fields must be declared when the span is created (as Empty);
// calling record() on an undeclared field is silently a no-op.
let span = tracing::info_span!("ingest_issues", items_processed = Empty, items_skipped = Empty);
let _guard = span.enter();
// ... do work ...
span.record("items_processed", count);
span.record("items_skipped", skipped);
```
**Tests**:
- Unit: `StageTiming` serialization matches expected JSON (including nested `sub_stages`)
- Unit: `MetricsLayer` correctly builds span tree from synthetic span events
- Unit: `MetricsLayer` handles spans with no children (leaf stages like `embed`)
- Unit: `MetricsLayer` handles concurrent spans (multiple projects in parallel)
- Integration: `lore --robot sync` output includes `meta.stages` array with correct nesting
- Integration: Interactive sync prints timing summary table to stderr
### Phase 4: Sync History Enrichment
**Scope**: Wire up `sync_runs` INSERT/UPDATE lifecycle, schema migration, enhanced sync-status.
**Depends on**: Phase 3 (needs `Vec<StageTiming>` to store in `metrics_json`).
**Files touched**:
- `migrations/014_sync_runs_enrichment.sql` — new migration: add `run_id`, `total_items_processed`, `total_errors` columns + index
- `src/core/sync_run.rs` — new file: `SyncRunRecorder` struct encapsulating INSERT on start, UPDATE on finish
- `src/cli/commands/sync.rs` — create `SyncRunRecorder` before pipeline, finalize after
- `src/cli/commands/ingest.rs` — same pattern for standalone ingest
- `src/cli/commands/sync_status.rs` — enhance to show recent runs with metrics, parse `metrics_json`
**Implementation steps**:
1. Create migration `014_sync_runs_enrichment.sql`:
```sql
ALTER TABLE sync_runs ADD COLUMN run_id TEXT;
ALTER TABLE sync_runs ADD COLUMN total_items_processed INTEGER DEFAULT 0;
ALTER TABLE sync_runs ADD COLUMN total_errors INTEGER DEFAULT 0;
CREATE INDEX idx_sync_runs_run_id ON sync_runs(run_id);
```
Note: Migration number 014 assumes no other migration is added before this phase. If concurrent work adds migration 014, renumber accordingly.
2. Implement `SyncRunRecorder`:
```rust
pub struct SyncRunRecorder { id: i64, conn: Connection }

impl SyncRunRecorder {
    pub fn start(conn: &Connection, command: &str, run_id: &str) -> Result<Self>;
    pub fn succeed(self, metrics: &[StageTiming], total_items: usize, total_errors: usize) -> Result<()>;
    pub fn fail(self, error: &str, metrics: Option<&[StageTiming]>) -> Result<()>;
}
```
3. In `run_sync()`: create `SyncRunRecorder::start()` before pipeline, call `.succeed()` or `.fail()` after.
4. In `run_ingest()`: same pattern.
5. Enhance `sync_status.rs`:
- Query last N runs (default 10) instead of just the last 1.
- Parse `metrics_json` column to extract stage breakdown.
- Show `run_id`, duration, item counts, error counts in both interactive and robot modes.
- Add `--run <run_id>` flag to `sync-status` for single-run detail view.
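The recorder lifecycle can be modeled as a simple state machine. The sketch below tracks only the status transitions in memory; the real recorder would issue the INSERT/UPDATE statements from Step 1 against SQLite (all names here are illustrative):

```rust
/// In-memory model of one sync_runs row (status transitions only).
struct RunRecord {
    status: &'static str,
    error: Option<String>,
    finished: bool,
}

struct SyncRunRecorder {
    record: RunRecord,
}

impl SyncRunRecorder {
    /// Mirrors: INSERT ... status = 'running'
    fn start() -> Self {
        SyncRunRecorder {
            record: RunRecord { status: "running", error: None, finished: false },
        }
    }
    /// Mirrors: UPDATE ... status = 'succeeded', finished_at = now
    fn succeed(mut self) -> RunRecord {
        self.record.status = "succeeded";
        self.record.finished = true;
        self.record
    }
    /// Mirrors: UPDATE ... status = 'failed', error = ?, finished_at = now
    fn fail(mut self, error: &str) -> RunRecord {
        self.record.status = "failed";
        self.record.error = Some(error.to_string());
        self.record.finished = true;
        self.record
    }
}
```

Consuming `self` in `succeed`/`fail` makes double-finalization a compile error rather than a runtime bug.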
**Tests**:
- Unit: `SyncRunRecorder::start` inserts a row with status='running'
- Unit: `SyncRunRecorder::succeed` updates status, sets finished_at, writes metrics_json
- Unit: `SyncRunRecorder::fail` updates status, sets error, sets finished_at
- Unit: Migration 014 applies cleanly on top of migration 013
- Integration: `lore sync` creates a sync_runs row; `lore sync-status` displays it
- Integration: `lore --robot sync-status` JSON includes `runs` array with stage breakdowns
- Integration: Failed sync records error in sync_runs with partial metrics
### Phase 5: Rate Limit + Retry Instrumentation
**Scope**: Enhanced logging in GitLab client, retry counters in stage timing.
**Depends on**: Phase 2 (spans for context), Phase 3 (StageTiming for counters).
**Files touched**:
- `src/gitlab/client.rs` (or wherever the HTTP client with retry logic lives) — add structured fields to retry/rate-limit log events
- `src/core/metrics.rs` — add `rate_limit_hits` and `retries` fields to `StageTiming`
**Implementation steps**:
1. Find the retry/rate-limit handling code (likely in the GitLab HTTP client). Add structured tracing fields:
```rust
info!(
    path = %request_path,
    attempt = attempt_number,
    retry_after_secs = retry_after,
    status_code = 429,
    "Rate limited, retrying"
);
```
2. Add `rate_limit_hits: usize` and `retries: usize` fields to `StageTiming` (with `#[serde(skip_serializing_if = "is_zero")]`).
3. In `MetricsLayer`, count rate-limit and retry events within each span and include in `StageTiming`.
4. In `-v` mode, the existing stderr layer already shows INFO+ events, so retry activity becomes visible automatically. No additional work needed beyond step 1.
**Tests**:
- Unit: Rate-limit log events include all required structured fields
- Unit: `StageTiming` serialization includes `rate_limit_hits` and `retries` when non-zero, omits when zero
- Integration: Simulate 429 response, verify log line has `path`, `attempt`, `retry_after_secs` fields
- Integration: After simulated retries, `StageTiming` counts match expected values
---
## 6. Acceptance Criteria
### 6.1 Phase 1: Verbosity Flags + Structured File Logging
**Functional criteria**:
- [ ] `lore sync` writes JSON log lines to `~/.local/share/lore/logs/lore.YYYY-MM-DD.log` with zero configuration.
- [ ] `lore -v sync` shows DEBUG-level `lore::*` output on stderr; dependency output stays at WARN.
- [ ] `lore -vv sync` shows DEBUG-level `lore::*` + INFO-level dependency output on stderr.
- [ ] `lore -vvv sync` shows TRACE-level output for everything on stderr.
- [ ] `RUST_LOG=lore::gitlab=trace lore sync` overrides `-v` flags for both stderr and file layers.
- [ ] `lore --log-format json sync` emits JSON-formatted log lines on stderr (same schema as file layer).
- [ ] Log files rotate daily (new file per calendar day).
- [ ] Files matching `lore.*.log` older than `retention_days` are deleted on startup.
- [ ] Existing behavior is unchanged when no new flags are passed (INFO on stderr, human-readable format).
- [ ] `--quiet` suppresses non-error stderr output. `-q` + `-v` together: `-q` wins (stderr at WARN+).
- [ ] `--quiet` does NOT affect file logging (file layer remains at DEBUG+).
- [ ] `lore doctor` reports: log directory path, number of log files, total disk usage in bytes. Robot mode includes a `logging` field in the checks JSON.
- [ ] File layer always logs at DEBUG+ for `lore::*` crate regardless of `-v` flags.
**Test specifications**:
- `test_verbosity_filter_construction`: Given verbosity count 0/1/2/3, assert the resulting `EnvFilter` matches the expected directives table.
- `test_rust_log_overrides_verbose`: Set `RUST_LOG=lore=trace`, pass `-v` (count=1), assert the filter uses TRACE (not DEBUG).
- `test_quiet_overrides_verbose`: Pass `-q` and `-v` together, assert stderr filter is WARN+.
- `test_logging_config_defaults`: Deserialize an empty `{}` JSON as `LoggingConfig`, assert `retention_days=30`, `file_logging=true`, `log_dir=None`.
- `test_logging_config_partial`: Deserialize `{"retention_days": 7}`, assert `file_logging=true` default preserved.
- `test_log_retention_cleanup`: Create temp dir with files named `lore.2026-01-01.log` through `lore.2026-02-04.log`. Run retention with `retention_days=7`. Assert files older than 7 days are deleted, recent files preserved.
- `test_log_retention_ignores_non_log_files`: Create temp dir with `lore.2026-01-01.log` and `other.txt`. Run retention. Assert `other.txt` is NOT deleted.
- `test_json_log_output_format`: Capture file layer output, parse each line as JSON, assert keys: `timestamp`, `level`, `target`, `fields`, `span`.
- `test_suspending_writer_dual_layer`: Run a tracing event with both layers active and a progress bar. Assert no garbled output on stderr (no interleaved progress bar fragments in log lines).
### 6.2 Phase 2: Spans + Correlation IDs
**Functional criteria**:
- [ ] Every log line within a sync run includes `run_id` in the JSON span context.
- [ ] `jq 'select(.spans[]? | .run_id != null)' lore.2026-02-04.log` extracts all lines from a run (the `?` skips lines without a `spans` array instead of erroring).
- [ ] Nested spans produce a chain: log lines inside `fetch_pages` include both the `fetch_pages` span and the parent `ingest_issues` span in their span context.
- [ ] `run_id` is an 8-character hex string (truncated UUIDv4).
- [ ] Spans are visible in `-vv` stderr output as bracketed context.
**Test specifications**:
- `test_run_id_format`: Generate 100 run_ids, assert each is 8 chars, all hex characters.
- `test_run_id_uniqueness`: Generate 1000 run_ids, assert no duplicates.
- `test_span_context_in_json_logs`: Run a mock sync with spans, capture JSON log output, parse and verify each line has `spans` array containing `run_id`.
- `test_nested_span_chain`: Create parent span "sync" with child "ingest_issues" with child "fetch_pages". Emit a log event inside "fetch_pages". Assert the JSON log line's span chain includes all three span names.
- `test_span_elapsed_on_close`: Create a span, sleep 10ms, close it. Verify the close event records `elapsed_ms >= 10`.
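The `run_id` shape under test can be sketched without the `uuid` crate. The real generator truncates `Uuid::new_v4().to_string()` per decision 7; a fixed UUID literal stands in here:

```rust
/// run_id = first 8 chars of a v4 UUID's lowercase-hex string form.
fn short_run_id(full_uuid: &str) -> String {
    full_uuid.chars().take(8).collect()
}

/// The property test_run_id_format asserts: exactly 8 chars, all lowercase hex.
fn is_valid_run_id(id: &str) -> bool {
    id.len() == 8 && id.bytes().all(|b| b.is_ascii_hexdigit() && !b.is_ascii_uppercase())
}

fn main() {
    let id = short_run_id("a1b2c3d4-e5f6-4789-a012-3456789abcde");
    assert_eq!(id, "a1b2c3d4");
    assert!(is_valid_run_id(&id));
    println!("{id}");
}
```

The first 8 characters of a v4 UUID land before the first hyphen, so no stripping is needed.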
### 6.3 Phase 3: Performance Metrics Collection
**Functional criteria**:
- [ ] `lore --robot sync` JSON includes `meta.run_id` (string) and `meta.stages` (array).
- [ ] Each stage in `meta.stages` has: `name`, `elapsed_ms`, `items_processed`.
- [ ] Top-level stages (ingest_issues, ingest_mrs, generate_docs, embed) have `sub_stages` arrays.
- [ ] Sub-stages include `project` field when applicable.
- [ ] `lore sync` (interactive) prints a timing summary table on stderr, gated behind `IngestDisplay::show_text`.
- [ ] `lore -q sync` does NOT print the timing summary.
- [ ] Zero-value fields (`items_skipped: 0`, `errors: 0`) are omitted from JSON output.
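The omit-when-zero rule can be sketched as follows. The real type would derive serde with `skip_serializing_if`; this hand-rolled serializer exists only to make the rule concrete, and the field set is trimmed:

```rust
/// Trimmed-down StageTiming carrying just enough fields to show the rule.
struct StageTiming {
    name: &'static str,
    elapsed_ms: u64,
    items_processed: u64,
    errors: u64,                  // omitted from JSON when zero
    sub_stages: Vec<StageTiming>, // omitted when empty
}

impl StageTiming {
    fn to_json(&self) -> String {
        let mut parts = vec![
            format!("\"name\":\"{}\"", self.name),
            format!("\"elapsed_ms\":{}", self.elapsed_ms),
            format!("\"items_processed\":{}", self.items_processed),
        ];
        if self.errors > 0 {
            parts.push(format!("\"errors\":{}", self.errors));
        }
        if !self.sub_stages.is_empty() {
            let subs: Vec<_> = self.sub_stages.iter().map(|s| s.to_json()).collect();
            parts.push(format!("\"sub_stages\":[{}]", subs.join(",")));
        }
        format!("{{{}}}", parts.join(","))
    }
}

fn main() {
    let stage = StageTiming {
        name: "generate_docs",
        elapsed_ms: 412,
        items_processed: 87,
        errors: 0,
        sub_stages: vec![],
    };
    // test_stage_timing_zero_fields_omitted: no "errors" key in the output.
    assert!(!stage.to_json().contains("errors"));
    println!("{}", stage.to_json());
}
```

With serde, the equivalent is `#[serde(skip_serializing_if = "is_zero")]` on the count fields and `skip_serializing_if = "Vec::is_empty"` on `sub_stages`.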
**Test specifications**:
- `test_stage_timing_serialization`: Create a `StageTiming` with sub_stages, serialize to JSON, assert structure matches PRD example.
- `test_stage_timing_zero_fields_omitted`: Create `StageTiming` with `errors: 0`, serialize, assert no `errors` key in output.
- `test_metrics_layer_single_span`: Create `MetricsLayer`, enter/exit one span with recorded fields, extract timings, assert one `StageTiming` entry.
- `test_metrics_layer_nested_spans`: Create parent + child spans, extract timings, assert parent has child in `sub_stages`.
- `test_metrics_layer_parallel_spans`: Create two sibling spans (simulating two projects), extract timings, assert both appear as sub_stages of parent.
- `test_sync_meta_includes_stages`: Mock a sync pipeline, verify robot JSON output parses correctly with `meta.stages`.
- `test_timing_summary_format`: Capture stderr during interactive sync, verify timing table format matches PRD example.
### 6.4 Phase 4: Sync History Enrichment
**Functional criteria**:
- [ ] `lore sync` creates a row in `sync_runs` with status='running' at start, updated to 'succeeded'/'failed' at finish.
- [ ] `lore ingest issues` also creates a `sync_runs` row.
- [ ] `sync_runs.run_id` matches the `run_id` in log files and robot JSON.
- [ ] `sync_runs.metrics_json` contains the serialized `Vec<StageTiming>`.
- [ ] `sync_runs.total_items_processed` and `total_errors` are populated.
- [ ] `lore sync-status` shows the last 10 runs with: run_id, timestamp, duration, item count, error count.
- [ ] `lore --robot sync-status` JSON includes `runs` array with `stages` parsed from `metrics_json`.
- [ ] Failed syncs record the error message and any partial metrics collected before failure.
- [ ] Migration 014 applies cleanly and is idempotent (safe to re-run).
**Test specifications**:
- `test_sync_run_recorder_start`: Call `start()`, query sync_runs, assert one row with status='running'.
- `test_sync_run_recorder_succeed`: Call `start()` then `succeed()`, assert row has status='succeeded', finished_at set, metrics_json parseable.
- `test_sync_run_recorder_fail`: Call `start()` then `fail()`, assert row has status='failed', error set.
- `test_sync_run_recorder_fail_with_partial_metrics`: Call `start()`, collect some metrics, then `fail()`. Assert metrics_json contains partial data.
- `test_migration_014_applies`: Apply all migrations 001-014 on a fresh DB. Assert `sync_runs` has `run_id`, `total_items_processed`, `total_errors` columns.
- `test_migration_014_idempotent`: Apply migration 014 twice. Assert no error on second apply.
- `test_sync_status_shows_runs`: Insert 3 sync_runs rows, run `print_sync_status()`, assert output includes all 3 with correct formatting.
- `test_sync_status_json_includes_stages`: Insert a sync_runs row with metrics_json, run robot-mode sync-status, parse JSON, assert `runs[0].stages` is an array.
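The recorder lifecycle under test can be sketched as follows. The SQL follows the PRD's column names, but the `execute` stub just collects statements for inspection; the real implementation would bind parameters and run them on the SQLite connection rather than interpolating strings:

```rust
/// Stand-in recorder: collects the SQL it would execute, for illustration only.
struct SyncRunRecorder {
    run_id: String,
    executed: Vec<String>, // stand-in for a real SQLite connection
}

impl SyncRunRecorder {
    fn start(run_id: &str) -> Self {
        let mut r = SyncRunRecorder { run_id: run_id.to_string(), executed: Vec::new() };
        let sql = format!(
            "INSERT INTO sync_runs (run_id, status, started_at) \
             VALUES ('{}', 'running', strftime('%s','now'))",
            r.run_id
        );
        r.execute(sql);
        r
    }

    fn succeed(&mut self, metrics_json: &str, items: u64, errors: u64) {
        let sql = format!(
            "UPDATE sync_runs SET status='succeeded', finished_at=strftime('%s','now'), \
             metrics_json='{}', total_items_processed={}, total_errors={} WHERE run_id='{}'",
            metrics_json, items, errors, self.run_id
        );
        self.execute(sql);
    }

    fn fail(&mut self, error: &str, partial_metrics_json: &str) {
        let sql = format!(
            "UPDATE sync_runs SET status='failed', finished_at=strftime('%s','now'), \
             error='{}', metrics_json='{}' WHERE run_id='{}'",
            error, partial_metrics_json, self.run_id
        );
        self.execute(sql);
    }

    fn execute(&mut self, sql: String) {
        self.executed.push(sql);
    }
}

fn main() {
    let mut rec = SyncRunRecorder::start("a1b2c3d4");
    rec.succeed("[]", 590, 0);
    assert!(rec.executed[0].contains("'running'"));
    assert!(rec.executed[1].contains("status='succeeded'"));
    println!("ok");
}
```

`fail` taking partial metrics is what makes `test_sync_run_recorder_fail_with_partial_metrics` satisfiable: whatever the `MetricsLayer` collected before the error is still persisted.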
### 6.5 Phase 5: Rate Limit + Retry Instrumentation
**Functional criteria**:
- [ ] Rate-limit events (HTTP 429) log at INFO with structured fields: `path`, `attempt`, `retry_after_secs`, `status_code`.
- [ ] Retry events (non-429 transient errors) log with: `path`, `attempt`, `error`.
- [ ] `StageTiming` includes `rate_limit_hits` and `retries` counts (omitted when zero).
- [ ] `lore -v sync` shows retry activity on stderr (visible because it's INFO+).
- [ ] Rate limit counts are included in `metrics_json` stored in `sync_runs`.
**Test specifications**:
- `test_rate_limit_log_fields`: Simulate a 429 response, capture log output, parse JSON, assert fields: `path`, `attempt`, `retry_after_secs`, `status_code`.
- `test_retry_log_fields`: Simulate a transient error + retry, capture log, assert fields: `path`, `attempt`, `error`.
- `test_stage_timing_rate_limit_counts`: Simulate 3 rate-limit hits within a span, extract `StageTiming`, assert `rate_limit_hits == 3`.
- `test_stage_timing_retry_counts`: Simulate 2 retries, extract `StageTiming`, assert `retries == 2`.
- `test_rate_limit_fields_omitted_when_zero`: Create `StageTiming` with zero rate limits, serialize, assert no `rate_limit_hits` key.
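The retry paths being instrumented can be sketched as a classification step. The status-code buckets follow behavior described elsewhere in this PRD (429 = rate limit, 403/404 = permanent, 5xx = transient); the type and function names, and the 30s fallback wait, are illustrative assumptions:

```rust
/// What the retry loop does with a response, before emitting the structured event.
#[derive(Debug, PartialEq)]
enum RetryDecision {
    Ok,
    RateLimited { retry_after_secs: u64 }, // logs path/attempt/retry_after_secs/status_code
    Transient,                             // logs path/attempt/error, then backs off
    Permanent,                             // no retry; surfaced as an error
}

fn classify(status: u16, retry_after: Option<u64>) -> RetryDecision {
    match status {
        200..=299 => RetryDecision::Ok,
        429 => RetryDecision::RateLimited {
            // Fall back to an assumed default wait when Retry-After is missing.
            retry_after_secs: retry_after.unwrap_or(30),
        },
        403 | 404 => RetryDecision::Permanent,
        500..=599 => RetryDecision::Transient,
        _ => RetryDecision::Permanent,
    }
}

fn main() {
    assert_eq!(classify(429, Some(12)), RetryDecision::RateLimited { retry_after_secs: 12 });
    assert_eq!(classify(502, None), RetryDecision::Transient);
    assert_eq!(classify(404, None), RetryDecision::Permanent);
    println!("ok");
}
```

Keeping classification separate from the loop lets the `MetricsLayer` count `RateLimited` and `Transient` outcomes per span without parsing log lines.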
---
## 7. Resolved Decisions
1. **Log format**: Use `tracing-subscriber`'s built-in JSON formatter (`fmt::layer().json()`). Zero custom code, battle-tested, and ecosystem tools (Grafana Loki, Datadog) already parse this format. The schema difference from our robot JSON envelope is cosmetic and not worth the maintenance burden of a custom formatter.
2. **Span recording**: Always-on. lore is I/O-bound (GitLab API + SQLite), so the nanosecond-level overhead of span storage and chain lookup is unmeasurable against our millisecond-scale operations. Conditional recording would add subscriber construction complexity for zero practical benefit.
3. **Log file location**: `~/.local/share/lore/logs/` (XDG data directory). Logs are NOT reproducible — you can generate new logs, but you cannot regenerate the exact diagnostic output from a past run. They are forensic artifacts that users would notice missing, so they belong in data, not cache.
4. **Retention**: In scope for Phase 1. Startup cleanup: scan log directory, delete files matching `lore.*.log` older than `retention_days` (default 30). Simple, no background threads, no external dependencies. Runs before subscriber initialization so deleted file handles aren't held.
5. **Stage timing granularity**: Per-project with nested sub-stages. When one project has 500 MRs and another has 3, knowing which one consumed the time budget is the difference between "sync was slow" and actionable diagnosis. The `StageTiming` struct includes an optional `project` field and a `sub_stages: Vec<StageTiming>` field for nesting.
6. **Stage timing collection mechanism**: Materialized from tracing spans, not plumbed manually. A custom `MetricsLayer` in the subscriber stack records span enter/exit/record events and builds the `StageTiming` tree. This avoids threading a mutable collector through every function signature and makes spans the single source of truth for timing data. Phase 2 adds spans; Phase 3 adds the layer that reads them.
7. **run_id format**: First 8 characters of `Uuid::new_v4().to_string()` (e.g., `"a1b2c3d4"`). The `uuid` crate (v1, v4 feature) is already a dependency. No new crate needed. 8 characters provide ~4 billion unique values — more than sufficient for local CLI invocations.
8. **File log level**: Always DEBUG+ for `lore::*` crate, WARN+ for dependencies, regardless of `-v` flags. This ensures post-mortem data is always richer than what was shown on stderr. `RUST_LOG` overrides both layers when set.
9. **sync_runs lifecycle**: The table exists (migration 001) but nothing writes to it. Phase 4 wires up the INSERT (on start) / UPDATE (on finish) lifecycle AND adds enrichment columns in a single migration. The existing `metrics_json` column stores the detailed `Vec<StageTiming>` array — no need for a separate `stages_json` column.
10. **JSON stderr via --log-format**: A `--log-format text|json` global flag controls stderr log format. Default is `text` (human-readable). When `json`, stderr uses the same JSON formatter as the file layer, routed through `SuspendingWriter` for progress bar coordination. This enables `lore sync 2>&1 | jq` workflows without reading log files.
---
## 8. Phase Dependency Graph
```
Phase 1 (Subscriber + Flags)
|
v
Phase 2 (Spans + run_id)
|
+------+------+
| |
v v
Phase 3 Phase 5
(Metrics) (Rate Limit Logging)
| |
v |
Phase 4 |
(Sync History) <--+
```
**Parallelization opportunities**:
- Phase 1 must complete before anything else.
- Phase 2 must complete before Phase 3 or Phase 5.
- Phase 3 and Phase 5 can run in parallel (Phase 5 only needs spans from Phase 2, not MetricsLayer from Phase 3).
- Phase 4 depends on Phase 3 (needs `Vec<StageTiming>` to store). Phase 5's `rate_limit_hits`/`retries` fields on `StageTiming` can be added to Phase 4's stored data after Phase 5 completes, or Phase 4 can store them as zero initially.
**Agent assignment suggestion**:
- Agent A: Phase 1 → Phase 2 (sequential, foundational infrastructure)
- Agent B: Phase 3 (after Phase 2 completes)
- Agent C: Phase 5 (after Phase 2 completes, parallel with Phase 3)
- Agent B or D: Phase 4 (after Phase 3 completes)
---
## 9. References
- Gholamian, S. & Ward, P. (2021). "A Comprehensive Survey of Logging in Software." arXiv:2110.12489.
- Duan, S. et al. (2025). "PDLogger: Automated Logging Framework for Practical Software Development." arXiv:2507.19951.
- tokio-rs/tracing ecosystem: `tracing`, `tracing-subscriber`, `tracing-appender`.
- GNU Coding Standards: Verbosity and diagnostic output conventions.
- Rust CLI Working Group: Recommendations for error reporting and verbosity.
gitlore-sync-explorer.html Normal file
@@ -0,0 +1,844 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Gitlore Sync Pipeline Explorer</title>
<style>
:root {
--bg: #0d1117;
--bg-secondary: #161b22;
--bg-tertiary: #1c2129;
--border: #30363d;
--text: #c9d1d9;
--text-dim: #8b949e;
--text-bright: #f0f6fc;
--cyan: #58a6ff;
--green: #3fb950;
--amber: #d29922;
--red: #f85149;
--purple: #bc8cff;
--pink: #f778ba;
--cyan-dim: rgba(88,166,255,0.15);
--green-dim: rgba(63,185,80,0.15);
--amber-dim: rgba(210,153,34,0.15);
--red-dim: rgba(248,81,73,0.15);
--purple-dim: rgba(188,140,255,0.15);
}
* { margin: 0; padding: 0; box-sizing: border-box; }
body {
font-family: 'SF Mono', 'Cascadia Code', 'Fira Code', 'JetBrains Mono', monospace;
background: var(--bg); color: var(--text);
display: flex; height: 100vh; overflow: hidden;
}
.sidebar {
width: 220px; min-width: 220px; background: var(--bg-secondary);
border-right: 1px solid var(--border); display: flex; flex-direction: column; padding: 16px 0;
}
.sidebar-title {
font-size: 11px; font-weight: 700; text-transform: uppercase;
letter-spacing: 1.2px; color: var(--text-dim); padding: 0 16px 12px;
}
.logo {
padding: 0 16px 20px; font-size: 15px; font-weight: 700; color: var(--cyan);
display: flex; align-items: center; gap: 8px;
}
.logo svg { width: 20px; height: 20px; }
.nav-item {
padding: 10px 16px; cursor: pointer; font-size: 13px; color: var(--text-dim);
transition: all 0.15s; border-left: 3px solid transparent;
display: flex; align-items: center; gap: 10px;
}
.nav-item:hover { background: var(--bg-tertiary); color: var(--text); }
.nav-item.active { background: var(--cyan-dim); color: var(--cyan); border-left-color: var(--cyan); }
.nav-dot { width: 8px; height: 8px; border-radius: 50%; flex-shrink: 0; }
.main { flex: 1; display: flex; flex-direction: column; overflow: hidden; }
.header {
padding: 16px 24px; border-bottom: 1px solid var(--border);
display: flex; align-items: center; justify-content: space-between;
}
.header h1 { font-size: 16px; font-weight: 600; color: var(--text-bright); }
.header-badge {
font-size: 11px; padding: 3px 10px; border-radius: 12px;
background: var(--cyan-dim); color: var(--cyan);
}
.canvas-wrapper { flex: 1; overflow: auto; position: relative; }
.canvas { padding: 32px; min-height: 100%; }
.flow-container { display: none; }
.flow-container.active { display: block; }
.phase { margin-bottom: 32px; }
.phase-header { display: flex; align-items: center; gap: 12px; margin-bottom: 16px; }
.phase-number {
width: 28px; height: 28px; border-radius: 50%; display: flex; align-items: center;
justify-content: center; font-size: 13px; font-weight: 700; flex-shrink: 0;
}
.phase-title { font-size: 14px; font-weight: 600; color: var(--text-bright); }
.phase-subtitle { font-size: 11px; color: var(--text-dim); margin-left: 4px; font-weight: 400; }
.flow-row {
display: flex; align-items: stretch; gap: 0; flex-wrap: wrap;
margin-left: 14px; padding-left: 26px; border-left: 2px solid var(--border);
}
.flow-row:last-child { border-left-color: transparent; }
.node {
position: relative; padding: 12px 16px; border-radius: 8px;
border: 1px solid var(--border); background: var(--bg-secondary);
font-size: 12px; cursor: pointer; transition: all 0.2s;
min-width: 180px; max-width: 260px; margin: 4px 0;
}
.node:hover {
border-color: var(--cyan); transform: translateY(-1px);
box-shadow: 0 4px 12px rgba(0,0,0,0.3);
}
.node.selected {
border-color: var(--cyan);
box-shadow: 0 0 0 1px var(--cyan), 0 4px 16px rgba(88,166,255,0.15);
}
.node-title { font-weight: 600; font-size: 12px; margin-bottom: 4px; color: var(--text-bright); }
.node-desc { font-size: 11px; color: var(--text-dim); line-height: 1.5; }
.node.api { border-left: 3px solid var(--cyan); }
.node.transform { border-left: 3px solid var(--purple); }
.node.db { border-left: 3px solid var(--green); }
.node.decision { border-left: 3px solid var(--amber); }
.node.error { border-left: 3px solid var(--red); }
.node.queue { border-left: 3px solid var(--pink); }
.arrow {
display: flex; align-items: center; padding: 0 6px;
color: var(--text-dim); font-size: 16px; flex-shrink: 0;
}
.arrow-down {
display: flex; justify-content: center; padding: 4px 0;
color: var(--text-dim); font-size: 16px; margin-left: 14px;
padding-left: 26px; border-left: 2px solid var(--border);
}
.branch-container {
margin-left: 14px; padding-left: 26px;
border-left: 2px solid var(--border); padding-bottom: 8px;
}
.branch-row { display: flex; gap: 12px; margin: 8px 0; flex-wrap: wrap; }
.branch-label {
font-size: 11px; font-weight: 600; margin: 8px 0 4px;
display: flex; align-items: center; gap: 6px;
}
.branch-label.success { color: var(--green); }
.branch-label.error { color: var(--red); }
.branch-label.retry { color: var(--amber); }
.diff-badge {
display: inline-block; font-size: 10px; padding: 2px 6px;
border-radius: 4px; margin-top: 6px; font-weight: 600;
}
.diff-badge.changed { background: var(--amber-dim); color: var(--amber); }
.diff-badge.same { background: var(--green-dim); color: var(--green); }
.detail-panel {
position: fixed; right: 0; top: 0; bottom: 0; width: 380px;
background: var(--bg-secondary); border-left: 1px solid var(--border);
transform: translateX(100%); transition: transform 0.25s ease;
z-index: 100; display: flex; flex-direction: column; overflow: hidden;
}
.detail-panel.open { transform: translateX(0); }
.detail-header {
padding: 16px 20px; border-bottom: 1px solid var(--border);
display: flex; align-items: center; justify-content: space-between;
}
.detail-header h2 { font-size: 14px; font-weight: 600; color: var(--text-bright); }
.detail-close {
cursor: pointer; color: var(--text-dim); font-size: 18px;
background: none; border: none; padding: 4px 8px; border-radius: 4px;
}
.detail-close:hover { background: var(--bg-tertiary); color: var(--text); }
.detail-body { flex: 1; overflow-y: auto; padding: 20px; }
.detail-section { margin-bottom: 20px; }
.detail-section h3 {
font-size: 11px; text-transform: uppercase; letter-spacing: 0.8px;
color: var(--text-dim); margin-bottom: 8px;
}
.detail-section p { font-size: 12px; line-height: 1.7; color: var(--text); }
.sql-block {
background: var(--bg); border: 1px solid var(--border); border-radius: 6px;
padding: 12px; font-size: 11px; line-height: 1.6; color: var(--green);
overflow-x: auto; white-space: pre; margin-top: 8px;
}
.detail-tag {
display: inline-block; font-size: 10px; padding: 2px 8px;
border-radius: 10px; margin: 2px 4px 2px 0;
}
.detail-tag.file { background: var(--purple-dim); color: var(--purple); }
.detail-tag.type-api { background: var(--cyan-dim); color: var(--cyan); }
.detail-tag.type-db { background: var(--green-dim); color: var(--green); }
.detail-tag.type-transform { background: var(--purple-dim); color: var(--purple); }
.detail-tag.type-decision { background: var(--amber-dim); color: var(--amber); }
.detail-tag.type-error { background: var(--red-dim); color: var(--red); }
.detail-tag.type-queue { background: rgba(247,120,186,0.15); color: var(--pink); }
.watermark-panel { border-top: 1px solid var(--border); background: var(--bg-secondary); }
.watermark-toggle {
padding: 10px 24px; cursor: pointer; font-size: 12px; color: var(--text-dim);
display: flex; align-items: center; gap: 8px; user-select: none;
}
.watermark-toggle:hover { color: var(--text); }
.watermark-toggle .chevron { transition: transform 0.2s; font-size: 10px; }
.watermark-toggle .chevron.open { transform: rotate(180deg); }
.watermark-content { display: none; padding: 0 24px 16px; max-height: 260px; overflow-y: auto; }
.watermark-content.open { display: block; }
.wm-table { width: 100%; border-collapse: collapse; font-size: 11px; }
.wm-table th {
text-align: left; padding: 6px 12px; color: var(--text-dim); font-weight: 600;
border-bottom: 1px solid var(--border); font-size: 10px;
text-transform: uppercase; letter-spacing: 0.5px;
}
.wm-table td { padding: 6px 12px; border-bottom: 1px solid var(--border); color: var(--text); }
.wm-table td:first-child { color: var(--cyan); font-weight: 600; }
.wm-table td:nth-child(2) { color: var(--green); }
.overview-pipeline { display: flex; gap: 0; align-items: stretch; margin: 24px 0; flex-wrap: wrap; }
.overview-stage {
flex: 1; min-width: 200px; background: var(--bg-secondary);
border: 1px solid var(--border); border-radius: 10px; padding: 20px;
cursor: pointer; transition: all 0.2s;
}
.overview-stage:hover {
border-color: var(--cyan); transform: translateY(-2px);
box-shadow: 0 6px 20px rgba(0,0,0,0.3);
}
.overview-arrow { display: flex; align-items: center; padding: 0 8px; font-size: 20px; color: var(--text-dim); }
.stage-num { font-size: 10px; font-weight: 700; text-transform: uppercase; letter-spacing: 1px; margin-bottom: 8px; }
.stage-title { font-size: 15px; font-weight: 700; color: var(--text-bright); margin-bottom: 6px; }
.stage-desc { font-size: 11px; color: var(--text-dim); line-height: 1.6; }
.stage-detail {
margin-top: 12px; padding-top: 12px; border-top: 1px solid var(--border);
font-size: 11px; color: var(--text-dim); line-height: 1.6;
}
.stage-detail code {
color: var(--amber); background: var(--amber-dim); padding: 1px 5px;
border-radius: 3px; font-size: 10px;
}
.info-box {
background: var(--bg-tertiary); border: 1px solid var(--border);
border-radius: 8px; padding: 16px; margin: 16px 0; font-size: 12px; line-height: 1.7;
}
.info-box-title { font-weight: 600; color: var(--cyan); margin-bottom: 6px; display: flex; align-items: center; gap: 6px; }
.info-box ul { margin-left: 16px; color: var(--text-dim); }
.info-box li { margin: 4px 0; }
.info-box code {
color: var(--amber); background: var(--amber-dim);
padding: 1px 5px; border-radius: 3px; font-size: 11px;
}
.legend {
display: flex; gap: 16px; flex-wrap: wrap; margin-bottom: 24px;
padding: 12px 16px; background: var(--bg-secondary);
border: 1px solid var(--border); border-radius: 8px;
}
.legend-item { display: flex; align-items: center; gap: 6px; font-size: 11px; color: var(--text-dim); }
.legend-color { width: 12px; height: 3px; border-radius: 2px; }
::-webkit-scrollbar { width: 8px; height: 8px; }
::-webkit-scrollbar-track { background: transparent; }
::-webkit-scrollbar-thumb { background: var(--border); border-radius: 4px; }
::-webkit-scrollbar-thumb:hover { background: var(--text-dim); }
</style>
</head>
<body>
<div class="sidebar">
<div class="logo">
<svg viewBox="0 0 20 20" fill="none" stroke="currentColor" stroke-width="1.5">
<circle cx="10" cy="10" r="8"/><path d="M10 6v4l3 2"/>
</svg>
lore sync
</div>
<div class="sidebar-title">Entity Flows</div>
<div class="nav-item active" data-view="overview" onclick="switchView('overview')">
<div class="nav-dot" style="background:var(--cyan)"></div>Full Sync Overview
</div>
<div class="nav-item" data-view="issues" onclick="switchView('issues')">
<div class="nav-dot" style="background:var(--green)"></div>Issues
</div>
<div class="nav-item" data-view="mrs" onclick="switchView('mrs')">
<div class="nav-dot" style="background:var(--purple)"></div>Merge Requests
</div>
<div class="nav-item" data-view="docs" onclick="switchView('docs')">
<div class="nav-dot" style="background:var(--amber)"></div>Documents
</div>
<div class="nav-item" data-view="embed" onclick="switchView('embed')">
<div class="nav-dot" style="background:var(--pink)"></div>Embeddings
</div>
</div>
<div class="main">
<div class="header">
<h1 id="view-title">Full Sync Overview</h1>
<span class="header-badge" id="view-badge">4 stages</span>
</div>
<div class="canvas-wrapper"><div class="canvas">
<!-- OVERVIEW -->
<div class="flow-container active" id="view-overview">
<div class="overview-pipeline">
<div class="overview-stage" onclick="switchView('issues')">
<div class="stage-num" style="color:var(--green)">Stage 1</div>
<div class="stage-title">Ingest Issues</div>
<div class="stage-desc">Fetch issues + discussions + resource events from GitLab API</div>
<div class="stage-detail">Cursor-based incremental sync.<br>Sequential discussion fetch.<br>Queue-based resource events.</div>
</div>
<div class="overview-arrow">&rarr;</div>
<div class="overview-stage" onclick="switchView('mrs')">
<div class="stage-num" style="color:var(--purple)">Stage 2</div>
<div class="stage-title">Ingest MRs</div>
<div class="stage-desc">Fetch merge requests + discussions + resource events</div>
<div class="stage-detail">Page-based incremental sync.<br>Parallel prefetch discussions.<br>Queue-based resource events.</div>
</div>
<div class="overview-arrow">&rarr;</div>
<div class="overview-stage" onclick="switchView('docs')">
<div class="stage-num" style="color:var(--amber)">Stage 3</div>
<div class="stage-title">Generate Docs</div>
<div class="stage-desc">Regenerate searchable documents for changed entities</div>
<div class="stage-detail">Driven by <code>dirty_sources</code> table.<br>Triple-hash skip optimization.<br>FTS5 index auto-updated.</div>
</div>
<div class="overview-arrow">&rarr;</div>
<div class="overview-stage" onclick="switchView('embed')">
<div class="stage-num" style="color:var(--pink)">Stage 4</div>
<div class="stage-title">Embed</div>
<div class="stage-desc">Generate vector embeddings via Ollama for semantic search</div>
<div class="stage-detail">Hash-based change detection.<br>Chunked, batched API calls.<br><b>Non-fatal</b> &mdash; graceful if Ollama down.</div>
</div>
</div>
<div class="info-box">
<div class="info-box-title">Concurrency Model</div>
<ul>
<li>Stages 1 &amp; 2 process <b>projects concurrently</b> via <code>buffer_unordered(primary_concurrency)</code></li>
<li>Each project gets its own <b>SQLite connection</b>; rate limiter is <b>shared</b></li>
<li>Discussions: <b>sequential</b> (issues) or <b>batched parallel prefetch</b> (MRs)</li>
<li>Resource events use a <b>persistent job queue</b> with atomic claim + exponential backoff</li>
</ul>
</div>
<div class="info-box">
<div class="info-box-title">Sync Flags</div>
<ul>
<li><code>--full</code> &mdash; Resets all cursors &amp; watermarks, forces complete re-fetch</li>
<li><code>--no-docs</code> &mdash; Skips Stage 3 (document generation)</li>
<li><code>--no-embed</code> &mdash; Skips Stage 4 (embedding generation)</li>
<li><code>--force</code> &mdash; Overrides stale single-flight lock</li>
<li><code>--project &lt;path&gt;</code> &mdash; Sync only one project (fuzzy matching)</li>
</ul>
</div>
<div class="info-box">
<div class="info-box-title">Single-Flight Lock</div>
<ul>
<li>Table-based lock (<code>AppLock</code>) prevents concurrent syncs</li>
<li>Heartbeat keeps the lock alive; stale locks auto-detected</li>
<li>Use <code>--force</code> to override a stale lock</li>
</ul>
</div>
</div>
<!-- ISSUES -->
<div class="flow-container" id="view-issues">
<div class="legend">
<div class="legend-item"><div class="legend-color" style="background:var(--cyan)"></div>API Call</div>
<div class="legend-item"><div class="legend-color" style="background:var(--purple)"></div>Transform</div>
<div class="legend-item"><div class="legend-color" style="background:var(--green)"></div>Database</div>
<div class="legend-item"><div class="legend-color" style="background:var(--amber)"></div>Decision</div>
<div class="legend-item"><div class="legend-color" style="background:var(--red)"></div>Error Path</div>
<div class="legend-item"><div class="legend-color" style="background:var(--pink)"></div>Queue</div>
</div>
<div class="phase">
<div class="phase-header">
<div class="phase-number" style="background:var(--cyan-dim);color:var(--cyan)">1</div>
<div class="phase-title">Fetch Issues <span class="phase-subtitle">Cursor-Based Incremental Sync</span></div>
</div>
<div class="flow-row">
<div class="node api" data-detail="issue-api-call"><div class="node-title">GitLab API Call</div><div class="node-desc">paginate_issues() with<br>updated_after = cursor - rewind</div></div>
<div class="arrow">&rarr;</div>
<div class="node decision" data-detail="issue-cursor-filter"><div class="node-title">Cursor Filter</div><div class="node-desc">updated_at &gt; cursor_ts<br>OR tie_breaker check</div></div>
<div class="arrow">&rarr;</div>
<div class="node transform" data-detail="issue-transform"><div class="node-title">transform_issue()</div><div class="node-desc">GitLab API shape &rarr;<br>local DB row shape</div></div>
<div class="arrow">&rarr;</div>
<div class="node db" data-detail="issue-transaction"><div class="node-title">Transaction</div><div class="node-desc">store_payload &rarr; upsert &rarr;<br>mark_dirty &rarr; relink</div></div>
</div>
<div class="arrow-down">&darr;</div>
<div class="flow-row">
<div class="node db" data-detail="issue-cursor-update"><div class="node-title">Update Cursor</div><div class="node-desc">Every 100 issues + final<br>sync_cursors table</div></div>
</div>
</div>
<div class="phase">
<div class="phase-header">
<div class="phase-number" style="background:var(--green-dim);color:var(--green)">2</div>
<div class="phase-title">Discussion Sync <span class="phase-subtitle">Sequential, Watermark-Based</span></div>
</div>
<div class="flow-row">
<div class="node db" data-detail="issue-disc-query"><div class="node-title">Query Stale Issues</div><div class="node-desc">updated_at &gt; COALESCE(<br>discussions_synced_for_<br>updated_at, 0)</div></div>
<div class="arrow">&rarr;</div>
<div class="node api" data-detail="issue-disc-fetch"><div class="node-title">Paginate Discussions</div><div class="node-desc">Sequential per issue<br>paginate_issue_discussions()</div></div>
<div class="arrow">&rarr;</div>
<div class="node transform" data-detail="issue-disc-transform"><div class="node-title">Transform</div><div class="node-desc">transform_discussion()<br>transform_notes()</div></div>
<div class="arrow">&rarr;</div>
<div class="node db" data-detail="issue-disc-write"><div class="node-title">Write Discussion</div><div class="node-desc">store_payload &rarr; upsert<br>DELETE notes &rarr; INSERT notes</div></div>
</div>
<div class="branch-container">
<div class="branch-label success">&#10003; On Success (all pages fetched)</div>
<div class="branch-row">
<div class="node db" data-detail="issue-disc-stale"><div class="node-title">Remove Stale</div><div class="node-desc">DELETE discussions not<br>seen in this fetch</div></div>
<div class="arrow">&rarr;</div>
<div class="node db" data-detail="issue-disc-watermark"><div class="node-title">Advance Watermark</div><div class="node-desc">discussions_synced_for_<br>updated_at = updated_at</div></div>
</div>
<div class="branch-label error">&#10007; On Pagination Error</div>
<div class="branch-row">
<div class="node error" data-detail="issue-disc-fail"><div class="node-title">Skip Stale Removal</div><div class="node-desc">Watermark NOT advanced<br>Will retry next sync</div></div>
</div>
</div>
</div>
<div class="phase">
<div class="phase-header">
<div class="phase-number" style="background:rgba(247,120,186,0.15);color:var(--pink)">3</div>
<div class="phase-title">Resource Events <span class="phase-subtitle">Queue-Based, Concurrent Fetch</span></div>
</div>
<div class="flow-row">
<div class="node queue" data-detail="re-cleanup"><div class="node-title">Cleanup Obsolete</div><div class="node-desc">DELETE jobs where entity<br>watermark is current</div></div>
<div class="arrow">&rarr;</div>
<div class="node queue" data-detail="re-enqueue"><div class="node-title">Enqueue Jobs</div><div class="node-desc">INSERT for entities where<br>updated_at &gt; watermark</div></div>
<div class="arrow">&rarr;</div>
<div class="node queue" data-detail="re-claim"><div class="node-title">Claim Jobs</div><div class="node-desc">Atomic UPDATE...RETURNING<br>with lock acquisition</div></div>
<div class="arrow">&rarr;</div>
<div class="node api" data-detail="re-fetch"><div class="node-title">Fetch Events</div><div class="node-desc">3 concurrent: state +<br>label + milestone</div></div>
</div>
<div class="branch-container">
<div class="branch-label success">&#10003; On Success</div>
<div class="branch-row">
<div class="node db" data-detail="re-store"><div class="node-title">Store Events</div><div class="node-desc">Transaction: upsert all<br>3 event types</div></div>
<div class="arrow">&rarr;</div>
<div class="node db" data-detail="re-complete"><div class="node-title">Complete + Watermark</div><div class="node-desc">DELETE job row<br>Advance watermark</div></div>
</div>
<div class="branch-label error">&#10007; Permanent Error (404 / 403)</div>
<div class="branch-row">
<div class="node error" data-detail="re-permanent"><div class="node-title">Skip Permanently</div><div class="node-desc">complete_job + advance<br>watermark (coalesced)</div></div>
</div>
<div class="branch-label retry">&#8635; Transient Error</div>
<div class="branch-row">
<div class="node error" data-detail="re-transient"><div class="node-title">Backoff Retry</div><div class="node-desc">fail_job: 30s x 2^(n-1)<br>capped at 480s</div></div>
</div>
</div>
</div>
</div>
<!-- MERGE REQUESTS -->
<div class="flow-container" id="view-mrs">
<div class="legend">
<div class="legend-item"><div class="legend-color" style="background:var(--cyan)"></div>API Call</div>
<div class="legend-item"><div class="legend-color" style="background:var(--purple)"></div>Transform</div>
<div class="legend-item"><div class="legend-color" style="background:var(--green)"></div>Database</div>
<div class="legend-item"><div class="legend-color" style="background:var(--amber)"></div>Diff from Issues</div>
<div class="legend-item"><div class="legend-color" style="background:var(--red)"></div>Error Path</div>
<div class="legend-item"><div class="legend-color" style="background:var(--pink)"></div>Queue</div>
</div>
<div class="phase">
<div class="phase-header">
<div class="phase-number" style="background:var(--cyan-dim);color:var(--cyan)">1</div>
<div class="phase-title">Fetch MRs <span class="phase-subtitle">Page-Based Incremental Sync</span></div>
</div>
<div class="flow-row">
<div class="node api" data-detail="mr-api-call"><div class="node-title">GitLab API Call</div><div class="node-desc">fetch_merge_requests_page()<br>with cursor rewind</div><div class="diff-badge changed">Page-based, not streaming</div></div>
<div class="arrow">&rarr;</div>
<div class="node decision" data-detail="mr-cursor-filter"><div class="node-title">Cursor Filter</div><div class="node-desc">Same logic as issues:<br>timestamp + tie-breaker</div><div class="diff-badge same">Same as issues</div></div>
<div class="arrow">&rarr;</div>
<div class="node transform" data-detail="mr-transform"><div class="node-title">transform_merge_request()</div><div class="node-desc">Maps API shape &rarr;<br>local DB row</div></div>
<div class="arrow">&rarr;</div>
<div class="node db" data-detail="mr-transaction"><div class="node-title">Transaction</div><div class="node-desc">store &rarr; upsert &rarr; dirty &rarr;<br>labels + assignees + reviewers</div><div class="diff-badge changed">3 junction tables (not 2)</div></div>
</div>
<div class="arrow-down">&darr;</div>
<div class="flow-row">
<div class="node db" data-detail="mr-cursor-update"><div class="node-title">Update Cursor</div><div class="node-desc">Per page (not every 100)</div><div class="diff-badge changed">Per page boundary</div></div>
</div>
</div>
<div class="phase">
<div class="phase-header">
<div class="phase-number" style="background:var(--green-dim);color:var(--green)">2</div>
<div class="phase-title">MR Discussion Sync <span class="phase-subtitle">Parallel Prefetch + Serial Write</span></div>
</div>
<div class="info-box" style="margin-left:40px;margin-bottom:16px;">
<div class="info-box-title">Key Differences from Issue Discussions</div>
<ul>
<li><b>Parallel prefetch</b> &mdash; fetches all discussions for a batch concurrently via <code>join_all()</code></li>
<li><b>Upsert pattern</b> &mdash; notes use INSERT...ON CONFLICT (not delete-all + re-insert)</li>
<li><b>Sweep stale</b> &mdash; uses <code>last_seen_at</code> timestamp comparison (not set difference)</li>
<li><b>Sync health tracking</b> &mdash; records <code>discussions_sync_attempts</code> and <code>last_error</code></li>
</ul>
</div>
<div class="flow-row">
<div class="node db" data-detail="mr-disc-query"><div class="node-title">Query Stale MRs</div><div class="node-desc">updated_at &gt; COALESCE(<br>discussions_synced_for_<br>updated_at, 0)</div><div class="diff-badge same">Same watermark logic</div></div>
<div class="arrow">&rarr;</div>
<div class="node decision" data-detail="mr-disc-batch"><div class="node-title">Batch by Concurrency</div><div class="node-desc">dependent_concurrency<br>MRs per batch</div><div class="diff-badge changed">Batched processing</div></div>
</div>
<div class="arrow-down">&darr;</div>
<div class="flow-row">
<div class="node api" data-detail="mr-disc-prefetch"><div class="node-title">Parallel Prefetch</div><div class="node-desc">join_all() fetches all<br>discussions for batch</div><div class="diff-badge changed">Parallel (not sequential)</div></div>
<div class="arrow">&rarr;</div>
<div class="node transform" data-detail="mr-disc-transform"><div class="node-title">Transform In-Memory</div><div class="node-desc">transform_mr_discussion()<br>+ diff position notes</div></div>
<div class="arrow">&rarr;</div>
<div class="node db" data-detail="mr-disc-write"><div class="node-title">Serial Write</div><div class="node-desc">upsert discussion<br>upsert notes (ON CONFLICT)</div><div class="diff-badge changed">Upsert, not delete+insert</div></div>
</div>
<div class="branch-container">
<div class="branch-label success">&#10003; On Full Success</div>
<div class="branch-row">
<div class="node db" data-detail="mr-disc-sweep"><div class="node-title">Sweep Stale</div><div class="node-desc">DELETE WHERE last_seen_at<br>&lt; run_seen_at (disc + notes)</div><div class="diff-badge changed">last_seen_at sweep</div></div>
<div class="arrow">&rarr;</div>
<div class="node db" data-detail="mr-disc-watermark"><div class="node-title">Advance Watermark</div><div class="node-desc">discussions_synced_for_<br>updated_at = updated_at</div></div>
</div>
<div class="branch-label error">&#10007; On Failure</div>
<div class="branch-row">
<div class="node error" data-detail="mr-disc-fail"><div class="node-title">Record Sync Health</div><div class="node-desc">Watermark NOT advanced<br>Tracks attempts + last_error</div><div class="diff-badge changed">Health tracking</div></div>
</div>
</div>
</div>
<div class="phase">
<div class="phase-header">
<div class="phase-number" style="background:rgba(247,120,186,0.15);color:var(--pink)">3</div>
<div class="phase-title">Resource Events <span class="phase-subtitle">Same as Issues</span></div>
</div>
<div class="info-box" style="margin-left:40px">
<div class="info-box-title">Identical to Issue Resource Events</div>
<ul>
<li>Same queue-based approach: cleanup &rarr; enqueue &rarr; claim &rarr; fetch &rarr; store/fail</li>
<li>Same watermark column: <code>resource_events_synced_for_updated_at</code></li>
<li>Same error handling: 404/403 coalesced to empty, transient errors get backoff</li>
<li>entity_type = <code>"merge_request"</code> instead of <code>"issue"</code></li>
</ul>
</div>
</div>
</div>
<!-- DOCUMENTS -->
<div class="flow-container" id="view-docs">
<div class="legend">
<div class="legend-item"><div class="legend-color" style="background:var(--cyan)"></div>Trigger</div>
<div class="legend-item"><div class="legend-color" style="background:var(--purple)"></div>Extract</div>
<div class="legend-item"><div class="legend-color" style="background:var(--green)"></div>Database</div>
<div class="legend-item"><div class="legend-color" style="background:var(--amber)"></div>Decision</div>
<div class="legend-item"><div class="legend-color" style="background:var(--red)"></div>Error</div>
</div>
<div class="phase">
<div class="phase-header">
<div class="phase-number" style="background:var(--cyan-dim);color:var(--cyan)">1</div>
<div class="phase-title">Dirty Source Queue <span class="phase-subtitle">Populated During Ingestion</span></div>
</div>
<div class="flow-row">
<div class="node api" data-detail="doc-trigger"><div class="node-title">mark_dirty_tx()</div><div class="node-desc">Called during every issue/<br>MR/discussion upsert</div></div>
<div class="arrow">&rarr;</div>
<div class="node db" data-detail="doc-dirty-table"><div class="node-title">dirty_sources Table</div><div class="node-desc">INSERT (source_type, source_id)<br>ON CONFLICT reset backoff</div></div>
</div>
</div>
<div class="phase">
<div class="phase-header">
<div class="phase-number" style="background:var(--amber-dim);color:var(--amber)">2</div>
<div class="phase-title">Drain Loop <span class="phase-subtitle">Batch 500, Respects Backoff</span></div>
</div>
<div class="flow-row">
<div class="node db" data-detail="doc-drain"><div class="node-title">Get Dirty Sources</div><div class="node-desc">Batch 500, ORDER BY<br>attempt_count, queued_at</div></div>
<div class="arrow">&rarr;</div>
<div class="node decision" data-detail="doc-dispatch"><div class="node-title">Dispatch by Type</div><div class="node-desc">issue / mr / discussion<br>&rarr; extract function</div></div>
<div class="arrow">&rarr;</div>
<div class="node decision" data-detail="doc-deleted-check"><div class="node-title">Source Exists?</div><div class="node-desc">If deleted: remove doc row<br>(cascade cleans FTS + embeds)</div></div>
</div>
<div class="arrow-down">&darr;</div>
<div class="flow-row">
<div class="node transform" data-detail="doc-extract"><div class="node-title">Extract Content</div><div class="node-desc">Structured text:<br>header + metadata + body</div></div>
<div class="arrow">&rarr;</div>
<div class="node decision" data-detail="doc-triple-hash"><div class="node-title">Triple-Hash Check</div><div class="node-desc">content_hash + labels_hash<br>+ paths_hash all match?</div></div>
<div class="arrow">&rarr;</div>
<div class="node db" data-detail="doc-write"><div class="node-title">SAVEPOINT Write</div><div class="node-desc">Atomic: document row +<br>labels + paths</div></div>
</div>
<div class="branch-container">
<div class="branch-label success">&#10003; On Success</div>
<div class="branch-row">
<div class="node db" data-detail="doc-clear"><div class="node-title">clear_dirty()</div><div class="node-desc">Remove from dirty_sources</div></div>
</div>
<div class="branch-label error">&#10007; On Error</div>
<div class="branch-row">
<div class="node error" data-detail="doc-error"><div class="node-title">record_dirty_error()</div><div class="node-desc">Increment attempt_count<br>Exponential backoff</div></div>
</div>
<div class="branch-label" style="color:var(--purple)">&#8801; Triple-Hash Match (skip)</div>
<div class="branch-row">
<div class="node db" data-detail="doc-skip"><div class="node-title">Skip Write</div><div class="node-desc">All 3 hashes match &rarr;<br>no WAL churn, clear dirty</div></div>
</div>
</div>
</div>
<div class="info-box">
<div class="info-box-title">Full Mode (<code>--full</code>)</div>
<ul>
<li>Seeds <b>ALL</b> entities into <code>dirty_sources</code> via keyset pagination</li>
<li>Triple-hash optimization prevents redundant writes even in full mode</li>
<li>Runs FTS <code>OPTIMIZE</code> after drain completes</li>
</ul>
</div>
</div>
<!-- EMBEDDINGS -->
<div class="flow-container" id="view-embed">
<div class="legend">
<div class="legend-item"><div class="legend-color" style="background:var(--cyan)"></div>API (Ollama)</div>
<div class="legend-item"><div class="legend-color" style="background:var(--purple)"></div>Processing</div>
<div class="legend-item"><div class="legend-color" style="background:var(--green)"></div>Database</div>
<div class="legend-item"><div class="legend-color" style="background:var(--amber)"></div>Decision</div>
<div class="legend-item"><div class="legend-color" style="background:var(--red)"></div>Error</div>
</div>
<div class="phase">
<div class="phase-header">
<div class="phase-number" style="background:var(--amber-dim);color:var(--amber)">1</div>
<div class="phase-title">Change Detection <span class="phase-subtitle">Hash + Config Drift</span></div>
</div>
<div class="flow-row">
<div class="node decision" data-detail="embed-detect"><div class="node-title">find_pending_documents()</div><div class="node-desc">No metadata row? OR<br>document_hash mismatch? OR<br>config drift?</div></div>
<div class="arrow">&rarr;</div>
<div class="node db" data-detail="embed-paginate"><div class="node-title">Keyset Pagination</div><div class="node-desc">500 documents per page<br>ordered by doc ID</div></div>
</div>
</div>
<div class="phase">
<div class="phase-header">
<div class="phase-number" style="background:var(--purple-dim);color:var(--purple)">2</div>
<div class="phase-title">Chunking <span class="phase-subtitle">Split + Overflow Guard</span></div>
</div>
<div class="flow-row">
<div class="node transform" data-detail="embed-chunk"><div class="node-title">split_into_chunks()</div><div class="node-desc">Split by paragraph boundaries<br>with configurable overlap</div></div>
<div class="arrow">&rarr;</div>
<div class="node decision" data-detail="embed-overflow"><div class="node-title">Overflow Guard</div><div class="node-desc">Too many chunks?<br>Skip to prevent rowid collision</div></div>
<div class="arrow">&rarr;</div>
<div class="node transform" data-detail="embed-work"><div class="node-title">Build ChunkWork</div><div class="node-desc">Assign encoded chunk IDs<br>per document</div></div>
</div>
</div>
<div class="phase">
<div class="phase-header">
<div class="phase-number" style="background:var(--cyan-dim);color:var(--cyan)">3</div>
<div class="phase-title">Ollama Embedding <span class="phase-subtitle">Batched API Calls</span></div>
</div>
<div class="flow-row">
<div class="node api" data-detail="embed-batch"><div class="node-title">Batch Embed</div><div class="node-desc">32 chunks per Ollama<br>API call</div></div>
<div class="arrow">&rarr;</div>
<div class="node db" data-detail="embed-store"><div class="node-title">Store Vectors</div><div class="node-desc">sqlite-vec embeddings table<br>+ embedding_metadata</div></div>
</div>
<div class="branch-container">
<div class="branch-label success">&#10003; On Success</div>
<div class="branch-row">
<div class="node db" data-detail="embed-success"><div class="node-title">SAVEPOINT Commit</div><div class="node-desc">Atomic per page:<br>clear old + write new</div></div>
</div>
<div class="branch-label retry">&#8635; Context-Length Error</div>
<div class="branch-row">
<div class="node error" data-detail="embed-ctx-error"><div class="node-title">Retry Individually</div><div class="node-desc">Re-embed each chunk solo<br>to isolate oversized one</div></div>
</div>
<div class="branch-label error">&#10007; Other Error</div>
<div class="branch-row">
<div class="node error" data-detail="embed-other-error"><div class="node-title">Record Error</div><div class="node-desc">Store in embedding_metadata<br>for retry next run</div></div>
</div>
</div>
</div>
<div class="info-box">
<div class="info-box-title">Full Mode (<code>--full</code>)</div>
<ul>
<li>DELETEs all <code>embedding_metadata</code> and <code>embeddings</code> rows first</li>
<li>Every document re-processed from scratch</li>
</ul>
</div>
<div class="info-box">
<div class="info-box-title">Non-Fatal in Sync</div>
<ul>
<li>Stage 4 failures (Ollama down, model missing) are <b>graceful</b></li>
<li>Sync completes successfully; embeddings just won't be updated</li>
<li>Semantic search degrades to FTS-only mode</li>
</ul>
</div>
</div>
</div></div>
<!-- Watermark Panel -->
<div class="watermark-panel">
<div class="watermark-toggle" onclick="toggleWatermarks()">
<span class="chevron" id="wm-chevron">&#9650;</span>
Watermark &amp; Cursor Reference
</div>
<div class="watermark-content" id="wm-content">
<table class="wm-table">
<thead><tr><th>Table</th><th>Column(s)</th><th>Purpose</th></tr></thead>
<tbody>
<tr><td>sync_cursors</td><td>updated_at_cursor + tie_breaker_id</td><td>Incremental fetch: "last entity we saw" per project+type</td></tr>
<tr><td>issues</td><td>discussions_synced_for_updated_at</td><td>Per-issue discussion watermark</td></tr>
<tr><td>issues</td><td>resource_events_synced_for_updated_at</td><td>Per-issue resource event watermark</td></tr>
<tr><td>merge_requests</td><td>discussions_synced_for_updated_at</td><td>Per-MR discussion watermark</td></tr>
<tr><td>merge_requests</td><td>resource_events_synced_for_updated_at</td><td>Per-MR resource event watermark</td></tr>
<tr><td>dirty_sources</td><td>queued_at + next_attempt_at</td><td>Document regeneration queue with backoff</td></tr>
<tr><td>embedding_metadata</td><td>document_hash + chunk_max_bytes + model + dims</td><td>Embedding staleness detection</td></tr>
<tr><td>pending_dependent_fetches</td><td>locked_at + next_retry_at + attempts</td><td>Resource event job queue with backoff</td></tr>
</tbody>
</table>
</div>
</div>
</div>
<!-- Detail Panel -->
<div class="detail-panel" id="detail-panel">
<div class="detail-header">
<h2 id="detail-title">Node Details</h2>
<button class="detail-close" onclick="closeDetail()">&times;</button>
</div>
<div class="detail-body" id="detail-body"></div>
</div>
<script>
const viewTitles = {
overview: 'Full Sync Overview', issues: 'Issue Ingestion Flow',
mrs: 'Merge Request Ingestion Flow', docs: 'Document Generation Flow',
embed: 'Embedding Generation Flow',
};
const viewBadges = {
overview: '4 stages', issues: '3 phases', mrs: '3 phases',
docs: '2 phases', embed: '3 phases',
};
function switchView(view) {
document.querySelectorAll('.flow-container').forEach(function(el) { el.classList.remove('active'); });
document.getElementById('view-' + view).classList.add('active');
document.querySelectorAll('.nav-item').forEach(function(el) {
el.classList.toggle('active', el.dataset.view === view);
});
document.getElementById('view-title').textContent = viewTitles[view];
document.getElementById('view-badge').textContent = viewBadges[view];
closeDetail();
}
function toggleWatermarks() {
document.getElementById('wm-content').classList.toggle('open');
document.getElementById('wm-chevron').classList.toggle('open');
}
    const details = {
'issue-api-call': { title: 'GitLab API: Paginate Issues', type: 'api', file: 'src/ingestion/issues.rs:51-140', desc: 'Streams issues from the GitLab API using cursor-based incremental sync. The API is called with updated_after set to the last known cursor minus a configurable rewind window (to handle clock skew between GitLab and the local database).', sql: 'GET /api/v4/projects/{id}/issues\n ?updated_after={cursor - rewind_seconds}\n &order_by=updated_at&sort=asc\n &per_page=100' },
'issue-cursor-filter': { title: 'Cursor Filter (Dedup)', type: 'decision', file: 'src/ingestion/issues.rs:95-110', desc: 'Because of the cursor rewind, some issues will be re-fetched that we already have. The cursor filter skips these using a two-part comparison: primary on updated_at timestamp, with gitlab_id as a tie-breaker when timestamps are equal.', sql: '// Pseudocode:\nif issue.updated_at > cursor_ts:\n ACCEPT // newer than cursor\nelif issue.updated_at == cursor_ts\n AND issue.gitlab_id > tie_breaker_id:\n ACCEPT // same timestamp, higher ID\nelse:\n SKIP // already processed' },
'issue-transform': { title: 'Transform Issue', type: 'transform', file: 'src/gitlab/transformers/issue.rs', desc: 'Maps the GitLab API response shape to the local database row shape. Parses ISO 8601 timestamps to milliseconds-since-epoch, extracts label names, assignee usernames, milestone info, and due dates.' },
'issue-transaction': { title: 'Issue Write Transaction', type: 'db', file: 'src/ingestion/issues.rs:190-220', desc: 'All operations for a single issue are wrapped in one SQLite transaction for atomicity. If any step fails, the entire issue write is rolled back.', sql: 'BEGIN;\n-- 1. Store raw JSON payload (compressed, deduped)\nINSERT INTO payloads ...;\n-- 2. Upsert issue row\nINSERT INTO issues ... ON CONFLICT(gitlab_id)\n DO UPDATE SET ...;\n-- 3. Mark dirty for document regen\nINSERT INTO dirty_sources ...;\n-- 4. Relink labels\nDELETE FROM issue_labels WHERE issue_id = ?;\nINSERT INTO labels ... ON CONFLICT DO UPDATE;\nINSERT INTO issue_labels ...;\n-- 5. Relink assignees\nDELETE FROM issue_assignees WHERE issue_id = ?;\nINSERT INTO issue_assignees ...;\nCOMMIT;' },
'issue-cursor-update': { title: 'Update Sync Cursor', type: 'db', file: 'src/ingestion/issues.rs:130-140', desc: 'The sync cursor is updated every 100 issues (for crash recovery) and once at the end of the stream. If the process crashes mid-sync, it resumes from at most 100 issues back.', sql: 'INSERT INTO sync_cursors\n (project_id, resource_type,\n updated_at_cursor, tie_breaker_id)\nVALUES (?1, \'issues\', ?2, ?3)\nON CONFLICT(project_id, resource_type)\n DO UPDATE SET\n updated_at_cursor = ?2,\n tie_breaker_id = ?3;' },
'issue-disc-query': { title: 'Query Issues Needing Discussion Sync', type: 'db', file: 'src/ingestion/issues.rs:450-471', desc: 'Finds all issues in this project whose updated_at timestamp exceeds their per-row discussion watermark. Issues that have not changed since their last discussion sync are skipped entirely.', sql: 'SELECT id, iid, updated_at\nFROM issues\nWHERE project_id = ?1\n AND updated_at > COALESCE(\n discussions_synced_for_updated_at, 0\n );' },
'issue-disc-fetch': { title: 'Paginate Issue Discussions', type: 'api', file: 'src/ingestion/discussions.rs:73-205', desc: 'Discussions are fetched sequentially per issue (rusqlite Connection is not Send, so async parallelism is not possible here). Each issue\'s discussions are streamed page by page from the GitLab API.', sql: 'GET /api/v4/projects/{id}/issues/{iid}\n /discussions?per_page=100' },
'issue-disc-transform': { title: 'Transform Discussion + Notes', type: 'transform', file: 'src/gitlab/transformers/discussion.rs', desc: 'Transforms the raw GitLab discussion payload into normalized rows. Sets NoteableRef::Issue. Computes resolvable/resolved status, first_note_at/last_note_at timestamps, and per-note position indices.' },
'issue-disc-write': { title: 'Write Discussion (Full Refresh)', type: 'db', file: 'src/ingestion/discussions.rs:140-180', desc: 'Issue discussions use a full-refresh pattern: all existing notes for a discussion are deleted and re-inserted. This is simpler than upsert but means partial failures lose the previous state.', sql: 'BEGIN;\nINSERT INTO payloads ...;\nINSERT INTO discussions ... ON CONFLICT DO UPDATE;\nINSERT INTO dirty_sources ...;\n-- Full refresh: delete all then re-insert\nDELETE FROM notes WHERE discussion_id = ?;\nINSERT INTO notes VALUES (...);\nCOMMIT;' },
'issue-disc-stale': { title: 'Remove Stale Discussions', type: 'db', file: 'src/ingestion/discussions.rs:185-195', desc: 'After successfully fetching ALL discussion pages for an issue, any discussions in the DB that were not seen in this fetch are deleted. Uses a temp table for >500 IDs to avoid SQLite\'s 999-variable limit.', sql: '-- For small sets (<= 500):\nDELETE FROM discussions\nWHERE issue_id = ?\n AND gitlab_id NOT IN (...);\n\n-- For large sets (> 500):\nCREATE TEMP TABLE seen_ids(id TEXT);\nINSERT INTO seen_ids ...;\nDELETE FROM discussions\nWHERE issue_id = ?\n AND gitlab_id NOT IN\n (SELECT id FROM seen_ids);\nDROP TABLE seen_ids;' },
'issue-disc-watermark': { title: 'Advance Discussion Watermark', type: 'db', file: 'src/ingestion/discussions.rs:198', desc: 'Sets the per-issue watermark to the issue\'s current updated_at, signaling that discussions are now synced for this version of the issue.', sql: 'UPDATE issues\nSET discussions_synced_for_updated_at\n = updated_at\nWHERE id = ?;' },
'issue-disc-fail': { title: 'Pagination Error Handling', type: 'error', file: 'src/ingestion/discussions.rs:182', desc: 'If pagination fails mid-stream, stale discussion removal is skipped (we don\'t know the full set) and the watermark is NOT advanced. The issue will be retried on the next sync run.' },
're-cleanup': { title: 'Cleanup Obsolete Jobs', type: 'queue', file: 'src/ingestion/orchestrator.rs:490-520', desc: 'Before enqueuing new jobs, delete any existing jobs for entities whose watermark is already current. These are leftover from a previous run.', sql: 'DELETE FROM pending_dependent_fetches\nWHERE project_id = ?\n AND job_type = \'resource_events\'\n AND entity_local_id IN (\n SELECT id FROM issues\n WHERE project_id = ?\n AND updated_at <= COALESCE(\n resource_events_synced_for_updated_at, 0\n )\n );' },
're-enqueue': { title: 'Enqueue Resource Event Jobs', type: 'queue', file: 'src/ingestion/orchestrator.rs:525-555', desc: 'For each entity whose updated_at exceeds its resource event watermark, insert a job into the queue. Uses INSERT OR IGNORE for idempotency.', sql: 'INSERT OR IGNORE INTO pending_dependent_fetches\n (project_id, entity_type, entity_iid,\n entity_local_id, job_type, enqueued_at)\nSELECT project_id, \'issue\', iid, id,\n \'resource_events\', ?now\nFROM issues\nWHERE project_id = ?\n AND updated_at > COALESCE(\n resource_events_synced_for_updated_at, 0\n );' },
're-claim': { title: 'Claim Jobs (Atomic Lock)', type: 'queue', file: 'src/core/dependent_queue.rs', desc: 'Atomically claims a batch of unlocked jobs whose backoff period has elapsed. Uses UPDATE...RETURNING for lock acquisition in a single statement.', sql: 'UPDATE pending_dependent_fetches\nSET locked_at = ?now\nWHERE rowid IN (\n SELECT rowid\n FROM pending_dependent_fetches\n WHERE project_id = ?\n AND job_type = \'resource_events\'\n AND locked_at IS NULL\n AND (next_retry_at IS NULL\n OR next_retry_at <= ?now)\n ORDER BY enqueued_at ASC\n LIMIT ?batch_size\n)\nRETURNING *;' },
're-fetch': { title: 'Fetch 3 Event Types Concurrently', type: 'api', file: 'src/gitlab/client.rs:732-771', desc: 'Uses tokio::join! (not try_join!) to fetch state, label, and milestone events concurrently. Permanent errors (404, 403) are coalesced to empty vecs via coalesce_inaccessible().', sql: 'tokio::join!(\n fetch_issue_state_events(proj, iid),\n fetch_issue_label_events(proj, iid),\n fetch_issue_milestone_events(proj, iid),\n)\n// Each: coalesce_inaccessible()\n// 404/403 -> Ok(vec![])\n// Other errors -> propagated' },
're-store': { title: 'Store Resource Events', type: 'db', file: 'src/ingestion/orchestrator.rs:620-640', desc: 'All three event types are upserted in a single transaction.', sql: 'BEGIN;\nINSERT INTO resource_state_events ...\n ON CONFLICT DO UPDATE;\nINSERT INTO resource_label_events ...\n ON CONFLICT DO UPDATE;\nINSERT INTO resource_milestone_events ...\n ON CONFLICT DO UPDATE;\nCOMMIT;' },
're-complete': { title: 'Complete Job + Advance Watermark', type: 'db', file: 'src/ingestion/orchestrator.rs:645-660', desc: 'After successful storage, the job row is deleted and the entity\'s watermark is advanced.', sql: 'DELETE FROM pending_dependent_fetches\n WHERE rowid = ?;\n\nUPDATE issues\nSET resource_events_synced_for_updated_at\n = updated_at\nWHERE id = ?;' },
    're-permanent': { title: 'Permanent Error: Skip Entity', type: 'error', file: 'src/ingestion/orchestrator.rs:665-680', desc: '404 (resource not found) and 403 (insufficient permissions) are permanent errors. The job is completed and the watermark advanced, so this entity is skipped until it is next updated on GitLab.' },
're-transient': { title: 'Transient Error: Exponential Backoff', type: 'error', file: 'src/core/dependent_queue.rs', desc: 'Network errors, 500s, rate limits get exponential backoff. Formula: 30s * 2^(attempts-1), capped at 480s (8 minutes).', sql: 'UPDATE pending_dependent_fetches\nSET locked_at = NULL,\n attempts = attempts + 1,\n next_retry_at = ?now\n + 30000 * pow(2, attempts),\n -- capped at 480000ms (8 min)\n last_error = ?error_msg\nWHERE rowid = ?;' },
    'mr-api-call': { title: 'GitLab API: Fetch MR Pages', type: 'api', file: 'src/ingestion/merge_requests.rs:51-151', desc: 'Unlike issues, which stream, MRs use explicit page-based pagination via fetch_merge_requests_page(). Each page returns items plus a next_page indicator.', sql: 'GET /api/v4/projects/{id}/merge_requests\n ?updated_after={cursor - rewind}\n &order_by=updated_at&sort=asc\n &per_page=100&page={n}' },
'mr-cursor-filter': { title: 'Cursor Filter', type: 'decision', file: 'src/ingestion/merge_requests.rs:90-105', desc: 'Identical logic to issues: timestamp comparison with gitlab_id tie-breaker.' },
'mr-transform': { title: 'Transform Merge Request', type: 'transform', file: 'src/gitlab/transformers/mr.rs', desc: 'Maps GitLab MR response to local row. Handles draft detection (prefers draft field, falls back to work_in_progress), detailed_merge_status, merge_user resolution, and reviewer extraction.' },
'mr-transaction': { title: 'MR Write Transaction', type: 'db', file: 'src/ingestion/merge_requests.rs:170-210', desc: 'Same pattern as issues but with THREE junction tables: labels, assignees, AND reviewers.', sql: 'BEGIN;\nINSERT INTO payloads ...;\nINSERT INTO merge_requests ...\n ON CONFLICT DO UPDATE;\nINSERT INTO dirty_sources ...;\n-- 3 junction tables:\nDELETE FROM mr_labels WHERE mr_id = ?;\nINSERT INTO mr_labels ...;\nDELETE FROM mr_assignees WHERE mr_id = ?;\nINSERT INTO mr_assignees ...;\nDELETE FROM mr_reviewers WHERE mr_id = ?;\nINSERT INTO mr_reviewers ...;\nCOMMIT;' },
    'mr-cursor-update': { title: 'Update Cursor Per Page', type: 'db', file: 'src/ingestion/merge_requests.rs:140-150', desc: 'Unlike issues (cursor updated every 100 items), the MR cursor is updated at each page boundary, giving finer-grained crash recovery.' },
'mr-disc-query': { title: 'Query MRs Needing Discussion Sync', type: 'db', file: 'src/ingestion/merge_requests.rs:430-451', desc: 'Same watermark pattern as issues. Runs AFTER MR ingestion to avoid memory growth.', sql: 'SELECT id, iid, updated_at\nFROM merge_requests\nWHERE project_id = ?1\n AND updated_at > COALESCE(\n discussions_synced_for_updated_at, 0\n );' },
'mr-disc-batch': { title: 'Batch by Concurrency', type: 'decision', file: 'src/ingestion/orchestrator.rs:420-465', desc: 'MRs are processed in batches sized by dependent_concurrency. Each batch first prefetches all discussions in parallel, then writes serially.' },
'mr-disc-prefetch': { title: 'Parallel Prefetch', type: 'api', file: 'src/ingestion/mr_discussions.rs:66-120', desc: 'All MRs in the batch have their discussions fetched concurrently via join_all(). Each MR\'s discussions are fetched in one call, transformed in memory, and returned as PrefetchedMrDiscussions.', sql: '// For each MR in batch, concurrently:\nGET /api/v4/projects/{id}/merge_requests\n /{iid}/discussions?per_page=100\n\n// All fetched + transformed in memory\n// before any DB writes happen' },
'mr-disc-transform': { title: 'Transform MR Discussions', type: 'transform', file: 'src/ingestion/mr_discussions.rs:125-160', desc: 'Uses transform_mr_discussion() which additionally handles DiffNote positions (file paths, line ranges, SHA triplets).' },
'mr-disc-write': { title: 'Serial Write (Upsert Pattern)', type: 'db', file: 'src/ingestion/mr_discussions.rs:165-220', desc: 'Unlike issue discussions (delete-all + re-insert), MR discussions use INSERT...ON CONFLICT DO UPDATE for both discussions and notes. Safer for partial failures.', sql: 'BEGIN;\nINSERT INTO payloads ...;\nINSERT INTO discussions ...\n ON CONFLICT DO UPDATE\n SET ..., last_seen_at = ?run_ts;\nINSERT INTO dirty_sources ...;\n-- Upsert notes (not delete+insert):\nINSERT INTO notes ...\n ON CONFLICT DO UPDATE\n SET ..., last_seen_at = ?run_ts;\nCOMMIT;' },
'mr-disc-sweep': { title: 'Sweep Stale (last_seen_at)', type: 'db', file: 'src/ingestion/mr_discussions.rs:225-245', desc: 'Staleness detected via last_seen_at timestamps. Both discussions AND notes are swept independently.', sql: '-- Sweep stale discussions:\nDELETE FROM discussions\nWHERE merge_request_id = ?\n AND last_seen_at < ?run_seen_at;\n\n-- Sweep stale notes:\nDELETE FROM notes\nWHERE discussion_id IN (\n SELECT id FROM discussions\n WHERE merge_request_id = ?\n) AND last_seen_at < ?run_seen_at;' },
'mr-disc-watermark': { title: 'Advance MR Discussion Watermark', type: 'db', file: 'src/ingestion/mr_discussions.rs:248', desc: 'Same as issues: stamps the per-MR watermark.', sql: 'UPDATE merge_requests\nSET discussions_synced_for_updated_at\n = updated_at\nWHERE id = ?;' },
'mr-disc-fail': { title: 'Failure: Sync Health Tracking', type: 'error', file: 'src/ingestion/mr_discussions.rs:252-260', desc: 'Unlike issues, MR discussion failures are tracked: discussions_sync_attempts is incremented and discussions_sync_last_error is recorded. Watermark is NOT advanced.' },
'doc-trigger': { title: 'mark_dirty_tx()', type: 'api', file: 'src/ingestion/dirty_tracker.rs', desc: 'Called during every upsert in ingestion. Inserts into dirty_sources, or on conflict resets backoff. This bridges ingestion (stages 1-2) and document generation (stage 3).', sql: 'INSERT INTO dirty_sources\n (source_type, source_id, queued_at)\nVALUES (?1, ?2, ?now)\nON CONFLICT(source_type, source_id)\n DO UPDATE SET\n queued_at = ?now,\n attempt_count = 0,\n next_attempt_at = NULL,\n last_error = NULL;' },
'doc-dirty-table': { title: 'dirty_sources Table', type: 'db', file: 'src/ingestion/dirty_tracker.rs', desc: 'Persistent queue of entities needing document regeneration. Supports exponential backoff for failed extractions.' },
'doc-drain': { title: 'Get Dirty Sources (Batched)', type: 'db', file: 'src/documents/regenerator.rs:35-45', desc: 'Fetches up to 500 dirty entries per batch, prioritizing fewer attempts. Respects exponential backoff.', sql: 'SELECT source_type, source_id\nFROM dirty_sources\nWHERE next_attempt_at IS NULL\n OR next_attempt_at <= ?now\nORDER BY attempt_count ASC,\n queued_at ASC\nLIMIT 500;' },
'doc-dispatch': { title: 'Dispatch by Source Type', type: 'decision', file: 'src/documents/extractor.rs', desc: 'Routes to the appropriate extraction function: "issue" -> extract_issue_document(), "merge_request" -> extract_mr_document(), "discussion" -> extract_discussion_document().' },
'doc-deleted-check': { title: 'Source Exists Check', type: 'decision', file: 'src/documents/regenerator.rs:48-55', desc: 'If the source entity was deleted, the extractor returns None. The regenerator deletes the document row. FK cascades clean up FTS and embeddings.' },
'doc-extract': { title: 'Extract Structured Content', type: 'transform', file: 'src/documents/extractor.rs', desc: 'Builds searchable text:\n[[Issue]] #42: Title\nProject: group/repo\nURL: ...\nLabels: [bug, urgent]\nState: opened\n\n--- Description ---\n...\n\nDiscussions inherit parent labels and extract DiffNote file paths.' },
'doc-triple-hash': { title: 'Triple-Hash Write Optimization', type: 'decision', file: 'src/documents/regenerator.rs:55-62', desc: 'Checks content_hash + labels_hash + paths_hash against existing document. If ALL three match, write is completely skipped. Critical for --full mode performance.' },
'doc-write': { title: 'SAVEPOINT Atomic Write', type: 'db', file: 'src/documents/regenerator.rs:58-65', desc: 'Document, labels, and paths written inside a SAVEPOINT for atomicity.', sql: 'SAVEPOINT doc_write;\nINSERT INTO documents ...\n ON CONFLICT DO UPDATE SET\n content = ?, content_hash = ?,\n labels_hash = ?, paths_hash = ?;\nDELETE FROM document_labels\n WHERE doc_id = ?;\nINSERT INTO document_labels ...;\nDELETE FROM document_paths\n WHERE doc_id = ?;\nINSERT INTO document_paths ...;\nRELEASE doc_write;' },
'doc-clear': { title: 'Clear Dirty Entry', type: 'db', file: 'src/ingestion/dirty_tracker.rs', desc: 'On success, the dirty_sources row is deleted.', sql: 'DELETE FROM dirty_sources\nWHERE source_type = ?\n AND source_id = ?;' },
'doc-error': { title: 'Record Error + Backoff', type: 'error', file: 'src/ingestion/dirty_tracker.rs', desc: 'Increments attempt_count, sets next_attempt_at with exponential backoff. Entry stays for retry.', sql: 'UPDATE dirty_sources\nSET attempt_count = attempt_count + 1,\n next_attempt_at = ?now\n + compute_backoff(attempt_count),\n last_error = ?error_msg\nWHERE source_type = ?\n AND source_id = ?;' },
'doc-skip': { title: 'Skip Write (Hash Match)', type: 'db', file: 'src/documents/regenerator.rs:57', desc: 'When all three hashes match, the document has not actually changed. Common when updated_at changes but content/labels/paths remain the same. Dirty entry is cleared without writes.' },
'embed-detect': { title: 'Change Detection', type: 'decision', file: 'src/embedding/change_detector.rs', desc: 'Document needs re-embedding if: (1) No embedding_metadata row, (2) document_hash mismatch, (3) Config drift in chunk_max_bytes, model, or dims.', sql: 'SELECT d.id, d.content, d.content_hash\nFROM documents d\nLEFT JOIN embedding_metadata em\n ON em.document_id = d.id\nWHERE em.document_id IS NULL\n OR em.document_hash != d.content_hash\n OR em.chunk_max_bytes != ?config\n OR em.model != ?model\n OR em.dims != ?dims;' },
'embed-paginate': { title: 'Keyset Pagination', type: 'db', file: 'src/embedding/pipeline.rs:80-100', desc: '500 documents per page using keyset pagination. Each page wrapped in a SAVEPOINT.' },
'embed-chunk': { title: 'Split Into Chunks', type: 'transform', file: 'src/embedding/chunking.rs', desc: 'Splits content at paragraph boundaries with configurable max size and overlap.' },
'embed-overflow': { title: 'Overflow Guard', type: 'decision', file: 'src/embedding/pipeline.rs:110-120', desc: 'If a document produces too many chunks, it is skipped to prevent rowid collisions in the encoded chunk ID scheme.' },
'embed-work': { title: 'Build ChunkWork Items', type: 'transform', file: 'src/embedding/pipeline.rs:125-140', desc: 'Each chunk gets an encoded ID (document_id * 1000000 + chunk_index) for the sqlite-vec primary key.' },
'embed-batch': { title: 'Batch Embed via Ollama', type: 'api', file: 'src/embedding/pipeline.rs:150-200', desc: 'Sends 32 chunks per Ollama API call. Model default: nomic-embed-text.', sql: 'POST http://localhost:11434/api/embed\n{\n "model": "nomic-embed-text",\n "input": ["chunk1...", "chunk2...", ...]\n}' },
'embed-store': { title: 'Store Vectors', type: 'db', file: 'src/embedding/pipeline.rs:205-230', desc: 'Vectors stored in sqlite-vec virtual table. Metadata in embedding_metadata. Old embeddings cleared on first successful chunk.', sql: '-- Clear old embeddings:\nDELETE FROM embeddings\n WHERE rowid / 1000000 = ?doc_id;\n\n-- Insert new vector:\nINSERT INTO embeddings(rowid, embedding)\nVALUES (?chunk_id, ?vector_blob);\n\n-- Update metadata:\nINSERT INTO embedding_metadata ...\n ON CONFLICT DO UPDATE SET\n document_hash = ?,\n chunk_max_bytes = ?,\n model = ?, dims = ?;' },
'embed-success': { title: 'SAVEPOINT Commit', type: 'db', file: 'src/embedding/pipeline.rs:240-250', desc: 'Each page of 500 documents wrapped in a SAVEPOINT. Completed pages survive crashes.' },
'embed-ctx-error': { title: 'Context-Length Retry', type: 'error', file: 'src/embedding/pipeline.rs:260-280', desc: 'If Ollama returns a context-length error for a batch, each chunk is retried individually to isolate the oversized one.' },
'embed-other-error': { title: 'Record Error for Retry', type: 'error', file: 'src/embedding/pipeline.rs:285-295', desc: 'Network/model errors recorded in embedding_metadata. Document detected as pending again on next run.' },
};
// Escape a string for safe use in HTML by round-tripping it through a
// detached element: reading innerHTML serializes the text node with
// entities escaped (textContent would return the input unchanged).
function escapeHtml(str) {
var div = document.createElement('div');
div.appendChild(document.createTextNode(str));
return div.innerHTML;
}
function buildDetailContent(d) {
var container = document.createDocumentFragment();
// Tags section
var tagSection = document.createElement('div');
tagSection.className = 'detail-section';
var typeTag = document.createElement('span');
typeTag.className = 'detail-tag type-' + d.type;
typeTag.textContent = d.type.toUpperCase();
tagSection.appendChild(typeTag);
if (d.file) {
var fileTag = document.createElement('span');
fileTag.className = 'detail-tag file';
fileTag.textContent = d.file;
tagSection.appendChild(fileTag);
}
container.appendChild(tagSection);
// Description
var descSection = document.createElement('div');
descSection.className = 'detail-section';
var descH3 = document.createElement('h3');
descH3.textContent = 'Description';
descSection.appendChild(descH3);
var descP = document.createElement('p');
descP.textContent = d.desc;
descSection.appendChild(descP);
container.appendChild(descSection);
// SQL
if (d.sql) {
var sqlSection = document.createElement('div');
sqlSection.className = 'detail-section';
var sqlH3 = document.createElement('h3');
sqlH3.textContent = 'Key Query / Code';
sqlSection.appendChild(sqlH3);
var sqlBlock = document.createElement('div');
sqlBlock.className = 'sql-block';
sqlBlock.textContent = d.sql;
sqlSection.appendChild(sqlBlock);
container.appendChild(sqlSection);
}
return container;
}
function showDetail(key) {
var d = details[key];
if (!d) return;
var panel = document.getElementById('detail-panel');
document.getElementById('detail-title').textContent = d.title;
var body = document.getElementById('detail-body');
while (body.firstChild) body.removeChild(body.firstChild);
body.appendChild(buildDetailContent(d));
document.querySelectorAll('.node.selected').forEach(function(n) { n.classList.remove('selected'); });
var clicked = document.querySelector('[data-detail="' + key + '"]');
if (clicked) clicked.classList.add('selected');
panel.classList.add('open');
}
function closeDetail() {
document.getElementById('detail-panel').classList.remove('open');
document.querySelectorAll('.node.selected').forEach(function(n) { n.classList.remove('selected'); });
}
document.addEventListener('click', function(e) {
var node = e.target.closest('.node[data-detail]');
if (node) { showDetail(node.dataset.detail); return; }
if (!e.target.closest('.detail-panel') && !e.target.closest('.node')) closeDetail();
});
document.addEventListener('keydown', function(e) { if (e.key === 'Escape') closeDetail(); });
</script>
</body>
</html>