1. **Isolate scheduled behavior from manual `sync`** Reasoning: Your current plan injects backoff into `handle_sync_cmd`, which affects all `lore sync` calls (including manual recovery runs). Scheduled behavior should be isolated so humans aren’t unexpectedly blocked by service backoff. ```diff @@ Context -`lore sync` runs a 4-stage pipeline (issues, MRs, docs, embeddings) that takes 2-4 minutes. +`lore sync` remains the manual/operator command. +`lore service run` (hidden/internal) is the scheduled execution entrypoint. @@ Commands & User Journeys +### `lore service run` (hidden/internal) +**What it does:** Executes one scheduled sync attempt with service-only policy: +- applies service backoff policy +- records service run state +- invokes sync pipeline with configured profile +- updates retry state on success/failure + +**Invocation:** scheduler always runs: +`lore --robot service run --reason timer` @@ Backoff Integration into `handle_sync_cmd` -Insert **after** config load but **before** the dry_run check: +Do not add backoff checks to `handle_sync_cmd`. +Backoff logic lives only in `handle_service_run`. ``` 2. **Use DB as source-of-truth for service state (not a standalone JSON status file)** Reasoning: You already have `sync_runs` in SQLite. A separate JSON status file creates split-brain and race/corruption risk. Keep JSON as optional cache/export only. ```diff @@ Status File -Location: `{get_data_dir()}/sync-status.json` +Primary state location: SQLite (`service_state` table) + existing `sync_runs`. +Optional mirror file: `{get_data_dir()}/sync-status.json` (best-effort export only). @@ File-by-File Implementation Details -### `src/core/sync_status.rs` (NEW) +### `migrations/015_service_state.sql` (NEW) +CREATE TABLE service_state ( + id INTEGER PRIMARY KEY CHECK (id = 1), + installed INTEGER NOT NULL DEFAULT 0, + platform TEXT, + interval_seconds INTEGER, + profile TEXT NOT NULL DEFAULT 'balanced', + consecutive_failures INTEGER NOT NULL DEFAULT 0, + next_retry_at_ms INTEGER, + last_error_code TEXT, + last_error_message TEXT, + updated_at_ms INTEGER NOT NULL +); + +### `src/core/service_state.rs` (NEW) +- read/write state row +- derive backoff/next_retry +- join with latest `sync_runs` for status output ``` 3. **Backoff policy should be configurable, jittered, and error-aware** Reasoning: Fixed hardcoded backoff (`base=1800`) is wrong when user sets another interval. Also permanent failures (bad token/config) should not burn retries forever; they should enter paused/error state. ```diff @@ Backoff Logic -// Exponential: base * 2^failures, capped at 4 hours +// Exponential with jitter: base * 2^(failures-1), capped, ±20% jitter +// Applies only to transient errors. +// Permanent errors set `paused_reason` and stop retries until user action. @@ CLI Definition Changes +ServiceCommand::Resume, // clear paused state / failures +ServiceCommand::Run, // hidden @@ Error Types +ServicePaused, // scheduler paused due to permanent error +ServiceCommandFailed, // OS command failure with stderr context ``` 4. **Add a pipeline-level single-flight lock** Reasoning: Current locking is in ingest stages; there’s still overlap risk across full sync pipelines (docs/embed can overlap with another run). Add a top-level lock for scheduled/manual sync pipeline execution. ```diff @@ Architecture +Add `sync_pipeline` lock at top-level sync execution. +Keep existing ingest lock (`sync`) for ingest internals. @@ Backoff Integration into `handle_sync_cmd` +Before starting sync pipeline, acquire `AppLock` with: +name = "sync_pipeline" +stale_lock_minutes = config.sync.stale_lock_minutes +heartbeat_interval_seconds = config.sync.heartbeat_interval_seconds ``` 5. **Don’t embed token in service files by default** Reasoning: Embedding PAT into unit/plist is a high-risk secret leak path. Make secure storage explicit and default-safe. ```diff @@ `lore service install [--interval 30m]` +`lore service install [--interval 30m] [--token-source env-file|embedded]` +Default: `env-file` (0600 perms, user-owned) +`embedded` allowed only with explicit opt-in and warning @@ Robot output - "token_embedded": true + "token_source": "env_file" @@ Human output - Note: Your GITLAB_TOKEN is embedded in the service file. + Note: Token is stored in a user-private env file (0600). ``` 6. **Introduce a command-runner abstraction with timeout + stderr capture** Reasoning: `launchctl/systemctl/schtasks` calls are failure-prone; you need consistent error mapping and deterministic tests. ```diff @@ Platform Backends -exports free functions that dispatch via `#[cfg(target_os)]` +exports backend + shared `CommandRunner`: +- run(cmd, args, timeout) +- capture stdout/stderr/exit code +- map failure to `ServiceCommandFailed { cmd, exit_code, stderr }` ``` 7. **Persist install manifest to avoid brittle file parsing** Reasoning: Parsing timer/plist for interval/state is fragile and platform-format dependent. Persist a manifest with checksums and expected artifacts. ```diff @@ Platform Backends -Same pattern for ... `get_interval_seconds()` +Add manifest: `{data_dir}/service-manifest.json` +Stores platform, interval, profile, generated files, and command. +`service status` reads manifest first, then verifies platform state. @@ Acceptance criteria +Install is idempotent: +- if manifest+files already match, report `no_change: true` +- if drift detected, reconcile and rewrite ``` 8. **Make schedule profile explicit (`fast|balanced|full`)** Reasoning: This makes the feature more useful and performance-tunable without requiring users to understand internal flags. ```diff @@ `lore service install [--interval 30m]` +`lore service install [--interval 30m] [--profile fast|balanced|full]` + +Profiles: +- fast: `sync --no-docs --no-embed` +- balanced (default): `sync --no-embed` +- full: `sync` ``` 9. **Upgrade `service status` to include scheduler health + recent run summary** Reasoning: Single last-sync snapshot is too shallow. Include recent attempts and whether scheduler is paused/backing off/running. ```diff @@ `lore service status` -What it does: Shows whether the service is installed, its configuration, last sync result, and next scheduled run. +What it does: Shows install state, scheduler state (running/backoff/paused), recent runs, and next run estimate. @@ Robot output - "last_sync": { ... }, - "backoff": null + "scheduler_state": "running|backoff|paused|idle", + "last_sync": { ... }, + "recent_runs": [{"run_id":"...","status":"...","started_at_iso":"..."}], + "backoff": null, + "paused_reason": null ``` 10. **Strengthen tests around determinism and cross-platform generation** Reasoning: Time-based backoff and shell quoting are classic flaky points. Add fake clock + fake command runner for deterministic tests. ```diff @@ Testing Strategy +Add deterministic test seams: +- `Clock` trait for backoff/now calculations +- `CommandRunner` trait for backend command execution + +Add tests: +- transient vs permanent error classification +- backoff schedule with jitter bounds +- manifest drift reconciliation +- quoting/escaping for paths with spaces and special chars +- `service run` does not modify manual `sync` behavior ``` If you want, I can rewrite your full plan as a single clean revised document with these changes already integrated (instead of patch fragments).