Below are the highest-impact revisions I’d make, ordered by severity/ROI. These focus on correctness first, then security, then operability and UX. 1. **Fix multi-install ambiguity (`service_id` exists, but commands can’t target one explicitly)** Analysis: The plan introduces `service-manifest-{service_id}.json`, but `status/uninstall/resume/logs` have no selector. In a multi-workspace or multi-name install scenario, behavior becomes ambiguous and error-prone. Add explicit targeting plus discovery. ```diff @@ ## Commands & User Journeys +### `lore service list` +Lists installed services discovered from `{data_dir}/service-manifest-*.json`. +Robot output includes `service_id`, `platform`, `interval_seconds`, `profile`, `installed_at_iso`. @@ ### `lore service uninstall` -### `lore service uninstall` +### `lore service uninstall [--service ] [--all]` @@ -2. CLI reads install manifest to find `service_id` +2. CLI resolves target service via `--service` or current-project-derived default. +3. If multiple candidates and no selector, return actionable error. @@ ### `lore service status` -### `lore service status` +### `lore service status [--service ]` ``` 2. **Make status state service-scoped (not global)** Analysis: A single `sync-status.json` for all services causes cross-service contamination (pause/backoff/outcome from one profile affecting another). Keep lock global, but state per service. ```diff @@ ## Status File -### Location -`{get_data_dir()}/sync-status.json` +### Location +`{get_data_dir()}/sync-status-{service_id}.json` @@ ## Paths Module Additions -pub fn get_service_status_path() -> PathBuf { - get_data_dir().join("sync-status.json") +pub fn get_service_status_path(service_id: &str) -> PathBuf { + get_data_dir().join(format!("sync-status-{service_id}.json")) } @@ -Note: `sync-status.json` is NOT scoped by `service_id` +Note: status is scoped by `service_id`; lock remains global (`sync_pipeline`) to prevent overlapping writes. ``` 3. **Stop classifying permanence via string matching** Analysis: Matching `"401 Unauthorized"` in strings is brittle and will misclassify edge cases. Carry machine codes through stage results and classify by `ErrorCode` only. ```diff @@ pub struct StageResult { - pub error: Option, + pub error: Option, + pub error_code: Option, // e.g., AUTH_FAILED, NETWORK_ERROR } @@ Error classification helpers -fn is_permanent_error_message(msg: Option<&str>) -> bool { ...string contains... } +fn is_permanent_error_code(code: Option<&str>) -> bool { + matches!(code, Some("TOKEN_NOT_SET" | "AUTH_FAILED" | "CONFIG_NOT_FOUND" | "CONFIG_INVALID" | "MIGRATION_FAILED")) +} ``` 4. **Install should be transactional (manifest written last)** Analysis: Current order writes manifest before scheduler enable. If enable fails, you persist a false “installed” state. Use two-phase install with rollback. ```diff @@ ### `lore service install` User journey -9. CLI writes install manifest ... -10. CLI runs the platform-specific enable command +9. CLI runs the platform-specific enable command +10. On success, CLI writes install manifest atomically +11. On failure, CLI removes generated files and returns `ServiceCommandFailed` ``` 5. **Fix launchd token security gap (env-file currently still embeds token)** Analysis: Current “env-file” on macOS still writes token into plist, defeating the main security goal. Generate a private wrapper script that reads env file at runtime and execs `lore`. ```diff @@ ### macOS: launchd -ProgramArguments - - {binary_path} - --robot - service - run - +ProgramArguments + + {data_dir}/service-run-{service_id}.sh + @@ -`env-file`: ... token value must still appear in plist ... +`env-file`: token never appears in plist; wrapper loads `{data_dir}/service-env-{service_id}` at runtime. ``` 6. **Improve backoff math and add half-open circuit recovery** Analysis: Current jitter + min clamp makes first retry deterministic and can over-pause. Also circuit-breaker requires manual resume forever. Add cooldown + half-open probe to self-heal. ```diff @@ Backoff Logic -let backoff_secs = ((base_backoff as f64) * jitter_factor) as u64; -let backoff_secs = backoff_secs.max(base_interval_seconds); +let max_backoff = base_backoff; +let min_backoff = base_interval_seconds; +let span = max_backoff.saturating_sub(min_backoff); +let backoff_secs = min_backoff + ((span as f64) * jitter_factor) as u64; @@ Scheduler states -- `paused` — permanent error ... OR circuit breaker tripped ... +- `paused` — permanent error requiring intervention +- `half_open` — probe state after circuit cooldown; one trial run allowed @@ Circuit breaker -... transitions to `paused` ... Run: lore service resume +... transitions to `half_open` after cooldown (default 30m). Successful probe closes breaker automatically; failed probe returns to backoff/paused. ``` 7. **Promote backend trait to v1 (not v2) for deterministic integration tests** Analysis: This is a reliability-critical feature spanning OS schedulers. A trait abstraction now gives true behavior tests and safer refactors. ```diff @@ ### Platform Backends -> Future architecture note: A `SchedulerBackend` trait ... for v2. +Adopt `SchedulerBackend` trait in v1 with real backends (`launchd/systemd/schtasks`) and `FakeBackend` for tests. +This enables deterministic install/uninstall/status/run-path integration tests without touching host scheduler. ``` 8. **Harden `run_cmd` timeout behavior** Analysis: If timeout occurs, child process must be killed and reaped. Otherwise you leak processes and can wedge repeated runs. ```diff @@ fn run_cmd(...) -// Wait with timeout -let output = wait_with_timeout(output, timeout_secs)?; +// Wait with timeout; on timeout kill child and wait to reap +let output = wait_with_timeout_kill_and_reap(child, timeout_secs)?; ``` 9. **Add manual control commands (`pause`, `trigger`, `repair`)** Analysis: These are high-utility operational controls. `trigger` helps immediate sync without waiting interval. `pause` supports maintenance windows. `repair` avoids manual file deletion for corrupt state. ```diff @@ pub enum ServiceCommand { + /// Pause scheduled execution without uninstalling + Pause { #[arg(long)] reason: Option }, + /// Trigger an immediate one-off run using installed profile + Trigger { #[arg(long)] ignore_backoff: bool }, + /// Repair corrupt manifest/status by backing up and reinitializing + Repair { #[arg(long)] service: Option }, } ``` 10. **Make `logs` default non-interactive and add rotation policy** Analysis: Opening editor by default is awkward for automation/SSH and slower for normal diagnosis. Defaulting to `tail` is more practical; `--open` can preserve editor behavior. ```diff @@ ### `lore service logs` -By default, opens in the user's preferred editor. +By default, prints last 100 lines to stdout. +Use `--open` to open editor. @@ +Log rotation: rotate `service-stdout.log` / `service-stderr.log` at 10 MB, keep 5 files. ``` 11. **Remove destructive/shell-unsafe suggested action** Analysis: `actions(): ["rm {path}", ...]` is unsafe (shell injection + destructive guidance). Replace with safe command path. ```diff @@ LoreError::actions() -Self::ServiceCorruptState { path, .. } => vec![&format!("rm {path}"), "lore service install"], +Self::ServiceCorruptState { .. } => vec!["lore service repair", "lore service install"], ``` 12. **Tighten scheduler units for real-world reliability** Analysis: Add explicit working directory and success-exit handling to reduce environment drift and edge failures. ```diff @@ systemd service unit [Service] Type=oneshot ExecStart={binary_path} --robot service run +WorkingDirectory={data_dir} +SuccessExitStatus=0 TimeoutStartSec=900 ``` If you want, I can produce a single consolidated “v3 plan” markdown with these revisions already merged into your original structure.