I reviewed the full plan and avoided everything already listed in `## Rejected Recommendations`. These are the highest-impact revisions I'd make.

1. **Fix identity model inconsistency and prevent `--name` alias collisions**

Why this improves the plan: your text says identity includes the workspace root, but the current derivation code does not. Also, using `--name` as the actual `service_id` risks accidental cross-project collisions and destructive updates.

```diff
--- a/plan.md
+++ b/plan.md
@@ Key Design Principles / 2. Project-Scoped Service Identity
- Each installed service gets a unique `service_id` derived from a canonical identity tuple: the config file path, sorted GitLab project URLs, and workspace root.
+ Each installed service gets an immutable `identity_hash` derived from a canonical identity tuple:
+ workspace root + canonical config path + sorted normalized project URLs.
+ `service_id` remains the scheduler identifier; `--name` is a human alias only.
+ If `--name` collides with an existing service that has a different `identity_hash`, install fails with an actionable error.
@@ Install Manifest / Schema
+ /// Immutable identity hash for collision-safe matching across reinstalls
+ pub identity_hash: String,
+ /// Optional human-readable alias passed via --name
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub service_alias: Option<String>,
+ /// Canonical workspace root used in identity derivation
+ pub workspace_root: String,
@@ service_id derivation
-pub fn compute_service_id(config_path: &Path, project_urls: &[&str]) -> String
+pub fn compute_identity_hash(workspace_root: &Path, config_path: &Path, project_urls: &[&str]) -> String
```

2. **Add lock protocol to eliminate uninstall/run race conditions**

Why this improves the plan: today `service run` does not take the admin lock, and admin commands do not take the pipeline lock. `uninstall` can race with an active run and remove files mid-execution.

```diff
--- a/plan.md
+++ b/plan.md
@@ Key Design Principles / 6. Serialized Admin Mutations
- The `service run` entrypoint does NOT acquire the admin lock — it only acquires the `sync_pipeline` lock
+ The `service run` entrypoint acquires only `sync_pipeline`.
+ Destructive admin operations (`install` overwrite, `uninstall`, `repair --regenerate`) must:
+ 1) acquire `service-admin-{service_id}`
+ 2) disable scheduler backend entrypoint
+ 3) acquire `sync_pipeline` lock with timeout
+ 4) mutate/remove files
+ This lock ordering is mandatory to prevent deadlocks and run/delete races.
@@ lore service uninstall / User journey
- 4. Runs platform-specific disable command
- 5. Removes service files from disk
+ 4. Acquires `sync_pipeline` lock (after disabling scheduler) with bounded wait
+ 5. Removes service files from disk only after lock acquisition
```

3. **Make transient handling `Retry-After` aware**

Why this improves the plan: rate-limit and 503 responses often carry retry hints. Ignoring them causes useless retries and longer outages.

```diff
--- a/plan.md
+++ b/plan.md
@@ Transient vs permanent error classification
-| Transient | Retry with backoff | Network timeout, rate limited, DB locked, 5xx from GitLab |
+| Transient | Retry with adaptive backoff | Network timeout, DB locked, 5xx from GitLab |
+| Transient (hinted) | Respect server retry hint | Rate limited with Retry-After/X-RateLimit-Reset |
@@ Backoff Logic
+ If an error includes a retry hint (e.g., `Retry-After`), set:
+ `next_retry_at_ms = max(computed_backoff, hinted_retry_at_ms)`.
+ Persist `backoff_reason` for status visibility.
```

4. **Decouple optional stage cadence from core sync interval**

Why this improves the plan: running docs/embeddings every 5–30 minutes is expensive and unnecessary. Separate freshness windows reduce cost/latency while keeping core data fresh.
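The freshness-window skip reduces to a single timestamp comparison. A minimal sketch, assuming illustrative names (`should_skip_stage`, `last_success_ms`, `freshness_window_ms` are mine, not the plan's API):

```rust
/// Decide whether an optional stage (docs, embeddings) should be skipped
/// because its last successful run is still within the freshness window.
/// All identifiers here are illustrative assumptions, not the plan's types.
fn should_skip_stage(last_success_ms: Option<i64>, freshness_window_ms: i64, now_ms: i64) -> bool {
    match last_success_ms {
        // Skip only while the previous success is still fresh enough.
        Some(ts) => now_ms - ts < freshness_window_ms,
        // Never skip a stage that has no recorded successful run.
        None => false,
    }
}
```

A stage skipped this way would then be recorded explicitly (e.g. via the proposed `skipped` flag) rather than counted as a failure.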
```diff
--- a/plan.md
+++ b/plan.md
@@ Sync profiles
-| `balanced` (default) | `--no-embed` | Issues + MRs + doc generation |
-| `full` | (none) | Full pipeline including embeddings |
+| `balanced` (default) | core every interval, docs every 60m, no embeddings | Fast + useful docs |
+| `full` | core every interval, docs every interval, embeddings every 6h (default) | Full freshness with bounded cost |
@@ Status File / StageResult
+ /// true when stage intentionally skipped due to freshness window
+ #[serde(default)]
+ pub skipped: bool,
@@ lore service run / Stage-aware execution
+ Optional stages may be skipped when their last successful run is within configured freshness windows.
+ Skipped optional stages do not count as failures and are recorded explicitly.
```

5. **Give Windows parity for secure token handling (env-file + wrapper)**

Why this improves the plan: the current Windows path requires a global/system environment variable and has poor UX. A wrapper + env-file model gives platform parity and avoids global token exposure.

```diff
--- a/plan.md
+++ b/plan.md
@@ Token storage strategies
-| On Windows, neither strategy applies — the token must be in the user's system environment
+| On Windows, `env-file` is supported via a generated wrapper script (`service-run-{service_id}.cmd` or `.ps1`)
+| that reads `{data_dir}/service-env-{service_id}` and launches `lore --robot service run ...`.
+| `embedded` remains opt-in and warned as less secure.
@@ Windows: schtasks
- Token handling on Windows: The env var must be set system-wide via `setx`
+ Token handling on Windows:
+ - `env-file` (default): wrapper script reads token from private file at runtime
+ - `embedded`: token passed via wrapper-set environment variable
+ - `system_env`: still supported as fallback
```

6. **Add run heartbeat and stale-run detection**

Why this improves the plan: the `running` state can become misleading after crashes or stale locks. Heartbeat metadata makes status accurate and improves incident triage.
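The staleness check itself can be one comparison against the last heartbeat. A sketch, where the `RunLiveness` enum and the threshold parameter are my assumptions and `CurrentRunState` mirrors the struct proposed for the status file:

```rust
/// Mirrors the `CurrentRunState` proposed for the status file schema.
pub struct CurrentRunState {
    pub run_id: String,
    pub started_at_ms: i64,
    pub last_heartbeat_ms: i64,
    pub pid: u32,
}

/// Illustrative liveness states; names are assumptions, not the plan's API.
#[derive(Debug, PartialEq)]
pub enum RunLiveness {
    Running,
    RunningStale,
}

/// A run is stale when its heartbeat has not been refreshed within
/// `stale_after_ms` of the observation time.
pub fn classify_run(run: &CurrentRunState, now_ms: i64, stale_after_ms: i64) -> RunLiveness {
    if now_ms - run.last_heartbeat_ms > stale_after_ms {
        RunLiveness::RunningStale
    } else {
        RunLiveness::Running
    }
}
```

`status` would then report `running` or `running_stale` from this classification instead of trusting lock state alone.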
```diff
--- a/plan.md
+++ b/plan.md
@@ Status File / Schema
+ /// In-flight run metadata for crash/stale detection
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub current_run: Option<CurrentRunState>,
+
+pub struct CurrentRunState {
+    pub run_id: String,
+    pub started_at_ms: i64,
+    pub last_heartbeat_ms: i64,
+    pub pid: u32,
+}
@@ lore service status
- - `running` — currently executing (sync_pipeline lock held)
+ - `running` — currently executing with live heartbeat
+ - `running_stale` — in-flight metadata exists but heartbeat exceeded stale threshold
```

7. **Upgrade drift detection from “loaded/unloaded” to spec-level drift**

Why this improves the plan: platform state alone misses manual edits to unit/plist/wrapper files. Spec-hash drift gives reliable “what changed?” diagnostics and safe regeneration.

```diff
--- a/plan.md
+++ b/plan.md
@@ Install Manifest / Schema
+ /// Hash of generated scheduler artifacts and command spec
+ pub spec_hash: String,
@@ lore service status
- Detects manifest/platform drift and reports it
+ Detects:
+ - platform drift (loaded/unloaded mismatch)
+ - spec drift (artifact content hash mismatch)
+ - command drift (sync command differs from manifest)
@@ lore service repair
+ Add `--regenerate` to rewrite scheduler artifacts from manifest when spec drift is detected.
+ This is non-destructive and does not delete status/log history.
```

8. **Add safe operational modes: `install --dry-run` and `doctor --fix`**

Why this improves the plan: dry-run reduces risk before writing OS scheduler files; fix-mode improves operator ergonomics and lowers support burden.
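A dry-run can render everything and report it without touching disk. A minimal sketch, assuming a hypothetical `ServiceFiles` container for rendered-but-unwritten artifacts (not one of the plan's types):

```rust
/// Hypothetical container for artifacts rendered during install planning.
struct ServiceFiles {
    /// (destination path, rendered contents)
    rendered: Vec<(String, String)>,
    /// scheduler commands that would be executed after writing
    commands: Vec<String>,
}

/// Dry-run: describe everything install *would* do, writing and executing nothing.
fn describe_dry_run(files: &ServiceFiles) -> Vec<String> {
    let mut out = Vec::new();
    for (path, contents) in &files.rendered {
        out.push(format!("would write {} ({} bytes)", path, contents.len()));
    }
    for cmd in &files.commands {
        out.push(format!("would run: {}", cmd));
    }
    out
}
```

The same report could feed the robot-mode output so tooling can diff planned against actual state.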
```diff
--- a/plan.md
+++ b/plan.md
@@ lore service install
+ Add `--dry-run`:
+ - validates config/token/prereqs
+ - renders service files and planned commands
+ - writes nothing, executes nothing
@@ lore service doctor
+ Add `--fix` for safe, non-destructive remediations:
+ - create missing dirs
+ - correct file permissions on env/wrapper files
+ - run `systemctl --user daemon-reload` when applicable
+ - report applied fixes in robot output
```

9. **Define explicit schema migration behavior (not just `schema_version` fields)**

Why this improves the plan: version fields without a migration policy become an operational risk during upgrades.

```diff
--- a/plan.md
+++ b/plan.md
@@ ServiceManifest Read/Write
- `ServiceManifest::read(path: &Path) -> Result<Option<Self>, LoreError>`
+ `ServiceManifest::read_and_migrate(path: &Path) -> Result<Option<Self>, LoreError>`
+ - Migrates known older schema versions to current in-memory model
+ - Rewrites migrated file atomically
+ - Fails with actionable `ServiceCorruptState` for unknown future major versions
@@ SyncStatusFile Read/Write
- `SyncStatusFile::read(path: &Path) -> Result<Option<Self>, LoreError>`
+ `SyncStatusFile::read_and_migrate(path: &Path) -> Result<Option<Self>, LoreError>`
```

If you want, I can produce a fully rewritten v5 plan text that integrates all nine changes coherently section-by-section.