gitlore/plans/lore-service.feedback-5.md

I reviewed the full plan and avoided everything already listed in `## Rejected Recommendations`. These are the highest-impact revisions I’d make.

1. **Fix identity model inconsistency and prevent `--name` alias collisions**
Why this improves the plan: your text says identity includes workspace root, but the current derivation code does not. Also, using `--name` as the actual `service_id` risks accidental cross-project collisions and destructive updates.

```diff
--- a/plan.md
+++ b/plan.md
@@ Key Design Principles / 2. Project-Scoped Service Identity
- Each installed service gets a unique `service_id` derived from a canonical identity tuple: the config file path, sorted GitLab project URLs, and workspace root.
+ Each installed service gets an immutable `identity_hash` derived from a canonical identity tuple:
+ workspace root + canonical config path + sorted normalized project URLs.
+ `service_id` remains the scheduler identifier; `--name` is a human alias only.
+ If `--name` collides with an existing service that has a different `identity_hash`, install fails with an actionable error.

@@ Install Manifest / Schema
+ /// Immutable identity hash for collision-safe matching across reinstalls
+ pub identity_hash: String,
+ /// Optional human-readable alias passed via --name
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub service_alias: Option<String>,
+ /// Canonical workspace root used in identity derivation
+ pub workspace_root: String,

@@ service_id derivation
-pub fn compute_service_id(config_path: &Path, project_urls: &[&str]) -> String
+pub fn compute_identity_hash(workspace_root: &Path, config_path: &Path, project_urls: &[&str]) -> String
```

2. **Add lock protocol to eliminate uninstall/run race conditions**
Why this improves the plan: today `service run` does not take admin lock, and admin commands do not take pipeline lock. `uninstall` can race with an active run and remove files mid-execution.

```diff
--- a/plan.md
+++ b/plan.md
@@ Key Design Principles / 6. Serialized Admin Mutations
- The `service run` entrypoint does NOT acquire the admin lock — it only acquires the `sync_pipeline` lock
+ The `service run` entrypoint acquires only `sync_pipeline`.
+ Destructive admin operations (`install` overwrite, `uninstall`, `repair --regenerate`) must:
+ 1) acquire `service-admin-{service_id}`
+ 2) disable scheduler backend entrypoint
+ 3) acquire `sync_pipeline` lock with timeout
+ 4) mutate/remove files
+ This lock ordering is mandatory to prevent deadlocks and run/delete races.

@@ lore service uninstall / User journey
- 4. Runs platform-specific disable command
- 5. Removes service files from disk
+ 4. Acquires `sync_pipeline` lock (after disabling scheduler) with bounded wait
+ 5. Removes service files from disk only after lock acquisition
```

3. **Make transient handling `Retry-After` aware**
Why this improves the plan: rate-limit and 503 responses often carry retry hints. Ignoring them causes useless retries and longer outages.

```diff
--- a/plan.md
+++ b/plan.md
@@ Transient vs permanent error classification
-| Transient | Retry with backoff | Network timeout, rate limited, DB locked, 5xx from GitLab |
+| Transient | Retry with adaptive backoff | Network timeout, DB locked, 5xx from GitLab |
+| Transient (hinted) | Respect server retry hint | Rate limited with Retry-After/X-RateLimit-Reset |

@@ Backoff Logic
+ If an error includes a retry hint (e.g., `Retry-After`), set:
+ `next_retry_at_ms = max(computed_backoff, hinted_retry_at_ms)`.
+ Persist `backoff_reason` for status visibility.
```

4. **Decouple optional stage cadence from core sync interval**
Why this improves the plan: running docs/embeddings every 5–30 minutes is expensive and unnecessary. Separate freshness windows reduce cost/latency while keeping core data fresh.

```diff
--- a/plan.md
+++ b/plan.md
@@ Sync profiles
-| `balanced` (default) | `--no-embed` | Issues + MRs + doc generation |
-| `full` | (none) | Full pipeline including embeddings |
+| `balanced` (default) | core every interval, docs every 60m, no embeddings | Fast + useful docs |
+| `full` | core every interval, docs every interval, embeddings every 6h (default) | Full freshness with bounded cost |

@@ Status File / StageResult
+ /// true when stage intentionally skipped due freshness window
+ #[serde(default)]
+ pub skipped: bool,

@@ lore service run / Stage-aware execution
+ Optional stages may be skipped when their last successful run is within configured freshness windows.
+ Skipped optional stages do not count as failures and are recorded explicitly.
```

5. **Give Windows parity for secure token handling (env-file + wrapper)**
Why this improves the plan: current Windows path requires global/system env and has poor UX. A wrapper+env-file model gives platform parity and avoids global token exposure.

```diff
--- a/plan.md
+++ b/plan.md
@@ Token storage strategies
-| On Windows, neither strategy applies — the token must be in the user's system environment
+| On Windows, `env-file` is supported via a generated wrapper script (`service-run-{service_id}.cmd` or `.ps1`)
+| that reads `{data_dir}/service-env-{service_id}` and launches `lore --robot service run ...`.
+| `embedded` remains opt-in and warned as less secure.

@@ Windows: schtasks
- Token handling on Windows: The env var must be set system-wide via `setx`
+ Token handling on Windows:
+ - `env-file` (default): wrapper script reads token from private file at runtime
+ - `embedded`: token passed via wrapper-set environment variable
+ - `system_env`: still supported as fallback
```

6. **Add run heartbeat and stale-run detection**
Why this improves the plan: `running` state can become misleading after crashes or stale locks. Heartbeat metadata makes status accurate and improves incident triage.

```diff
--- a/plan.md
+++ b/plan.md
@@ Status File / Schema
+ /// In-flight run metadata for crash/stale detection
+ #[serde(skip_serializing_if = "Option::is_none")]
+ pub current_run: Option<CurrentRunState>,
+
+pub struct CurrentRunState {
+    pub run_id: String,
+    pub started_at_ms: i64,
+    pub last_heartbeat_ms: i64,
+    pub pid: u32,
+}

@@ lore service status
- - `running` — currently executing (sync_pipeline lock held)
+ - `running` — currently executing with live heartbeat
+ - `running_stale` — in-flight metadata exists but heartbeat exceeded stale threshold
```

7. **Upgrade drift detection from “loaded/unloaded” to spec-level drift**
Why this improves the plan: platform state alone misses manual edits to unit/plist/wrapper files. Spec-hash drift gives reliable “what changed?” diagnostics and safe regeneration.

```diff
--- a/plan.md
+++ b/plan.md
@@ Install Manifest / Schema
+ /// Hash of generated scheduler artifacts and command spec
+ pub spec_hash: String,

@@ lore service status
- Detects manifest/platform drift and reports it
+ Detects:
+ - platform drift (loaded/unloaded mismatch)
+ - spec drift (artifact content hash mismatch)
+ - command drift (sync command differs from manifest)

@@ lore service repair
+ Add `--regenerate` to rewrite scheduler artifacts from manifest when spec drift is detected.
+ This is non-destructive and does not delete status/log history.
```

8. **Add safe operational modes: `install --dry-run` and `doctor --fix`**
Why this improves the plan: dry-run reduces risk before writing OS scheduler files; fix-mode improves operator ergonomics and lowers support burden.

```diff
--- a/plan.md
+++ b/plan.md
@@ lore service install
+ Add `--dry-run`:
+ - validates config/token/prereqs
+ - renders service files and planned commands
+ - writes nothing, executes nothing

@@ lore service doctor
+ Add `--fix` for safe, non-destructive remediations:
+ - create missing dirs
+ - correct file permissions on env/wrapper files
+ - run `systemctl --user daemon-reload` when applicable
+ - report applied fixes in robot output
```

9. **Define explicit schema migration behavior (not just `schema_version` fields)**
Why this improves the plan: version fields without migration policy become operational risk during upgrades.

```diff
--- a/plan.md
+++ b/plan.md
@@ ServiceManifest Read/Write
- `ServiceManifest::read(path: &Path) -> Result<Option<Self>, LoreError>`
+ `ServiceManifest::read_and_migrate(path: &Path) -> Result<Option<Self>, LoreError>`
+ - Migrates known older schema versions to current in-memory model
+ - Rewrites migrated file atomically
+ - Fails with actionable `ServiceCorruptState` for unknown future major versions

@@ SyncStatusFile Read/Write
- `SyncStatusFile::read(path: &Path) -> Result<Option<Self>, LoreError>`
+ `SyncStatusFile::read_and_migrate(path: &Path) -> Result<Option<Self>, LoreError>`
```

If you want, I can produce a fully rewritten v5 plan text that integrates all nine changes coherently section-by-section.