Three implementation plans with iterative cross-model refinement:

- **lore-service** (5 iterations): HTTP service layer exposing lore's SQLite data via REST/SSE for integration with external tools (dashboards, IDE extensions, chat agents). Covers authentication, rate limiting, caching strategy, and webhook-driven sync triggers.
- **work-item-status-graphql** (7 iterations + TDD appendix): Detailed implementation plan for the GraphQL-based work item status enrichment feature (now implemented). Includes the TDD appendix with test-first development specifications covering GraphQL client, adaptive pagination, ingestion orchestration, CLI display, and robot mode output.
- **time-decay-expert-scoring** (iteration 5 feedback): Updates to the existing time-decay scoring plan incorporating feedback on decay curve parameterization, recency weighting for discussion contributions, and staleness detection thresholds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3760 lines
163 KiB
Markdown
---
plan: true
title: ""
status: iterating
iteration: 5
target_iterations: 8
beads_revision: 0
related_plans: []
created: 2026-02-09
updated: 2026-02-11
---

# Plan: `lore service` — OS-Native Scheduled Sync

## Context

`lore sync` runs a 4-stage pipeline (issues, MRs, docs, embeddings) that takes 2-4 minutes. Today it must be invoked manually. We want `lore service install` to set up OS-native scheduled execution automatically, with exponential backoff on failures, a circuit breaker for persistent transient errors, stage-aware outcome tracking, and a status file for observability. This is the first nested subcommand in the project.

### Key Design Principles

#### 1. Separation of Manual and Scheduled Execution

`lore sync` remains the manual/operator command. It is never subject to backoff, pausing, or service-level policy. A separate hidden entrypoint — `lore service run --service-id <id>` — is what the OS scheduler actually invokes. This entrypoint applies service-specific policy (backoff, error classification, pipeline locking) before delegating to the sync pipeline. This separation ensures that a human running `lore sync` to debug or recover is never unexpectedly blocked by service state. The `--service-id` parameter ensures unambiguous manifest/status file selection when multiple services are installed.

#### 2. Project-Scoped Service Identity

Each installed service gets a unique `service_id` derived from a canonical identity tuple: the workspace root, config file path, and sorted GitLab project URLs. This composite fingerprint prevents collisions even when multiple workspaces share a single global config file — the identity represents *what* is being synced and *where*, not just the config location. The hash uses 12 hex characters (48 bits) for collision safety. An optional `--name` flag allows explicit naming for human readability; if `--name` collides with an existing service that has a different identity hash (different workspace/config/projects), install fails with an actionable error listing the conflict.
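A minimal sketch of the identity handling described above. The helper names are illustrative, not the real implementation; the SHA-256 step is elided to keep the sketch dependency-free (a crate such as `sha2` would hash the canonical string, and the first 12 hex characters of the digest become the `service_id`):

```rust
/// Sanitize an explicit `--name` to the allowed `[a-z0-9-]` alphabet
/// (hypothetical helper; the real flag handling may differ).
fn sanitize_name(name: &str) -> String {
    name.to_lowercase()
        .chars()
        .map(|c| if c.is_ascii_lowercase() || c.is_ascii_digit() { c } else { '-' })
        .collect()
}

/// Build the canonical identity string from the tuple described above:
/// workspace root + config path + *sorted* project URLs. The service_id
/// would be the first 12 hex chars (48 bits) of SHA-256 over this string.
fn canonical_identity(workspace_root: &str, config_path: &str, project_urls: &[&str]) -> String {
    let mut urls: Vec<&str> = project_urls.to_vec();
    urls.sort_unstable(); // sorting makes the fingerprint order-independent
    format!("{}\n{}\n{}", workspace_root, config_path, urls.join("\n"))
}

fn main() {
    assert_eq!(sanitize_name("My Team_Sync"), "my-team-sync");
    // The same projects listed in a different order yield the same identity.
    let a = canonical_identity("/w", "/c.json", &["https://g/b", "https://g/a"]);
    let b = canonical_identity("/w", "/c.json", &["https://g/a", "https://g/b"]);
    assert_eq!(a, b);
    println!("ok");
}
```

Sorting the URLs before joining is what makes the fingerprint stable across config reorderings; the newline separator avoids ambiguity between adjacent fields.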

#### 3. Stage-Aware Outcome Tracking

The sync pipeline has stages of differing criticality. Issues and MRs are **core** — their failure constitutes a hard failure. Docs and embeddings are **optional** — their failure produces a **degraded** outcome but does not trigger backoff or pause. This ensures data freshness for the most important entities even when peripheral stages have transient problems.

#### 4. Resilient Failure Handling

Errors are classified as transient (retry with backoff) or permanent (pause until user intervention). A **circuit breaker** trips after a configurable number of consecutive transient failures (default: 10), transitioning to a `half_open` probe state after a cooldown period (default: 30 minutes). In `half_open`, one trial run is allowed — if it succeeds, the breaker closes automatically; if it fails, the breaker returns to `paused` state requiring manual `lore service resume`. This provides self-healing for systemic but recoverable failures (DNS outages, temporary GitLab maintenance) while still halting on truly persistent problems.
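Under the defaults above (trip at 10 consecutive transient failures, probe after cooldown), the breaker's transitions could be sketched as a small state machine. Names and shapes here are illustrative, not the actual implementation:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Breaker {
    Closed { consecutive_failures: u32 },
    Paused,   // tripped; waiting for cooldown expiry or manual `lore service resume`
    HalfOpen, // cooldown expired; exactly one probe run allowed
}

const TRIP_THRESHOLD: u32 = 10;

/// Advance breaker state on a run outcome (`transient_failure` = the run
/// failed with a transient error). Cooldown expiry, which moves
/// Paused -> HalfOpen, is driven by the clock and not shown here.
fn on_run(state: Breaker, transient_failure: bool) -> Breaker {
    match (state, transient_failure) {
        (Breaker::Closed { consecutive_failures }, true) => {
            let n = consecutive_failures + 1;
            if n >= TRIP_THRESHOLD {
                Breaker::Paused
            } else {
                Breaker::Closed { consecutive_failures: n }
            }
        }
        (Breaker::Closed { .. }, false) => Breaker::Closed { consecutive_failures: 0 },
        // Probe run: success closes the breaker, failure re-pauses it.
        (Breaker::HalfOpen, false) => Breaker::Closed { consecutive_failures: 0 },
        (Breaker::HalfOpen, true) => Breaker::Paused,
        // While paused, scheduled runs are skipped entirely.
        (Breaker::Paused, _) => Breaker::Paused,
    }
}

fn main() {
    let s = on_run(Breaker::Closed { consecutive_failures: 9 }, true);
    assert_eq!(s, Breaker::Paused); // 10th consecutive transient failure trips
    assert_eq!(on_run(Breaker::HalfOpen, false), Breaker::Closed { consecutive_failures: 0 });
    assert_eq!(on_run(Breaker::HalfOpen, true), Breaker::Paused);
    println!("ok");
}
```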

#### 5. Transactional Install

The install process is two-phase: service files are generated and the platform-specific enable command is run first. Only on success is the install manifest written atomically. If the enable command fails, generated files are cleaned up and no manifest is persisted. This prevents a false "installed" state when the scheduler rejects the service configuration.

#### 6. Serialized Admin Mutations

All commands that mutate service state (install, uninstall, pause, resume, repair) acquire an admin-level lock — `AppLock("service-admin-{service_id}")` — before reading or writing manifest/status files. This prevents races between concurrent admin commands (e.g., a user running `service pause` while an automated tool runs `service resume`). The admin lock is separate from the `sync_pipeline` lock, which guards the data pipeline. Legal state transitions:

- `idle` -> `running` -> `success` | `degraded` | `backoff` | `paused`
- `backoff` -> `running` | `paused`
- `paused` -> `half_open` | `running` (via `resume`)
- `half_open` -> `running` | `paused`

Any transition not in this table is rejected with `ServiceCorruptState`. The `service run` entrypoint does NOT acquire the admin lock — it only acquires the `sync_pipeline` lock to avoid overlapping data writes.
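The transition table above can be encoded as a single guard function that admin commands consult before persisting a state change; anything outside the table maps to `ServiceCorruptState`. A sketch, encoding exactly the listed transitions:

```rust
/// Validate a scheduler-state transition against the legal-transition table.
/// Returns false for anything the table does not list.
fn is_legal_transition(from: &str, to: &str) -> bool {
    matches!(
        (from, to),
        ("idle", "running")
            | ("running", "success")
            | ("running", "degraded")
            | ("running", "backoff")
            | ("running", "paused")
            | ("backoff", "running")
            | ("backoff", "paused")
            | ("paused", "half_open")
            | ("paused", "running") // via `lore service resume`
            | ("half_open", "running")
            | ("half_open", "paused")
    )
}

fn main() {
    assert!(is_legal_transition("paused", "half_open"));
    assert!(is_legal_transition("backoff", "running"));
    assert!(!is_legal_transition("success", "paused")); // not in the table: rejected
    assert!(!is_legal_transition("idle", "paused"));
    println!("ok");
}
```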

---

## Commands & User Journeys

### `lore service install [--interval 30m] [--profile balanced] [--token-source env-file] [--name <optional>] [--dry-run]`

**What it does:** Generates and installs an OS-native scheduled task that runs `lore --robot service run --service-id <service_id>` at the specified interval, with the chosen sync profile, token storage strategy, and a project-scoped identity to avoid collisions across workspaces.

**User journey:**

1. User runs `lore service install --interval 15m --profile fast`
2. CLI loads config to read `gitlab.tokenEnvVar` (default: `GITLAB_TOKEN`)
3. CLI resolves the token value from the current environment
4. CLI computes or reads `service_id`:
   - If `--name` is provided, use it (sanitized to `[a-z0-9-]`)
   - Otherwise, derive from a composite fingerprint of (workspace root + config path + sorted project URLs) — first 12 hex chars of SHA-256
   - This becomes the suffix for all platform-specific identifiers (launchd label, systemd unit name, Windows task name)
5. CLI resolves its own binary path via `std::env::current_exe()?.canonicalize()?`
6. CLI writes the token to a user-private env file (`{data_dir}/service-env-{service_id}`, mode 0600) unless `--token-source embedded` is explicitly passed
7. CLI generates the platform-specific service files (referencing `lore --robot service run --service-id <service_id>`, NOT `lore sync`)
8. CLI writes service files to disk
9. CLI runs the platform-specific enable command
10. On success: CLI writes install manifest atomically (tmp file + fsync(file) + rename + fsync(parent_dir)) to `{data_dir}/service-manifest-{service_id}.json`
11. On failure: CLI removes generated service files, env file, wrapper script, and temp manifest — returns `ServiceCommandFailed` with stderr context
12. CLI outputs success with details of what was installed
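Step 10's atomic manifest write can be sketched with the standard library alone. This is a Unix-oriented sketch (Windows durability semantics differ), and the function name is illustrative:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

/// Atomic manifest write: write a temp file, fsync it, rename into place,
/// then fsync the parent directory so the rename itself is durable.
fn write_manifest_atomic(dest: &Path, contents: &[u8]) -> std::io::Result<()> {
    let tmp = dest.with_extension("json.tmp");
    let mut f = File::create(&tmp)?;
    f.write_all(contents)?;
    f.sync_all()?; // fsync(file): data reaches disk before the rename
    fs::rename(&tmp, dest)?; // atomic replace on the same filesystem
    if let Some(parent) = dest.parent() {
        File::open(parent)?.sync_all()?; // fsync(parent_dir): persist the directory entry
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("lore-manifest-demo");
    fs::create_dir_all(&dir)?;
    let dest = dir.join("service-manifest-demo.json");
    write_manifest_atomic(&dest, b"{\"ok\":true}")?;
    assert_eq!(fs::read(&dest)?, b"{\"ok\":true}");
    println!("ok");
    Ok(())
}
```

Readers of the manifest therefore see either the old version or the new one, never a partially written file.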

**Sync profiles:**

| Profile | Sync flags | Use case |
|---------|-----------|----------|
| `fast` | `--no-docs --no-embed` | Minimal: issues + MRs only |
| `balanced` (default) | `--no-embed` | Issues + MRs + doc generation |
| `full` | (none) | Full pipeline including embeddings |

The profile determines what flags are passed to the underlying sync command. The scheduler invocation is always `lore --robot service run --service-id <service_id>`, which reads the profile from the install manifest and constructs the appropriate sync flags.
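The profile-to-flags mapping is a straight translation of the table above; a sketch with an illustrative helper name:

```rust
/// Map a sync profile to the flags passed to the underlying sync command.
/// Unknown profiles yield None so the caller can reject them explicitly.
fn profile_flags(profile: &str) -> Option<&'static [&'static str]> {
    match profile {
        "fast" => Some(&["--no-docs", "--no-embed"]),
        "balanced" => Some(&["--no-embed"]),
        "full" => Some(&[]), // full pipeline: no restricting flags
        _ => None,
    }
}

fn main() {
    assert_eq!(profile_flags("fast"), Some(&["--no-docs", "--no-embed"][..]));
    assert_eq!(profile_flags("balanced"), Some(&["--no-embed"][..]));
    assert!(profile_flags("full").unwrap().is_empty());
    assert_eq!(profile_flags("turbo"), None);
    println!("ok");
}
```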

**Token storage strategies:**

| Strategy | Behavior | Security | Platforms |
|----------|----------|----------|-----------|
| `env-file` (default) | Token written to `{data_dir}/service-env-{service_id}` with 0600 permissions. On Linux/systemd, referenced via `EnvironmentFile=` (true file-based loading). On macOS/launchd, a wrapper shell script (mode 0700) sources the env file at runtime and execs `lore` — the token never appears in the plist. | Token file only readable by owner. Canonical source is the env file; `lore service install` re-reads it on regeneration. | macOS, Linux |
| `embedded` | Token embedded directly in service file. Requires explicit `--token-source embedded` flag. CLI prints a security warning. | Less secure: token visible in plist/unit file. | macOS, Linux |

On Windows, neither strategy applies — the token must be in the user's system environment (set via `setx` or system settings). `token_source` is reported as `"system_env"`. This is documented as a requirement in `lore service install` output on Windows.

> **Note on macOS wrapper script approach:** launchd cannot natively load environment files. Rather than embedding the token directly in the plist (which would persist it in a readable XML file), we generate a small wrapper shell script (`{data_dir}/service-run-{service_id}.sh`, mode 0700) that sources the env file and execs `lore`. The plist's `ProgramArguments` points to the wrapper script, keeping the token out of the plist entirely. On Linux/systemd, `EnvironmentFile=` provides native file-based loading without any wrapper needed.
>
> **Future enhancement:** On macOS, Keychain integration could eliminate the env file entirely. On Windows, Credential Manager could replace the system environment requirement. These are deferred to a future iteration to avoid adding platform-specific secure store dependencies (`security-framework`, `winapi`) in v1.

**Acceptance criteria:**

- Parses interval strings: `5m`, `15m`, `30m`, `1h`, `2h`, `12h`, `24h`
- Rejects intervals < 5 minutes or > 24 hours
- Rejects non-numeric or malformed intervals with clear error messages
- Computes `service_id` from composite fingerprint (workspace root + config path + project URLs) or `--name` flag; sanitizes to `[a-z0-9-]`. If `--name` collides with an existing service with a different identity hash, returns an actionable error.
- If already installed (manifest exists for this `service_id`): reads existing manifest. If config matches, reports `no_change: true`. If config differs, overwrites and reports what changed.
- If `GITLAB_TOKEN` (or configured env var) is not set, fails with `TokenNotSet` error
- If `current_exe()` fails, returns `ServiceError`
- Creates parent directories for service files if they don't exist
- Writes install manifest atomically (tmp file + fsync(file) + rename + fsync(parent_dir)) alongside service files
- Runs `service doctor` checks as a pre-flight: validates scheduler prerequisites (e.g., systemd user manager/linger on Linux, GUI session context on macOS) and surfaces warnings or errors before installing
- `--dry-run`: validates config/token/prereqs, renders service files and planned commands, but writes nothing and executes nothing. Robot output includes `"dry_run": true` and the rendered service file content for inspection.
- Robot mode outputs `{"ok":true,"data":{...},"meta":{"elapsed_ms":N}}`
- Human mode outputs a clear summary with file paths and next steps
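The first three criteria can be captured by a small parser. A sketch using only the standard library; the real parser may accept more units or produce typed errors rather than strings:

```rust
use std::time::Duration;

/// Parse interval strings like `5m`, `30m`, `1h`, `24h`, enforcing the
/// documented bounds: at least 5 minutes, at most 24 hours.
fn parse_interval(s: &str) -> Result<Duration, String> {
    let (num, unit) = s.split_at(s.len().saturating_sub(1));
    let n: u64 = num.parse().map_err(|_| format!("malformed interval: {s:?}"))?;
    let secs = match unit {
        "m" => n * 60,
        "h" => n * 3600,
        _ => return Err(format!("malformed interval: {s:?} (expected <n>m or <n>h)")),
    };
    if secs < 5 * 60 || secs > 24 * 3600 {
        return Err(format!("interval {s} out of range (5m..24h)"));
    }
    Ok(Duration::from_secs(secs))
}

fn main() {
    assert_eq!(parse_interval("15m").unwrap().as_secs(), 900);
    assert_eq!(parse_interval("24h").unwrap().as_secs(), 86_400);
    assert!(parse_interval("3m").is_err());  // below the 5-minute floor
    assert!(parse_interval("25h").is_err()); // above the 24-hour ceiling
    assert!(parse_interval("abc").is_err()); // malformed
    println!("ok");
}
```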

**Robot output:**

```json
{
  "ok": true,
  "data": {
    "platform": "launchd",
    "service_id": "a1b2c3d4e5f6",
    "interval_seconds": 900,
    "profile": "fast",
    "binary_path": "/usr/local/bin/lore",
    "config_path": null,
    "service_files": ["/Users/x/Library/LaunchAgents/com.gitlore.sync.a1b2c3d4e5f6.plist"],
    "sync_command": "/usr/local/bin/lore --robot service run --service-id a1b2c3d4e5f6",
    "token_env_var": "GITLAB_TOKEN",
    "token_source": "env_file",
    "no_change": false
  },
  "meta": { "elapsed_ms": 42 }
}
```

**Human output:**

```
Service installed:
  Platform:   launchd
  Service ID: a1b2c3d4e5f6
  Interval:   15m (900s)
  Profile:    fast (--no-docs --no-embed)
  Binary:     /usr/local/bin/lore
  Service:    ~/Library/LaunchAgents/com.gitlore.sync.a1b2c3d4e5f6.plist
  Command:    lore --robot service run --service-id a1b2c3d4e5f6
  Token:      stored in ~/.local/share/lore/service-env-a1b2c3d4e5f6 (0600)

To rotate your token: lore service install
```

---

### `lore service list`

**What it does:** Lists all installed services discovered from `{data_dir}/service-manifest-*.json` files. Useful when managing multiple gitlore workspaces to see all active installations at a glance.

**User journey:**

1. User runs `lore service list`
2. CLI scans `{data_dir}` for files matching `service-manifest-*.json`
3. Reads each manifest and verifies platform state
4. Outputs summary of all installed services

**Acceptance criteria:**

- Returns empty list (not error) when no services installed
- Shows `service_id`, `platform`, `interval`, `profile`, `installed_at_iso` for each
- Verifies platform state matches manifest (flags drift)
- Robot and human output modes

**Robot output:**

```json
{
  "ok": true,
  "data": {
    "services": [
      {
        "service_id": "a1b2c3d4e5f6",
        "platform": "launchd",
        "interval_seconds": 900,
        "profile": "fast",
        "installed_at_iso": "2026-02-09T10:00:00Z",
        "platform_state": "loaded",
        "drift": false
      }
    ]
  },
  "meta": { "elapsed_ms": 15 }
}
```

**Human output:**

```
Installed services:
  a1b2c3d4e5f6  launchd  15m  fast  installed 2026-02-09  loaded
```

Or when none installed:

```
No services installed. Run: lore service install
```

---

### `lore service uninstall [--service <service_id|name>] [--all]`

**What it does:** Disables and removes the scheduled task, its manifest, and its token env file.

**User journey:**

1. User runs `lore service uninstall`
2. CLI resolves target service: uses `--service` if provided, otherwise derives `service_id` from current project config. If multiple manifests exist and no selector is provided, returns an actionable error listing available services with `lore service list`.
3. If manifest doesn't exist, checks platform directly; if not installed, exits cleanly with informational message (exit 0, not an error)
4. Runs platform-specific disable command
5. Removes service files from disk
6. Removes install manifest (`service-manifest-{service_id}.json`)
7. Removes token env file (`service-env-{service_id}`) if it exists
8. Does NOT remove the status file or log files (those are operational data, not config)
9. Outputs confirmation

**Acceptance criteria:**

- Idempotent: running when not installed is not an error
- Removes ALL service files (timer + service on systemd), the install manifest, and the token env file
- Does NOT remove the status file or log files (those are data, not config)
- If platform disable command fails (e.g., service was already unloaded), still removes files and succeeds
- Robot and human output modes

**Robot output:**

```json
{
  "ok": true,
  "data": {
    "was_installed": true,
    "service_id": "a1b2c3d4e5f6",
    "platform": "launchd",
    "removed_files": [
      "/Users/x/Library/LaunchAgents/com.gitlore.sync.a1b2c3d4e5f6.plist",
      "/Users/x/.local/share/lore/service-manifest-a1b2c3d4e5f6.json",
      "/Users/x/.local/share/lore/service-env-a1b2c3d4e5f6"
    ]
  },
  "meta": { "elapsed_ms": 15 }
}
```

**Human output:**

```
Service uninstalled (a1b2c3d4e5f6):
  Removed: ~/Library/LaunchAgents/com.gitlore.sync.a1b2c3d4e5f6.plist
  Removed: ~/.local/share/lore/service-manifest-a1b2c3d4e5f6.json
  Removed: ~/.local/share/lore/service-env-a1b2c3d4e5f6
  Kept:    ~/.local/share/lore/sync-status-a1b2c3d4e5f6.json (run history)
  Kept:    ~/.local/share/lore/logs/ (service logs)
```

Or if not installed:

```
Service is not installed. Nothing to do.
```

---

### `lore service status [--service <service_id|name>]`

**What it does:** Shows install state, scheduler state (running/backoff/paused/half_open/idle), last sync result, recent run history, and next run estimate. Resolves target service via `--service` flag or current-project-derived default.

**User journey:**

1. User runs `lore service status`
2. CLI resolves target service: uses `--service` if provided, otherwise derives `service_id` from current project config. If multiple manifests exist and no selector is provided, returns an actionable error listing available services with `lore service list`.
3. CLI reads install manifest from `{data_dir}/service-manifest-{service_id}.json`
4. If installed, verifies platform state matches manifest (detects drift)
5. Reads `{data_dir}/sync-status-{service_id}.json` for last sync and recent run history
6. Queries platform for service state and next run time
7. Computes scheduler state from status file + backoff logic
8. Outputs combined status

**Scheduler states:**

- `idle` — installed but no runs yet
- `running` — currently executing (sync_pipeline lock held, `current_run` metadata present with recent `started_at_ms`)
- `running_stale` — `current_run` metadata exists but the process (by PID) is no longer alive, or `started_at_ms` is older than 30 minutes. Indicates a crashed or killed previous run. `lore service status` reports this with the stale run's start time and PID for diagnostics.
- `degraded` — last run completed but one or more optional stages failed (docs/embeddings). Core data (issues/MRs) is fresh.
- `backoff` — transient failures, waiting to retry
- `half_open` — circuit breaker cooldown expired; one probe run is allowed. If it succeeds, the breaker closes automatically and state returns to normal. If it fails, state transitions to `paused`.
- `paused` — permanent error detected (bad token, config error) OR circuit breaker tripped and probe failed. Requires user intervention via `lore service resume`.
- `not_installed` — service not installed

**Acceptance criteria:**

- Works even if service is not installed (shows `installed: false`, `scheduler_state: "not_installed"`)
- Works even if status file doesn't exist (shows `last_sync: null`)
- Shows backoff state with remaining time if in backoff
- Shows paused reason if in paused state
- Includes recent runs summary (last 5 runs)
- Shows next scheduled run if determinable from platform
- Detects drift at multiple levels:
  - **Platform drift:** loaded/unloaded mismatch between manifest and OS scheduler
  - **Spec drift:** SHA-256 hash of service file content on disk doesn't match `spec_hash` in manifest (detects manual edits to plist/unit files)
  - **Command drift:** sync command in service file differs from manifest's `sync_command`
- Exit code 0 always (status is informational)

**Robot output:**

```json
{
  "ok": true,
  "data": {
    "installed": true,
    "service_id": "a1b2c3d4e5f6",
    "platform": "launchd",
    "interval_seconds": 1800,
    "profile": "balanced",
    "service_state": "loaded",
    "scheduler_state": "running",
    "last_sync": {
      "timestamp_iso": "2026-02-09T10:30:00.000Z",
      "duration_seconds": 12.5,
      "outcome": "success",
      "stage_results": [
        { "stage": "issues", "success": true, "items_updated": 5 },
        { "stage": "mrs", "success": true, "items_updated": 3 },
        { "stage": "docs", "success": true, "items_updated": 12 }
      ],
      "consecutive_failures": 0
    },
    "recent_runs": [
      { "timestamp_iso": "2026-02-09T10:30:00Z", "outcome": "success", "duration_seconds": 12.5 },
      { "timestamp_iso": "2026-02-09T10:00:00Z", "outcome": "success", "duration_seconds": 11.8 }
    ],
    "backoff": null,
    "paused_reason": null,
    "drift": {
      "platform_drift": false,
      "spec_drift": false,
      "command_drift": false
    }
  },
  "meta": { "elapsed_ms": 15 }
}
```

When degraded (optional stages failed):

```json
"scheduler_state": "degraded",
"last_sync": {
  "outcome": "degraded",
  "stage_results": [
    { "stage": "issues", "success": true, "items_updated": 5 },
    { "stage": "mrs", "success": true, "items_updated": 3 },
    { "stage": "docs", "success": false, "error": "I/O error writing documents" }
  ]
}
```

When in backoff:

```json
"scheduler_state": "backoff",
"backoff": {
  "consecutive_failures": 3,
  "next_retry_iso": "2026-02-09T14:30:00.000Z",
  "remaining_seconds": 7200
}
```

When paused (permanent error):

```json
"scheduler_state": "paused",
"paused_reason": "AUTH_FAILED: GitLab returned 401 Unauthorized. Run: lore service resume"
```

When paused (circuit breaker):

```json
"scheduler_state": "paused",
"paused_reason": "CIRCUIT_BREAKER: 10 consecutive transient failures (last: NetworkError). Run: lore service resume"
```

When in half-open (circuit breaker cooldown expired, probe pending):

```json
"scheduler_state": "half_open",
"backoff": {
  "consecutive_failures": 10,
  "circuit_breaker_cooldown_expired": true,
  "message": "Circuit breaker cooldown expired. Next run will be a probe attempt."
}
```

**Human output:**

```
Service status (a1b2c3d4e5f6):
  Installed:  yes
  Platform:   launchd
  Interval:   30m (1800s)
  Profile:    balanced
  State:      loaded
  Scheduler:  running

Last sync:
  Time:      2026-02-09 10:30:00 UTC
  Duration:  12.5s
  Outcome:   success
  Stages:    issues (5), mrs (3), docs (12)
  Failures:  0 consecutive

Recent runs (last 5):
  10:30 UTC  success  12.5s
  10:00 UTC  success  11.8s
```

When degraded:

```
Scheduler: DEGRADED
  Core stages OK: issues (5), mrs (3)
  Failed stages:  docs (I/O error writing documents)
  Core data is fresh. Optional stages will retry next run.
```

When paused (permanent error):

```
Scheduler: PAUSED - AUTH_FAILED
  GitLab returned 401 Unauthorized
  Fix: rotate token, then run: lore service resume
```

When paused (circuit breaker):

```
Scheduler: PAUSED - CIRCUIT_BREAKER
  10 consecutive transient failures (last: NetworkError)
  Fix: check network/GitLab availability, then run: lore service resume
```

When half-open (circuit breaker cooldown expired):

```
Scheduler: HALF_OPEN
  Circuit breaker cooldown expired. Next run will probe.
  If probe succeeds, scheduler returns to normal.
```

---

### `lore service logs [--tail <n>] [--follow] [--open] [--service <service_id|name>]`

**What it does:** Displays or streams the service log file. By default, prints the last 100 lines to stdout. With `--tail <n>`, shows the last N lines. With `--follow`, streams new lines as they arrive (like `tail -f`). With `--open`, opens in the user's preferred editor.

**User journey (default):**

1. User runs `lore service logs`
2. CLI determines log path: `{data_dir}/logs/service-{service_id}-stderr.log`
3. CLI checks if file exists; if not, outputs "No log file found yet" with the expected path
4. Prints last 100 lines to stdout

**User journey (--open):**

1. User runs `lore service logs --open`
2. CLI determines editor: `$VISUAL` -> `$EDITOR` -> `less` (Unix) / `notepad` (Windows)
3. Spawns editor as child process, waits for exit
4. Exits with editor's exit code

**User journey (--tail / --follow):**

1. User runs `lore service logs --tail 50` or `lore service logs --follow`
2. CLI reads the last N lines or streams with follow
3. Outputs directly to stdout

**Log rotation:** Rotate `service-{service_id}-stdout.log` and `service-{service_id}-stderr.log` once they exceed 10 MB, keeping 5 rotated files. The 10 MB size threshold is the only rotation trigger; rotation is not attempted on every write. This avoids creating many small files and prevents unbounded log growth.

**Acceptance criteria:**

- Default (no flags): prints last 100 lines to stdout
- `--open`: Falls back through `VISUAL` -> `EDITOR` -> `less` -> `notepad`. If no editor and no `less` available, returns `ServiceError` with suggestion.
- `--tail <n>` shows last N lines (default 100 if no value), exits immediately
- `--follow` streams new log lines until Ctrl-C (like `tail -f`); mutually exclusive with `--open`
- `--tail` and `--follow` can be combined: show last N lines then follow
- In robot mode, outputs the log file path and optionally last N lines as JSON (never opens editor)
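The tail behavior is simple enough to sketch directly. This minimal version reads everything into memory; a production implementation would seek backward from the end for large files:

```rust
/// Return the last `n` lines of a log's contents, as `--tail <n>` would.
/// If the log has fewer than `n` lines, all of them are returned.
fn tail_lines(contents: &str, n: usize) -> Vec<&str> {
    let lines: Vec<&str> = contents.lines().collect();
    let start = lines.len().saturating_sub(n); // avoids underflow when n > len
    lines[start..].to_vec()
}

fn main() {
    let log = "a\nb\nc\nd\n";
    assert_eq!(tail_lines(log, 2), vec!["c", "d"]);
    assert_eq!(tail_lines(log, 100), vec!["a", "b", "c", "d"]); // fewer lines than n
    assert!(tail_lines("", 5).is_empty());
    println!("ok");
}
```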

**Robot output (does not open editor):**

```json
{
  "ok": true,
  "data": {
    "log_path": "/Users/x/.local/share/lore/logs/service-a1b2c3d4e5f6-stderr.log",
    "exists": true,
    "size_bytes": 4096,
    "last_lines": ["2026-02-09T10:30:00Z sync completed in 12.5s", "..."]
  },
  "meta": { "elapsed_ms": 1 }
}
```

The `last_lines` field is included when `--tail` is specified in robot mode (capped at 100 lines to avoid bloated JSON). Without `--tail`, only path metadata is returned. `--follow` is not supported in robot mode (returns error: "follow mode requires interactive terminal").

---

### `lore service doctor`

**What it does:** Validates that the service environment is healthy: scheduler prerequisites, token validity, file permissions, config accessibility, and platform-specific readiness.

**User journey:**

1. User runs `lore service doctor` (or it runs automatically as a pre-flight during `service install`)
2. CLI runs a series of diagnostic checks and reports pass/warn/fail for each

**Diagnostic checks:**

1. **Config accessible** — Can load and parse `config.json`
2. **Token present** — Configured env var is set and non-empty
3. **Token valid** — Quick auth test against GitLab API (optional, skipped with `--offline`)
4. **Binary path** — `current_exe()` resolves and is executable
5. **Data directory** — Writable by current user
6. **Platform prerequisites:**
   - **macOS:** Running in a GUI login session (launchd bootstrap domain is `gui/{uid}`, not `system`)
   - **Linux:** `systemctl --user` is available; user manager is running; `loginctl enable-linger` is active (required for timers to fire when user is not logged in)
   - **Windows:** `schtasks` is available
7. **Existing install** — If manifest exists, verify platform state matches (drift detection)

**Acceptance criteria:**

- Each check reports: `pass`, `warn`, or `fail`
- Warnings are non-blocking (e.g., linger not enabled — timer works when logged in but not on reboot)
- Failures are blocking for `service install` (install aborts with actionable message)
- `--offline` skips network checks (token validation)
- `--fix` attempts safe, non-destructive remediations for fixable issues: create missing directories, correct file permissions on env/wrapper files (0600/0700), run `systemctl --user daemon-reload` when applicable. Reports each applied fix in the output. Does NOT attempt fixes that could cause data loss.
- Exit code: 0 if all pass/warn, non-zero if any fail
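Folding the per-check results into the overall status and exit code is a small reduction; a sketch with illustrative names:

```rust
#[derive(Clone, Copy, PartialEq, PartialOrd, Debug)]
enum CheckStatus {
    Pass, // declaration order gives Pass < Warn < Fail for the derive
    Warn,
    Fail,
}

/// Reduce check results to (overall status, process exit code):
/// exit 0 if everything is pass/warn, non-zero if any check fails.
fn overall(checks: &[CheckStatus]) -> (CheckStatus, i32) {
    let worst = checks
        .iter()
        .copied()
        .fold(CheckStatus::Pass, |acc, c| if c > acc { c } else { acc });
    let exit = if worst == CheckStatus::Fail { 1 } else { 0 };
    (worst, exit)
}

fn main() {
    // Warnings are non-blocking: overall is warn, but the exit code stays 0.
    assert_eq!(overall(&[CheckStatus::Pass, CheckStatus::Warn]), (CheckStatus::Warn, 0));
    assert_eq!(overall(&[CheckStatus::Pass, CheckStatus::Fail]).1, 1);
    assert_eq!(overall(&[]), (CheckStatus::Pass, 0));
    println!("ok");
}
```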

**Robot output:**

```json
{
  "ok": true,
  "data": {
    "checks": [
      { "name": "config", "status": "pass" },
      { "name": "token_present", "status": "pass" },
      { "name": "token_valid", "status": "pass" },
      { "name": "binary_path", "status": "pass" },
      { "name": "data_directory", "status": "pass" },
      { "name": "platform_prerequisites", "status": "warn", "message": "loginctl linger not enabled; timer will not fire on reboot without active session", "action": "loginctl enable-linger $(whoami)" },
      { "name": "install_state", "status": "pass" }
    ],
    "overall": "warn"
  },
  "meta": { "elapsed_ms": 850 }
}
```

**Human output:**

```
Service doctor:
  [PASS] Config loaded from ~/.config/lore/config.json
  [PASS] GITLAB_TOKEN is set
  [PASS] GitLab authentication successful
  [PASS] Binary: /usr/local/bin/lore
  [PASS] Data dir: ~/.local/share/lore/ (writable)
  [WARN] loginctl linger not enabled
         Timer will not fire on reboot without active session
         Fix: loginctl enable-linger $(whoami)
  [PASS] No existing install detected

Overall: WARN (1 warning)
```

---

### `lore service run` (hidden/internal)

**What it does:** Executes one scheduled sync attempt with full service-level policy. This is the command the OS scheduler actually invokes — users should never need to call it directly.

**Invocation by scheduler:** `lore --robot service run --service-id <service_id>`

**Execution flow:**

1. Read install manifest for the given `service_id` to determine profile, interval, and circuit breaker config
2. Read status file (service-scoped)
3. If paused (not half_open): check if circuit breaker cooldown has expired. If cooldown expired, transition to `half_open` and allow probe (continue to step 5). If cooldown still active or paused for permanent error, log reason, write status, exit 0.
4. If in backoff window: log skip reason, write status, exit 0
5. Acquire `sync_pipeline` `AppLock` (prevents overlap with manual sync or another scheduled run)
6. If lock acquisition fails (another sync running): log, exit 0
7. Execute sync pipeline with flags derived from profile
8. On success: reset `consecutive_failures` to 0, write status, release lock
9. On transient failure: increment `consecutive_failures`, compute next backoff, write status, release lock
10. On permanent failure: set `paused_reason`, write status, release lock
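The pre-run gate in steps 3-4 can be expressed as a pure decision over the persisted status, checked before touching the pipeline lock. A sketch with illustrative names and millisecond timestamps:

```rust
#[derive(Debug)]
enum RunDecision {
    SkipPaused,  // permanent error, or breaker cooldown still active
    SkipBackoff, // inside a backoff window from a prior transient failure
    Probe,       // breaker cooldown expired: exactly one half_open trial run
    Run,         // normal scheduled run
}

/// Decide whether this scheduled invocation should proceed (steps 3-4).
/// Lock acquisition (steps 5-6) happens only after Run/Probe.
fn gate(
    paused_permanently: bool,
    breaker_tripped: bool,
    cooldown_expires_at_ms: u64,
    next_retry_at_ms: u64,
    now_ms: u64,
) -> RunDecision {
    if paused_permanently {
        return RunDecision::SkipPaused; // requires manual `lore service resume`
    }
    if breaker_tripped {
        return if now_ms >= cooldown_expires_at_ms {
            RunDecision::Probe // transition to half_open and attempt one run
        } else {
            RunDecision::SkipPaused
        };
    }
    if now_ms < next_retry_at_ms {
        return RunDecision::SkipBackoff;
    }
    RunDecision::Run
}

fn main() {
    assert!(matches!(gate(true, false, 0, 0, 200), RunDecision::SkipPaused));
    assert!(matches!(gate(false, true, 100, 0, 200), RunDecision::Probe));
    assert!(matches!(gate(false, true, 300, 0, 200), RunDecision::SkipPaused));
    assert!(matches!(gate(false, false, 0, 500, 200), RunDecision::SkipBackoff));
    assert!(matches!(gate(false, false, 0, 0, 200), RunDecision::Run));
    println!("ok");
}
```

Every skip path still writes the status file and exits 0, so the OS scheduler never sees a failing invocation.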

**Stage-aware execution:**

The sync pipeline is executed stage-by-stage, with each stage's outcome recorded independently:

| Stage | Criticality | Failure behavior | In-run retry |
|-------|-------------|-----------------|--------------|
| `issues` | **core** | Hard failure — triggers backoff/pause | 1 retry on transient errors (1-5s jittered delay) |
| `mrs` | **core** | Hard failure — triggers backoff/pause | 1 retry on transient errors (1-5s jittered delay) |
| `docs` | **optional** | Degraded outcome — logged but does not trigger backoff | No retry (best-effort) |
| `embeddings` | **optional** | Degraded outcome — logged but does not trigger backoff | No retry (best-effort) |

**In-run retries for core stages:** Before counting a core stage failure toward backoff/circuit-breaker, the service runner retries the stage once with a jittered delay of 1-5 seconds. This absorbs transient network blips (DNS hiccups, momentary 5xx responses) without extending run duration significantly. Only transient errors are retried — permanent errors (bad token, config errors) are never retried. If the retry succeeds, the stage is recorded as successful. If both attempts fail, the final error is used for classification. This significantly reduces false backoff triggers from brief network interruptions.

If all core stages succeed (potentially after retry) but optional stages fail, the run outcome is `"degraded"` — consecutive failures are NOT incremented, and the scheduler state reflects `degraded` rather than `backoff`. This ensures data freshness for the most important entities even when peripheral stages have transient problems.
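The core/optional aggregation rule reduces to a two-pass fold over the recorded stage results; a sketch with illustrative types:

```rust
#[derive(PartialEq, Debug)]
enum RunOutcome {
    Success,
    Degraded, // core data fresh; consecutive_failures NOT incremented
    Failed,   // at least one core stage failed: backoff/pause handling applies
}

struct StageResult {
    core: bool,
    success: bool,
}

/// Fold per-stage results into a run outcome per the table above.
fn run_outcome(stages: &[StageResult]) -> RunOutcome {
    if stages.iter().any(|s| s.core && !s.success) {
        RunOutcome::Failed
    } else if stages.iter().any(|s| !s.success) {
        RunOutcome::Degraded
    } else {
        RunOutcome::Success
    }
}

fn main() {
    let degraded = [
        StageResult { core: true, success: true },   // issues
        StageResult { core: true, success: true },   // mrs
        StageResult { core: false, success: false }, // docs failed: optional
    ];
    assert_eq!(run_outcome(&degraded), RunOutcome::Degraded);

    let failed = [
        StageResult { core: true, success: false }, // issues failed: core
        StageResult { core: false, success: true },
    ];
    assert_eq!(run_outcome(&failed), RunOutcome::Failed);
    println!("ok");
}
```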

**Transient vs permanent error classification:**

| Error type | Classification | Examples |
|-----------|---------------|----------|
| Transient | Retry with backoff | Network timeout, DB locked, 5xx from GitLab |
| Transient (hinted) | Respect server retry hint | Rate limited with `Retry-After` or `X-RateLimit-Reset` header |
| Permanent | Pause until user action | 401 Unauthorized (bad token), config not found, config invalid, migration failed |

The classification is determined by the `ErrorCode` of the underlying `LoreError`:

- Permanent: `TokenNotSet`, `AuthFailed`, `ConfigNotFound`, `ConfigInvalid`, `MigrationFailed`
- Transient: everything else (`NetworkError`, `RateLimited`, `DbLocked`, `DbError`, `InternalError`, etc.)
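
As a sketch, this mapping reduces to a single match. Shown here over plain strings for illustration; the real implementation would match on the `ErrorCode` enum of `LoreError`, and `is_permanent` is a hypothetical name:

```rust
/// Returns true if the error code represents a permanent failure that
/// should pause the service, rather than a transient one that triggers
/// backoff retries.
fn is_permanent(error_code: &str) -> bool {
    matches!(
        error_code,
        "TokenNotSet" | "AuthFailed" | "ConfigNotFound" | "ConfigInvalid" | "MigrationFailed"
    )
}
```

Keeping the permanent set explicit (and everything else transient) means new error codes default to the safer retry-with-backoff path.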

**Key design decisions:**

- `next_retry_at_ms` is computed once on failure and persisted — `service status` simply reads it for stable, consistent display
- **Retry-After awareness:** If a transient error includes a server-provided retry hint (e.g., `Retry-After` header on 429 responses, `X-RateLimit-Reset` on GitLab rate limits), the backoff is set to `max(computed_backoff, hinted_retry_at)`. This prevents useless retries during rate-limit windows and respects GitLab's guidance. The `backoff_reason` field (if present) indicates whether the backoff was server-hinted.
- Backoff base is the configured interval, not a hardcoded 1800s — a user with `--interval 5m` gets shorter backoffs than one with `--interval 1h`

**Circuit breaker (with half-open recovery):**

After `max_transient_failures` consecutive transient failures (default: 10), the service transitions to `paused` state with reason `CIRCUIT_BREAKER`. However, instead of requiring manual intervention forever, the circuit breaker enters a `half_open` state after a cooldown period (`circuit_breaker_cooldown_seconds`, default: 1800 = 30 minutes).

In `half_open`, the next `service run` invocation is allowed to proceed as a **probe**:

- If the probe succeeds or returns `degraded`, the circuit breaker closes automatically: `consecutive_failures` resets to 0, `paused_reason` is cleared, and normal operation resumes.
- If the probe fails, the circuit breaker returns to `paused` state with an updated `circuit_breaker_paused_at_ms` timestamp, starting another cooldown period.

This provides self-healing for recoverable systemic failures (DNS outages, GitLab maintenance windows) without requiring manual `lore service resume` for every transient hiccup. Truly persistent problems (bad token, config corruption) are caught by the permanent error classifier and go directly to `paused` without the half-open mechanism.

The `circuit_breaker_cooldown_seconds` is stored in the manifest alongside `max_transient_failures`. Both are hardcoded defaults for v1 (10 failures, 30-minute cooldown) but can be made configurable in a future iteration.
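
The cooldown gate that admits a half-open probe is a single timestamp comparison. A sketch using the fields described above (`half_open_probe_allowed` is an illustrative name):

```rust
/// Returns true when a circuit-breaker pause has aged past its cooldown,
/// i.e. the next `service run` may proceed as a half-open probe.
/// `paused_at_ms` corresponds to circuit_breaker_paused_at_ms in the status file.
fn half_open_probe_allowed(now_ms: i64, paused_at_ms: i64, cooldown_seconds: u64) -> bool {
    now_ms >= paused_at_ms + (cooldown_seconds as i64) * 1000
}
```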

**Acceptance criteria:**

- Hidden from `--help` (use `#[command(hide = true)]`)
- Always runs in robot mode regardless of `--robot` flag
- Acquires pipeline-level lock before executing sync
- Executes stages independently and records per-stage outcomes
- Retries transient core stage failures once (1-5s jittered delay) before counting as failed
- Permanent core stage errors are never retried — immediate pause
- Classifies core stage errors as transient or permanent
- Optional stage failures produce `degraded` outcome without triggering backoff
- Respects backoff window from previous failures (reads `next_retry_at_ms` from status file)
- Pauses on permanent errors instead of burning retries
- Trips circuit breaker after 10 consecutive transient failures
- Exit code is always 0 (the scheduler should not interpret exit codes as retry signals — lore manages its own retry logic)

**Robot output (success):**

```json
{
  "ok": true,
  "data": {
    "action": "sync_completed",
    "outcome": "success",
    "profile": "balanced",
    "duration_seconds": 45.2,
    "stage_results": [
      { "stage": "issues", "success": true, "items_updated": 12 },
      { "stage": "mrs", "success": true, "items_updated": 4 },
      { "stage": "docs", "success": true, "items_updated": 28 }
    ],
    "consecutive_failures": 0
  },
  "meta": { "elapsed_ms": 45200 }
}
```

**Robot output (degraded — optional stages failed):**

```json
{
  "ok": true,
  "data": {
    "action": "sync_completed",
    "outcome": "degraded",
    "profile": "full",
    "duration_seconds": 38.1,
    "stage_results": [
      { "stage": "issues", "success": true, "items_updated": 12 },
      { "stage": "mrs", "success": true, "items_updated": 4 },
      { "stage": "docs", "success": true, "items_updated": 28 },
      { "stage": "embeddings", "success": false, "error": "Ollama unavailable" }
    ],
    "consecutive_failures": 0
  },
  "meta": { "elapsed_ms": 38100 }
}
```

**Robot output (skipped — backoff):**

```json
{
  "ok": true,
  "data": {
    "action": "skipped",
    "reason": "backoff",
    "consecutive_failures": 3,
    "next_retry_iso": "2026-02-09T14:30:00.000Z",
    "remaining_seconds": 1842
  },
  "meta": { "elapsed_ms": 1 }
}
```

**Robot output (paused — permanent error):**

```json
{
  "ok": true,
  "data": {
    "action": "paused",
    "reason": "AUTH_FAILED",
    "message": "GitLab returned 401 Unauthorized",
    "suggestion": "Rotate token, then run: lore service resume"
  },
  "meta": { "elapsed_ms": 1200 }
}
```

**Robot output (paused — circuit breaker):**

```json
{
  "ok": true,
  "data": {
    "action": "paused",
    "reason": "CIRCUIT_BREAKER",
    "message": "10 consecutive transient failures (last: NetworkError: connection refused)",
    "consecutive_failures": 10,
    "suggestion": "Check network/GitLab availability, then run: lore service resume"
  },
  "meta": { "elapsed_ms": 1200 }
}
```

---

### `lore service resume [--service <service_id|name>]`

**What it does:** Clears the paused state (including half-open circuit breaker) and resets consecutive failures, allowing the scheduler to retry on the next interval.

**User journey:**

1. User sees `lore service status` reports `scheduler: PAUSED`
2. User fixes the underlying issue (rotates token, fixes config, etc.)
3. User runs `lore service resume`
4. CLI resets `consecutive_failures` to 0, clears `paused_reason` and `last_error_*` fields
5. Next scheduled `service run` will attempt sync normally

**Acceptance criteria:**

- If not paused, exits cleanly with informational message ("Service is not paused")
- If not installed, exits cleanly with informational message ("Service is not installed")
- Does NOT trigger an immediate sync (just clears state — scheduler handles the next run)
- Robot and human output modes

**Robot output:**

```json
{
  "ok": true,
  "data": {
    "was_paused": true,
    "previous_reason": "AUTH_FAILED",
    "consecutive_failures_cleared": 5
  },
  "meta": { "elapsed_ms": 2 }
}
```

**Human output:**

```
Service resumed:
  Previous state: PAUSED (AUTH_FAILED)
  Failures cleared: 5
  Next sync will run at the scheduled interval.
```

Or for circuit breaker:

```
Service resumed:
  Previous state: PAUSED (CIRCUIT_BREAKER, 10 transient failures)
  Failures cleared: 10
  Next sync will run at the scheduled interval.
```

---

### `lore service pause [--reason <text>] [--service <service_id|name>]`

**What it does:** Pauses scheduled execution without uninstalling the service. Useful for maintenance windows, debugging, or temporarily stopping syncs while the underlying infrastructure is being modified.

**User journey:**

1. User runs `lore service pause --reason "GitLab maintenance window"`
2. CLI writes `paused_reason` to the status file with the provided reason (or "Manually paused" if no reason given)
3. Next `service run` will see the paused state and exit immediately

**Acceptance criteria:**

- Sets `paused_reason` in the status file
- Does NOT modify the OS scheduler (service remains installed and scheduled — it just no-ops)
- If already paused, updates the reason and reports `already_paused: true`
- `lore service resume` clears the pause (same as for other paused states)
- Robot and human output modes

**Robot output:**

```json
{
  "ok": true,
  "data": {
    "service_id": "a1b2c3d4e5f6",
    "paused": true,
    "reason": "GitLab maintenance window",
    "already_paused": false
  },
  "meta": { "elapsed_ms": 2 }
}
```

**Human output:**

```
Service paused (a1b2c3d4e5f6):
  Reason: GitLab maintenance window
  Resume with: lore service resume
```

---

### `lore service trigger [--ignore-backoff] [--service <service_id|name>]`

**What it does:** Triggers an immediate one-off sync using the installed service profile and policy. Unlike running `lore sync` manually, this goes through the service policy layer (status file, stage-aware outcomes, error classification) — giving you the same behavior the scheduler would produce, but on-demand.

**User journey:**

1. User runs `lore service trigger`
2. CLI reads the manifest to determine profile
3. By default, respects current backoff/paused state (reports skip reason if blocked)
4. With `--ignore-backoff`, bypasses backoff window (but NOT paused state — use `resume` for that)
5. Executes `handle_service_run` logic
6. Updates status file with the run result

**Acceptance criteria:**

- Uses the installed profile from the manifest
- Default: respects backoff and paused states
- `--ignore-backoff`: bypasses backoff window, still respects paused
- If not installed, returns actionable error
- Robot and human output modes (same format as `service run` output)

---

### `lore service repair [--service <service_id|name>]`

**What it does:** Repairs corrupt manifest or status files by backing them up and reinitializing. This is a safe alternative to manually deleting files and reinstalling.

**User journey:**

1. User runs `lore service repair` (typically after seeing `ServiceCorruptState` errors)
2. CLI checks manifest and status files for JSON parseability
3. If corrupt: renames the corrupt file to `{name}.corrupt.{timestamp}` (backup, not delete)
4. Reinitializes the status file to default state
5. If manifest is corrupt, reports that reinstallation is needed
6. Outputs what was repaired

**Acceptance criteria:**

- Never deletes files — backs up corrupt files with `.corrupt.{timestamp}` suffix
- If both files are valid, reports "No repair needed" (exit 0)
- If manifest is corrupt, clears it and advises `lore service install`
- If status file is corrupt, reinitializes to default
- Robot and human output modes

**Robot output:**

```json
{
  "ok": true,
  "data": {
    "repaired": true,
    "actions": [
      { "file": "sync-status-a1b2c3d4e5f6.json", "action": "reinitialized", "backup": "sync-status-a1b2c3d4e5f6.json.corrupt.1707480000" }
    ],
    "needs_reinstall": false
  },
  "meta": { "elapsed_ms": 5 }
}
```

**Human output:**

```
Service repaired (a1b2c3d4e5f6):
  Reinitialized: sync-status-a1b2c3d4e5f6.json
  Backed up: sync-status-a1b2c3d4e5f6.json.corrupt.1707480000
```
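
The backup naming convention is simple enough to pin down in code. A sketch (`corrupt_backup_name` is an illustrative helper name; the timestamp is a Unix epoch in seconds, matching the example output above):

```rust
/// Build the backup filename for a corrupt manifest/status file:
/// `{name}.corrupt.{timestamp}`. The original is renamed to this, never deleted.
fn corrupt_backup_name(file_name: &str, unix_ts: u64) -> String {
    format!("{file_name}.corrupt.{unix_ts}")
}
```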

---

## Install Manifest

### Location

`{get_data_dir()}/service-manifest-{service_id}.json` — e.g., `~/.local/share/lore/service-manifest-a1b2c3d4e5f6.json`

### Purpose

Avoids brittle parsing of platform-specific files (plist XML, systemd units) to recover install configuration. `service status` reads the manifest first, then verifies platform state matches. The `service_id` suffix enables multiple coexisting installations for different workspaces.

### Schema

```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ServiceManifest {
    /// Schema version for forward compatibility (start at 1)
    pub schema_version: u32,
    /// Stable identity for this service installation
    pub service_id: String,
    /// Canonical workspace root used in identity derivation
    pub workspace_root: String,
    /// When the service was first installed
    pub installed_at_iso: String,
    /// When the manifest was last written
    pub updated_at_iso: String,
    /// Platform backend
    pub platform: String,
    /// Configured interval in seconds
    pub interval_seconds: u64,
    /// Sync profile (fast/balanced/full)
    pub profile: String,
    /// Absolute path to the lore binary
    pub binary_path: String,
    /// Optional config path override
    #[serde(skip_serializing_if = "Option::is_none")]
    pub config_path: Option<String>,
    /// How the token is stored
    pub token_source: String,
    /// Token environment variable name
    pub token_env_var: String,
    /// Paths to generated service files
    pub service_files: Vec<String>,
    /// The exact command the scheduler runs
    pub sync_command: String,
    /// Circuit breaker threshold (consecutive transient failures before pause)
    pub max_transient_failures: u32,
    /// Cooldown period before circuit breaker enters half-open probe state (seconds)
    pub circuit_breaker_cooldown_seconds: u64,
    /// SHA-256 hash of generated scheduler artifacts (plist/unit/wrapper content).
    /// Used for spec-level drift detection: if file content on disk doesn't match
    /// this hash, something external modified the service files.
    pub spec_hash: String,
}
```

### `service_id` derivation

```rust
/// Compute a stable service ID from a canonical identity tuple:
/// (workspace_root + config_path + sorted project URLs).
///
/// This avoids collisions when multiple workspaces share one global config
/// by incorporating what is being synced (project URLs) and where the workspace
/// lives alongside the config location.
/// Returns first 12 hex chars of SHA-256 (48 bits — collision-safe for local use).
pub fn compute_service_id(workspace_root: &Path, config_path: &Path, project_urls: &[&str]) -> String {
    use sha2::{Sha256, Digest};
    let canonical_config = config_path.canonicalize()
        .unwrap_or_else(|_| config_path.to_path_buf());
    let canonical_workspace = workspace_root.canonicalize()
        .unwrap_or_else(|_| workspace_root.to_path_buf());
    let mut hasher = Sha256::new();
    hasher.update(canonical_workspace.to_string_lossy().as_bytes());
    hasher.update(b"\0");
    hasher.update(canonical_config.to_string_lossy().as_bytes());
    // Sort URLs for determinism regardless of config ordering
    let mut urls: Vec<&str> = project_urls.to_vec();
    urls.sort_unstable();
    for url in &urls {
        hasher.update(b"\0"); // separator to prevent concatenation collisions
        hasher.update(url.as_bytes());
    }
    let hash = hasher.finalize();
    hex::encode(&hash[..6]) // 12 hex chars
}

/// Sanitize a user-provided name to [a-z0-9-], max 32 chars.
pub fn sanitize_service_name(name: &str) -> Result<String, String> {
    let sanitized: String = name.to_lowercase()
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '-' { c } else { '-' })
        .collect();
    let trimmed = sanitized.trim_matches('-').to_string();
    if trimmed.is_empty() {
        return Err("Service name must contain at least one alphanumeric character".into());
    }
    if trimmed.len() > 32 {
        return Err("Service name must be 32 characters or fewer".into());
    }
    Ok(trimmed)
}
```

### Read/Write

- `ServiceManifest::read(path: &Path) -> Result<Option<Self>, LoreError>` — returns `Ok(None)` if file doesn't exist, `Err` if file exists but is corrupt/unparseable (distinguishes missing from corrupt). **Schema migration:** If the file has `schema_version < CURRENT_VERSION`, the read method migrates the in-memory model to the current version (adding default values for new fields) and atomically rewrites the file. If the file has an unknown future `schema_version` (higher than current), it returns `Err(ServiceCorruptState)` with an actionable message to update `lore`.
- `ServiceManifest::write_atomic(&self, path: &Path) -> std::io::Result<()>` — writes to tmp file in same directory, fsyncs, then renames over target. Creates parent dirs if needed.
- Written by `service install`, read by `service status`, `service run`, `service uninstall`
- `service uninstall` removes the manifest file
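
The tmp-write/fsync/rename pattern can be sketched with only the standard library. This is a minimal illustration, assuming a `.tmp` sibling suffix for the temporary file (the real method serializes `self` to JSON first):

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

/// Write `contents` atomically: write a tmp file in the same directory,
/// fsync it, then rename over the target so readers never observe a
/// partially written (torn) file.
fn write_atomic(path: &Path, contents: &[u8]) -> std::io::Result<()> {
    if let Some(parent) = path.parent() {
        fs::create_dir_all(parent)?; // create parent dirs if needed
    }
    let tmp = path.with_extension("tmp"); // assumption: sibling tmp name
    let mut f = File::create(&tmp)?;
    f.write_all(contents)?;
    f.sync_all()?; // flush to disk before the rename makes it visible
    fs::rename(&tmp, path)?;
    Ok(())
}
```

The tmp file must live in the same directory as the target, since `rename` is only atomic within a filesystem.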

---

## Status File

### Location

`{get_data_dir()}/sync-status-{service_id}.json` — e.g., `~/.local/share/lore/sync-status-a1b2c3d4e5f6.json`

Add `get_service_status_path(service_id: &str)` to `src/core/paths.rs`.

**Service-scoped status:** Each installed service gets its own status file, keyed by `service_id`. This prevents cross-service contamination — a `fast` profile service pausing due to transient errors should not affect a `full` profile service's state. The pipeline lock remains global (`sync_pipeline`) to prevent overlapping writes to the shared database.
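
A sketch of that path helper. The data-dir resolver is taken as a parameter here because `get_data_dir()` lives elsewhere in `src/core/paths.rs`; the real signature may differ:

```rust
use std::path::{Path, PathBuf};

/// Build the per-service status file path under the data directory.
fn get_service_status_path(data_dir: &Path, service_id: &str) -> PathBuf {
    data_dir.join(format!("sync-status-{service_id}.json"))
}
```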

### Schema

```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SyncStatusFile {
    /// Schema version for forward compatibility (start at 1)
    pub schema_version: u32,
    /// When this status file was last written
    pub updated_at_iso: String,
    /// Most recent run result (None if no runs yet — matches idle state)
    #[serde(skip_serializing_if = "Option::is_none")]
    pub last_run: Option<SyncRunRecord>,
    /// Rolling window of recent runs (last 10, newest first)
    #[serde(default)]
    pub recent_runs: Vec<SyncRunRecord>,
    /// Count of consecutive failures (resets to 0 on success or degraded outcome)
    pub consecutive_failures: u32,
    /// Persisted next retry time (set on failure, cleared on success/resume).
    /// Computed once at failure time with jitter, then read-only comparison afterward.
    /// This avoids recomputing jitter on every status check.
    #[serde(skip_serializing_if = "Option::is_none")]
    pub next_retry_at_ms: Option<i64>,
    /// If set, service is paused due to a permanent error or circuit breaker
    #[serde(skip_serializing_if = "Option::is_none")]
    pub paused_reason: Option<String>,
    /// Timestamp when circuit breaker entered paused state (for cooldown calculation)
    #[serde(skip_serializing_if = "Option::is_none")]
    pub circuit_breaker_paused_at_ms: Option<i64>,
    /// Error code that caused the pause (for machine consumption)
    #[serde(skip_serializing_if = "Option::is_none")]
    pub last_error_code: Option<String>,
    /// Error message from last failure
    #[serde(skip_serializing_if = "Option::is_none")]
    pub last_error_message: Option<String>,
    /// In-flight run metadata for crash/stale detection. Written to the status file at run start,
    /// cleared on completion (success or failure). If present when a new run starts, the previous
    /// run crashed or was killed.
    #[serde(skip_serializing_if = "Option::is_none")]
    pub current_run: Option<CurrentRunState>,
}

/// Metadata for an in-flight sync run. Used to detect stale/crashed runs.
/// Written to the status file at run start, cleared on completion (success or failure).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CurrentRunState {
    /// Unix timestamp (ms) when this run started
    pub started_at_ms: i64,
    /// PID of the process executing this run
    pub pid: u32,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SyncRunRecord {
    /// ISO-8601 timestamp of this sync run
    pub timestamp_iso: String,
    /// Unix timestamp in milliseconds
    pub timestamp_ms: i64,
    /// How long the sync took
    pub duration_seconds: f64,
    /// Run outcome: "success", "degraded", or "failed"
    pub outcome: String,
    /// Per-stage results (only present in detailed records, not in recent_runs summary)
    #[serde(default, skip_serializing_if = "Vec::is_empty")]
    pub stage_results: Vec<StageResult>,
    /// Error message if sync failed (None on success/degraded)
    #[serde(skip_serializing_if = "Option::is_none")]
    pub error_message: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StageResult {
    /// Stage name: "issues", "mrs", "docs", "embeddings"
    pub stage: String,
    /// Whether this stage completed successfully
    pub success: bool,
    /// Number of items created/updated (0 on failure)
    #[serde(default)]
    pub items_updated: usize,
    /// Error message if stage failed
    #[serde(skip_serializing_if = "Option::is_none")]
    pub error: Option<String>,
    /// Machine-readable error code from the underlying LoreError (e.g., "AUTH_FAILED", "NETWORK_ERROR").
    /// Propagated through the stage execution layer for reliable error classification.
    /// Falls back to string matching on `error` field when not available.
    #[serde(skip_serializing_if = "Option::is_none")]
    pub error_code: Option<String>,
}
```

### Read/Write

- `SyncStatusFile::read(path: &Path) -> Result<Option<Self>, LoreError>` — returns `Ok(None)` if file doesn't exist, `Err` if file exists but is corrupt/unparseable (distinguishes missing from corrupt — a corrupt status file is a warning, not fatal). **Schema migration:** Same behavior as `ServiceManifest::read` — migrates older versions to current, rejects unknown future versions.
- `SyncStatusFile::write_atomic(&self, path: &Path) -> std::io::Result<()>` — writes to tmp file in same directory, fsyncs, then renames over target. Creates parent dirs if needed. Atomic writes prevent truncated JSON from crashes during write.
- `SyncStatusFile::record_run(&mut self, run: SyncRunRecord)` — pushes to `recent_runs` (capped at 10), updates `last_run`
- `SyncStatusFile::clear_paused(&mut self)` — clears `paused_reason`, `circuit_breaker_paused_at_ms`, `last_error_*`, `next_retry_at_ms`, resets `consecutive_failures`
- File is **NOT** a fatal error source — if write fails, log a warning and continue (sync result matters more than recording it)
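
The rolling-window behavior of `record_run` can be sketched independently of the full struct. A minimal generic illustration (newest first, capped at 10, mirroring the bullets above):

```rust
/// Push a new run record onto a newest-first window capped at 10 entries,
/// and update the most-recent-run slot.
fn record_run<T: Clone>(recent_runs: &mut Vec<T>, last_run: &mut Option<T>, run: T) {
    *last_run = Some(run.clone());
    recent_runs.insert(0, run); // newest first
    recent_runs.truncate(10);   // drop the oldest beyond the window
}
```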

### Backoff Logic

Backoff applies **only** to transient errors and **only** within `service run`. Manual `lore sync` is never subject to backoff. Permanent errors bypass backoff entirely and enter the paused state.

**Key design change (from feedback):** Instead of recomputing jitter on every `service status` / `service run` check, we compute `next_retry_at_ms` **once** at failure time and persist it. This makes status output stable, avoids predictable jitter from timestamp-seeded determinism, and simplifies the read path to a single comparison.

```rust
/// Injectable time source for deterministic testing.
pub trait Clock: Send + Sync {
    fn now_ms(&self) -> i64;
}

/// Production clock using chrono.
pub struct SystemClock;
impl Clock for SystemClock {
    fn now_ms(&self) -> i64 {
        chrono::Utc::now().timestamp_millis()
    }
}

/// Injectable RNG for deterministic jitter tests.
pub trait JitterRng: Send + Sync {
    /// Returns a value in [0.0, 1.0)
    fn next_f64(&mut self) -> f64;
}

/// Production RNG using thread_rng.
pub struct ThreadJitterRng;
impl JitterRng for ThreadJitterRng {
    fn next_f64(&mut self) -> f64 {
        use rand::Rng;
        rand::thread_rng().gen()
    }
}

impl SyncStatusFile {
    /// Check if we're still in a backoff window.
    /// Returns None if sync should proceed.
    /// Returns Some(remaining_seconds) if within backoff window.
    /// Reads the persisted `next_retry_at_ms` — no jitter computation on the read path.
    pub fn backoff_remaining(&self, clock: &dyn Clock) -> Option<u64> {
        // Paused state is handled separately (not via backoff)
        if self.paused_reason.is_some() {
            return None; // caller checks paused_reason directly
        }

        if self.consecutive_failures == 0 {
            return None;
        }

        let next_retry = self.next_retry_at_ms?;
        let now_ms = clock.now_ms();

        if now_ms < next_retry {
            Some(((next_retry - now_ms) / 1000) as u64)
        } else {
            None // backoff expired, proceed
        }
    }

    /// Compute and set next_retry_at_ms after a transient failure.
    /// Called once at failure time — jitter is applied here, not on reads.
    /// Uses the *configured* interval as the backoff base (not a hardcoded value).
    /// If the server provided a retry hint (e.g., Retry-After header), it is
    /// respected as a floor: next_retry_at_ms = max(computed_backoff, hint).
    pub fn set_backoff(
        &mut self,
        base_interval_seconds: u64,
        clock: &dyn Clock,
        rng: &mut dyn JitterRng,
        retry_after_ms: Option<i64>,
    ) {
        // saturating_sub guards the failures == 0 edge; min(20) prevents shift overflow
        let exponent = self.consecutive_failures.saturating_sub(1).min(20);
        let base_backoff = (base_interval_seconds as u128)
            .saturating_mul(1u128 << exponent)
            .min(4 * 3600) as u64; // cap at 4 hours

        // Full jitter: uniform random in [base_interval..cap]
        // This decorrelates retries across multiple installations while ensuring
        // the minimum backoff is always at least the configured interval.
        let jitter_factor = rng.next_f64(); // 0.0..1.0
        let min_backoff = base_interval_seconds;
        let span = base_backoff.saturating_sub(min_backoff);
        let backoff_secs = min_backoff + ((span as f64) * jitter_factor) as u64;

        let computed_retry_at = clock.now_ms() + (backoff_secs as i64 * 1000);

        // Respect server-provided retry hint as a floor
        self.next_retry_at_ms = Some(match retry_after_ms {
            Some(hint) => computed_retry_at.max(hint),
            None => computed_retry_at,
        });
    }
}
```

**Key design decisions:**

- `next_retry_at_ms` is computed once on failure and persisted — `service status` simply reads it for stable, consistent display
- Backoff base is the configured interval, not a hardcoded 1800s — a user with `--interval 5m` gets shorter backoffs than one with `--interval 1h`
- Full jitter (random in `[base_interval..cap]`) decorrelates retries across multiple installations, avoiding thundering herd
- Injectable `JitterRng` trait enables deterministic testing without seeding from timestamps
- Paused state is checked separately from backoff — they are orthogonal concerns
- `next_retry_at_ms` is cleared on success and on `service resume`

### Backoff examples with 30m (1800s) base interval:

| consecutive_failures | max_backoff_seconds | human-readable range |
|---------------------|---------------------|----------------------|
| 1 | 1800 | 30 min (jittered within [30m, 30m]) |
| 2 | 3600 | up to 1 hour (min 30m) |
| 3 | 7200 | up to 2 hours (min 30m) |
| 4 | 14400 | up to 4 hours (capped, min 30m) |
| 5-9 | 14400 | up to 4 hours (capped, min 30m) |
| 10 | — | **circuit breaker trips → paused** |
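
The table rows follow directly from the arithmetic in `set_backoff`. A standalone sketch of the same formula (`backoff_secs` is an illustrative free function; the jitter factor is passed in explicitly so the bounds are easy to check deterministically):

```rust
/// Compute a jittered backoff in seconds: exponential in the failure count,
/// floored at the configured interval, capped at 4 hours.
fn backoff_secs(base_interval_seconds: u64, consecutive_failures: u32, jitter_factor: f64) -> u64 {
    let exponent = consecutive_failures.saturating_sub(1).min(20);
    let max_backoff = (base_interval_seconds as u128)
        .saturating_mul(1u128 << exponent)
        .min(4 * 3600) as u64; // cap at 4 hours
    // Full jitter over [base_interval..max_backoff]
    let span = max_backoff.saturating_sub(base_interval_seconds);
    base_interval_seconds + ((span as f64) * jitter_factor) as u64
}
```

With jitter_factor pinned at the extremes, the function reproduces the table's min/max columns exactly.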

---

## Service Run Implementation (`handle_service_run`)

**Critical:** Backoff, error classification, circuit breaker, stage-aware execution, and status file management live **only** in `handle_service_run`. The manual `handle_sync_cmd` is NOT modified — it does not read or write the service status file.

### Location: `src/cli/commands/service/run.rs`

```rust
pub fn handle_service_run(service_id: &str, start: std::time::Instant) -> Result<(), Box<dyn std::error::Error>> {
    let clock = SystemClock;
    let mut rng = ThreadJitterRng;

    // 1. Read manifest for the given service_id
    let manifest_path = lore::core::paths::get_service_manifest_path(service_id);
    let manifest = ServiceManifest::read(&manifest_path)?
        .ok_or_else(|| LoreError::ServiceError {
            message: format!("Service manifest not found for service_id '{service_id}'. Is the service installed?"),
        })?;

    // 2. Read status file (service-scoped)
    let status_path = lore::core::paths::get_service_status_path(&manifest.service_id);
    let mut status = match SyncStatusFile::read(&status_path) {
        Ok(Some(s)) => s,
        Ok(None) => SyncStatusFile::default(),
        Err(e) => {
            tracing::warn!(error = %e, "Corrupt status file, starting fresh");
            SyncStatusFile::default()
        }
    };

    // 3. Check paused state (permanent error or circuit breaker)
    if let Some(reason) = &status.paused_reason {
        // Check for circuit breaker half-open transition
        let is_circuit_breaker = reason.starts_with("CIRCUIT_BREAKER");
        let half_open = is_circuit_breaker
            && status.circuit_breaker_paused_at_ms.map_or(false, |paused_at| {
                let cooldown_ms = (manifest.circuit_breaker_cooldown_seconds as i64) * 1000;
                clock.now_ms() >= paused_at + cooldown_ms
            });

        if half_open {
            // Cooldown expired — allow probe run (continue to step 5)
            tracing::info!("Circuit breaker half-open: allowing probe run");
        } else {
            print_robot_json(json!({
                "ok": true,
                "data": {
                    "action": "paused",
                    "reason": reason,
                    "suggestion": if is_circuit_breaker {
                        format!("Waiting for cooldown ({}s). Or run: lore service resume",
                            manifest.circuit_breaker_cooldown_seconds)
                    } else {
                        "Fix the issue, then run: lore service resume".to_string()
                    }
                },
                "meta": { "elapsed_ms": start.elapsed().as_millis() }
            }));
            return Ok(());
        }
    }

    // 4. Check backoff (reads persisted next_retry_at_ms — no jitter computation)
    if let Some(remaining) = status.backoff_remaining(&clock) {
        print_robot_json(json!({
            "ok": true,
            "data": {
                "action": "skipped",
                "reason": "backoff",
                "consecutive_failures": status.consecutive_failures,
                // and_then flattens the nested Option from the timestamp conversion
                "next_retry_iso": status.next_retry_at_ms.and_then(|ms| {
                    chrono::DateTime::from_timestamp_millis(ms)
                        .map(|dt| dt.to_rfc3339())
                }),
                "remaining_seconds": remaining,
            },
            "meta": { "elapsed_ms": start.elapsed().as_millis() }
        }));
        return Ok(());
    }

    // 5. Acquire pipeline lock
    let lock = match AppLock::try_acquire("sync_pipeline", stale_minutes) {
        Ok(lock) => lock,
        Err(_) => {
            print_robot_json(json!({
                "ok": true,
                "data": { "action": "skipped", "reason": "locked" },
                "meta": { "elapsed_ms": start.elapsed().as_millis() }
            }));
            return Ok(());
        }
    };

    // 6. Write current_run metadata for stale-run detection
    status.current_run = Some(CurrentRunState {
        started_at_ms: clock.now_ms(),
        pid: std::process::id(),
    });
    let _ = status.write_atomic(&status_path); // best-effort

    // 7. Build sync args from profile
    let sync_args = manifest.profile_to_sync_args();

    // 8. Execute sync pipeline stage-by-stage
    let stage_results = execute_sync_stages(&sync_args);

    // 9. Classify outcome
    let core_failed = stage_results.iter()
        .any(|s| (s.stage == "issues" || s.stage == "mrs") && !s.success);
    let optional_failed = stage_results.iter()
        .any(|s| (s.stage == "docs" || s.stage == "embeddings") && !s.success);
    let all_success = stage_results.iter().all(|s| s.success);

    let outcome = if all_success {
        "success"
    } else if !core_failed && optional_failed {
        "degraded"
    } else {
        "failed"
    };

    let run = SyncRunRecord {
        timestamp_iso: chrono::Utc::now().to_rfc3339(),
        timestamp_ms: clock.now_ms(),
        duration_seconds: start.elapsed().as_secs_f64(),
        outcome: outcome.to_string(),
        stage_results: stage_results.clone(),
        error_message: if outcome == "failed" {
            stage_results.iter()
                .find(|s| !s.success)
                .and_then(|s| s.error.clone())
        } else {
            None
        },
    };
    status.record_run(run);

    match outcome {
        "success" | "degraded" => {
            // Degraded does NOT count as a failure — core data is fresh
            status.consecutive_failures = 0;
            status.next_retry_at_ms = None;
            status.paused_reason = None;
            status.last_error_code = None;
            status.last_error_message = None;
        }
        "failed" => {
            let core_error = stage_results.iter()
                .find(|s| (s.stage == "issues" || s.stage == "mrs") && !s.success);

            // Check if the underlying error is permanent
            if let Some(stage) = core_error {
                if is_permanent_stage_error(stage) {
                    status.paused_reason = Some(format!(
                        "{}: {}",
                        stage.stage,
                        stage.error.as_deref().unwrap_or("unknown error")
                    ));
                    status.last_error_code = Some("PERMANENT".to_string());
                    status.last_error_message = stage.error.clone();
                    // Don't increment consecutive_failures — we're pausing
                } else {
                    status.consecutive_failures = status.consecutive_failures.saturating_add(1);
                    status.last_error_code = Some("TRANSIENT".to_string());
                    status.last_error_message = stage.error.clone();

                    // Circuit breaker check
                    if status.consecutive_failures >= manifest.max_transient_failures {
                        status.paused_reason = Some(format!(
                            "CIRCUIT_BREAKER: {} consecutive transient failures (last: {})",
                            status.consecutive_failures,
                            stage.error.as_deref().unwrap_or("unknown")
|
|
));
|
|
status.circuit_breaker_paused_at_ms = Some(clock.now_ms());
|
|
status.next_retry_at_ms = None; // paused, not backing off
|
|
} else {
|
|
// Extract retry hint from stage error if available (e.g., Retry-After header)
|
|
let retry_hint = extract_retry_after_hint(stage);
|
|
status.set_backoff(manifest.interval_seconds, &clock, &mut rng, retry_hint);
|
|
}
|
|
}
|
|
}
|
|
}
|
|
_ => unreachable!(),
|
|
}
|
|
|
|
// 9. Clear current_run (run is complete)
|
|
status.current_run = None;
|
|
|
|
// 10. Write status atomically (best-effort)
|
|
if let Err(e) = status.write_atomic(&status_path) {
|
|
tracing::warn!(error = %e, "Failed to write sync status file");
|
|
}
|
|
|
|
// 10. Release lock (drop)
|
|
drop(lock);
|
|
|
|
// 11. Print result
|
|
print_robot_json(json!({
|
|
"ok": true,
|
|
"data": {
|
|
"action": if outcome == "failed" && status.paused_reason.is_some() { "paused" } else { "sync_completed" },
|
|
"outcome": outcome,
|
|
"profile": manifest.profile,
|
|
"duration_seconds": start.elapsed().as_secs_f64(),
|
|
"stage_results": stage_results,
|
|
"consecutive_failures": status.consecutive_failures,
|
|
},
|
|
"meta": { "elapsed_ms": start.elapsed().as_millis() }
|
|
}));
|
|
|
|
Ok(())
|
|
}
|
|
```
|
|
|
|
### Error classification helpers

```rust
/// Classify by ErrorCode (used when we have the LoreError directly)
fn is_permanent_error(e: &LoreError) -> bool {
    matches!(
        e.code(),
        ErrorCode::TokenNotSet
            | ErrorCode::AuthFailed
            | ErrorCode::ConfigNotFound
            | ErrorCode::ConfigInvalid
            | ErrorCode::MigrationFailed
    )
}

/// Classify from error_code string (primary) or error message string (fallback).
/// The error_code field is propagated through stage execution and is the
/// preferred classification mechanism. String matching on the error message
/// is a fallback for stages that don't yet propagate error_code.
fn is_permanent_stage_error(stage: &StageResult) -> bool {
    // Primary: classify by machine-readable error code
    if let Some(code) = &stage.error_code {
        return matches!(
            code.as_str(),
            "TOKEN_NOT_SET" | "AUTH_FAILED" | "CONFIG_NOT_FOUND"
                | "CONFIG_INVALID" | "MIGRATION_FAILED"
        );
    }
    // Fallback: string matching (for stages that don't yet propagate error_code)
    stage.error.as_deref().map_or(false, |m| {
        m.contains("401 Unauthorized")
            || m.contains("TokenNotSet")
            || m.contains("ConfigNotFound")
            || m.contains("ConfigInvalid")
            || m.contains("MigrationFailed")
    })
}
```

> **Implementation note:** The `error_code` field on `StageResult` is the primary classification mechanism. Each stage's execution wrapper should catch `LoreError`, extract its `ErrorCode` via `.code().to_string()`, and populate the `error_code` field. The string-matching fallback exists for robustness but should not be the primary path.
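
As a sketch of how such a stage wrapper might populate `error_code`, using local stand-in types (`run_stage` and `StageError` are hypothetical names for illustration; the real `StageResult` and `LoreError`/`ErrorCode` types live elsewhere in the codebase):

```rust
// Hypothetical stand-in for the plan's StageResult type.
#[derive(Debug, Clone)]
struct StageResult {
    stage: String,
    success: bool,
    error: Option<String>,
    error_code: Option<String>,
}

// Stand-in for a stage failure: (machine-readable code, human-readable message).
type StageError = (&'static str, String);

/// Run one stage and fold its outcome into a StageResult, populating
/// error_code so classification never needs the string-matching fallback.
fn run_stage<F>(name: &str, f: F) -> StageResult
where
    F: FnOnce() -> Result<(), StageError>,
{
    match f() {
        Ok(()) => StageResult {
            stage: name.to_string(),
            success: true,
            error: None,
            error_code: None,
        },
        Err((code, msg)) => StageResult {
            stage: name.to_string(),
            success: false,
            error: Some(msg),
            error_code: Some(code.to_string()),
        },
    }
}

fn main() {
    let ok = run_stage("issues", || Ok(()));
    let failed = run_stage("mrs", || Err(("AUTH_FAILED", "401 Unauthorized".to_string())));
    assert!(ok.success && ok.error_code.is_none());
    assert_eq!(failed.error_code.as_deref(), Some("AUTH_FAILED"));
}
```

With this shape, `is_permanent_stage_error` always takes the primary branch for wrapped stages, and the message-matching fallback is only exercised by legacy stages.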

### Pipeline lock

The `sync_pipeline` lock uses the existing `AppLock` mechanism (same as the ingest lock). It prevents:

- Two `service run` invocations overlapping (if the scheduler fires before the previous run completes)
- A `service run` overlapping with a manual `lore sync` (the manual sync also acquires this lock)

**Change to `handle_sync_cmd`:** Add `sync_pipeline` lock acquisition at the top of `handle_sync_cmd` as well. This is the **only** change to the manual sync path — no backoff, no status file writes. If the lock is already held by a `service run`, manual sync fails immediately with a clear message ("Another sync is in progress. Wait for it to complete or use --force.").

```rust
// In handle_sync_cmd, after config load:
let _pipeline_lock = AppLock::try_acquire("sync_pipeline", stale_lock_minutes)
    .map_err(|_| LoreError::ServiceError {
        message: "Another sync is in progress. Wait for it to complete or use --force.".into(),
    })?;
```

---

## Platform Backends

### Architecture

`src/cli/commands/service/platform/mod.rs` exports free functions that dispatch via `#[cfg(target_os)]`. All functions take `service_id` to construct platform-specific identifiers:

```rust
pub fn install(service_id: &str, ...) -> Result<InstallResult> {
    #[cfg(target_os = "macos")]
    return launchd::install(service_id, ...);
    #[cfg(target_os = "linux")]
    return systemd::install(service_id, ...);
    #[cfg(target_os = "windows")]
    return schtasks::install(service_id, ...);
    #[cfg(not(any(target_os = "macos", target_os = "linux", target_os = "windows")))]
    return Err(LoreError::ServiceUnsupported);
}
```

Same pattern for `uninstall()`, `is_installed()`, `get_state()`, `service_file_paths()`, `platform_name()`.

> **Architecture note:** A `SchedulerBackend` trait is the target architecture for deterministic integration testing with a `FakeBackend` that simulates install/uninstall/state without touching the OS. For v1, the `#[cfg]` dispatch + `run_cmd` helper provides adequate testability — unit tests validate template generation (string output, no OS calls) and `run_cmd` captures all OS interactions with kill+reap timeout handling. The function signatures already mirror the trait shape (`install`, `uninstall`, `is_installed`, `get_state`, `service_file_paths`, `check_prerequisites`), making the trait extraction a low-risk refactoring target for v2. When extracted, the trait should be parameterized by `service_id` and return `Result<T>` for all operations.
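
A minimal sketch of what that v2 trait extraction could look like, with an in-memory `FakeBackend` for tests (the trait shape mirrors the free functions above, but the names and the `String` error type here are assumptions, not the shipped API):

```rust
use std::collections::HashMap;

// Hypothetical v2 trait: mirrors the free-function signatures described above.
trait SchedulerBackend {
    fn install(&mut self, service_id: &str, interval_seconds: u64) -> Result<(), String>;
    fn uninstall(&mut self, service_id: &str) -> Result<bool, String>;
    fn is_installed(&self, service_id: &str) -> bool;
    fn get_state(&self, service_id: &str) -> Option<String>;
}

/// In-memory fake for deterministic integration tests — no OS calls.
#[derive(Default)]
struct FakeBackend {
    installed: HashMap<String, u64>, // service_id -> interval_seconds
}

impl SchedulerBackend for FakeBackend {
    fn install(&mut self, service_id: &str, interval_seconds: u64) -> Result<(), String> {
        self.installed.insert(service_id.to_string(), interval_seconds);
        Ok(())
    }
    fn uninstall(&mut self, service_id: &str) -> Result<bool, String> {
        Ok(self.installed.remove(service_id).is_some())
    }
    fn is_installed(&self, service_id: &str) -> bool {
        self.installed.contains_key(service_id)
    }
    fn get_state(&self, service_id: &str) -> Option<String> {
        self.installed.get(service_id).map(|_| "loaded".to_string())
    }
}

fn main() {
    let mut b = FakeBackend::default();
    b.install("default", 1800).unwrap();
    assert!(b.is_installed("default"));
    assert_eq!(b.get_state("default").as_deref(), Some("loaded"));
    assert!(b.uninstall("default").unwrap());
    assert!(!b.is_installed("default"));
}
```

Command handlers written against the trait can then be exercised end-to-end in tests without touching launchd, systemd, or schtasks.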

### Command Runner Helper

All platform backends use a shared `run_cmd` helper for consistent error handling:

```rust
/// Execute a system command with timeout and stderr capture.
/// Returns stdout on success, ServiceCommandFailed on failure.
/// On timeout, kills the child process and waits to reap it (prevents zombie processes).
fn run_cmd(program: &str, args: &[&str], timeout_secs: u64) -> Result<String> {
    let mut child = std::process::Command::new(program)
        .args(args)
        .stdout(std::process::Stdio::piped())
        .stderr(std::process::Stdio::piped())
        .spawn()
        .map_err(|e| LoreError::ServiceCommandFailed {
            cmd: format!("{} {}", program, args.join(" ")),
            exit_code: None,
            stderr: e.to_string(),
        })?;

    // Wait with timeout; on timeout kill and reap.
    // This prevents process leaks that can wedge repeated runs.
    let output = wait_with_timeout_kill_and_reap(&mut child, timeout_secs)?;

    if output.status.success() {
        Ok(String::from_utf8_lossy(&output.stdout).to_string())
    } else {
        Err(LoreError::ServiceCommandFailed {
            cmd: format!("{} {}", program, args.join(" ")),
            exit_code: output.status.code(),
            stderr: String::from_utf8_lossy(&output.stderr).to_string(),
        })
    }
}

/// Wait for child process with timeout. On timeout, sends SIGKILL and waits
/// for the process to be reaped (prevents zombie processes on Unix).
///
/// NOTE: stdout/stderr are read after exit. This is safe for scheduler commands
/// (launchctl, systemctl, schtasks) which produce small output. For commands
/// that could produce large output (>64KB), concurrent draining via threads or
/// `child.wait_with_output()` would be needed to prevent pipe backpressure deadlock.
fn wait_with_timeout_kill_and_reap(
    child: &mut std::process::Child,
    timeout_secs: u64,
) -> Result<std::process::Output> {
    use std::time::{Duration, Instant};

    let deadline = Instant::now() + Duration::from_secs(timeout_secs);

    loop {
        match child.try_wait() {
            Ok(Some(status)) => {
                let stdout = child.stdout.take().map_or(Vec::new(), |mut s| {
                    let mut buf = Vec::new();
                    std::io::Read::read_to_end(&mut s, &mut buf).unwrap_or(0);
                    buf
                });
                let stderr = child.stderr.take().map_or(Vec::new(), |mut s| {
                    let mut buf = Vec::new();
                    std::io::Read::read_to_end(&mut s, &mut buf).unwrap_or(0);
                    buf
                });
                return Ok(std::process::Output { status, stdout, stderr });
            }
            Ok(None) => {
                if Instant::now() >= deadline {
                    // Timeout: kill and reap
                    let _ = child.kill();
                    let _ = child.wait(); // reap to prevent zombie
                    return Err(LoreError::ServiceCommandFailed {
                        cmd: "(timeout)".into(),
                        exit_code: None,
                        stderr: format!("Process timed out after {timeout_secs}s"),
                    });
                }
                std::thread::sleep(Duration::from_millis(100));
            }
            Err(e) => return Err(LoreError::ServiceCommandFailed {
                cmd: "(wait)".into(),
                exit_code: None,
                stderr: e.to_string(),
            }),
        }
    }
}
```

This ensures all `launchctl`, `systemctl`, and `schtasks` failures produce consistent, machine-readable errors with the exact command, exit code, and stderr captured.

### Token Storage Helper

```rust
/// Write token to a user-private env file, scoped by service_id.
/// Returns the path to the env file.
///
/// Rejects tokens containing NUL bytes or newlines to prevent env-file injection.
/// The token is written as a raw value (not shell-quoted) and extracted via `sed`
/// in the wrapper script, never `source`d or `eval`d.
fn write_token_env_file(
    data_dir: &Path,
    service_id: &str,
    token_env_var: &str,
    token_value: &str,
) -> Result<PathBuf> {
    // Validate token content — reject values that could break env-file format
    if token_value.contains('\0') || token_value.contains('\n') || token_value.contains('\r') {
        return Err(LoreError::ServiceError {
            message: "Token contains NUL or newline characters, which are not safe for env-file storage. \
                      Use --token-source embedded instead.".into(),
        });
    }

    let env_path = data_dir.join(format!("service-env-{service_id}"));
    let content = format!("{}={}\n", token_env_var, token_value);

    // Write to a temp file, then rename (atomic on the same filesystem)
    let tmp_path = env_path.with_extension("tmp");
    std::fs::write(&tmp_path, &content)?;

    // Set permissions to 0600 (owner read/write only) BEFORE rename
    #[cfg(unix)]
    {
        use std::os::unix::fs::PermissionsExt;
        std::fs::set_permissions(&tmp_path, std::fs::Permissions::from_mode(0o600))?;
    }

    std::fs::rename(&tmp_path, &env_path)?;
    Ok(env_path)
}
```

### Function signatures

```rust
pub struct InstallResult {
    pub platform: String,
    pub service_id: String,
    pub interval_seconds: u64,
    pub profile: String,
    pub binary_path: String,
    pub config_path: Option<String>,
    pub service_files: Vec<String>,
    pub sync_command: String,
    pub token_env_var: String,
    pub token_source: String, // "env_file", "embedded", or "system_env"
}

pub struct UninstallResult {
    pub was_installed: bool,
    pub service_id: String,
    pub platform: String,
    pub removed_files: Vec<String>,
}

pub fn install(
    service_id: &str,
    binary_path: &str,
    config_path: Option<&str>,
    interval_seconds: u64,
    profile: &str,
    token_env_var: &str,
    token_value: &str,
    token_source: &str,
    log_dir: &Path,
    data_dir: &Path,
) -> Result<InstallResult>;

pub fn uninstall(service_id: &str) -> Result<UninstallResult>;
pub fn is_installed(service_id: &str) -> bool;
pub fn get_state(service_id: &str) -> Option<String>; // "loaded", "running", etc.
pub fn service_file_paths(service_id: &str) -> Vec<PathBuf>;
pub fn platform_name() -> &'static str;

/// Pre-flight check for platform-specific prerequisites.
/// Returns a list of diagnostic results.
pub fn check_prerequisites() -> Vec<DiagnosticCheck>;

pub struct DiagnosticCheck {
    pub name: String,
    pub status: DiagnosticStatus, // Pass, Warn, Fail
    pub message: Option<String>,
    pub action: Option<String>, // Suggested fix command
}
```

---

### macOS: launchd (`platform/launchd.rs`)

**Service file:** `~/Library/LaunchAgents/com.gitlore.sync.{service_id}.plist`

**Label:** `com.gitlore.sync.{service_id}`

**Wrapper script approach:** launchd cannot natively load environment files. Instead of embedding the token directly in the plist (which would persist it in a readable XML file), we generate a small wrapper shell script that reads the env file at runtime and execs `lore`. This keeps the token out of the plist entirely for the `env-file` strategy.

**Wrapper script** (`{data_dir}/service-run-{service_id}.sh`, mode 0700):

```bash
#!/bin/sh
# Generated by lore service install — do not edit
set -e
# Read token from env file (KEY=VALUE format) — never source/eval untrusted content
{token_env_var}="$(sed -n 's/^{token_env_var}=//p' "{data_dir}/service-env-{service_id}")"
export {token_env_var}
{config_export_line}
exec "{binary_path}" --robot service run --service-id "{service_id}"
```

Where `{config_export_line}` is either empty or `export LORE_CONFIG_PATH="{config_path}"`.

**Plist template** (generated via `format!()`, no crate needed):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
    "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.gitlore.sync.{service_id}</string>
    <key>ProgramArguments</key>
    <array>
        {program_arguments}
    </array>
    {env_dict}
    <key>StartInterval</key>
    <integer>{interval_seconds}</integer>
    <key>RunAtLoad</key>
    <true/>
    <key>ProcessType</key>
    <string>Background</string>
    <key>Nice</key>
    <integer>10</integer>
    <key>LowPriorityIO</key>
    <true/>
    <key>StandardOutPath</key>
    <string>{log_dir}/service-{service_id}-stdout.log</string>
    <key>StandardErrorPath</key>
    <string>{log_dir}/service-{service_id}-stderr.log</string>
    <key>TimeOut</key>
    <integer>600</integer>
</dict>
</plist>
```

Where `{program_arguments}` and `{env_dict}` depend on `token_source`:

- **env-file (default):** The plist invokes the wrapper script instead of `lore` directly. No token appears in the plist.

  ```xml
  <string>{data_dir}/service-run-{service_id}.sh</string>
  ```

  `{env_dict}` is empty (the wrapper script handles environment setup).

- **embedded:** The plist invokes `lore` directly with the token embedded in `EnvironmentVariables`.

  ```xml
  <string>{binary_path}</string>
  <string>--robot</string>
  <string>service</string>
  <string>run</string>
  <string>--service-id</string>
  <string>{service_id}</string>
  ```

  ```xml
  <key>EnvironmentVariables</key>
  <dict>
      <key>{token_env_var}</key>
      <string>{token_value}</string>
      {config_env_entry}
  </dict>
  ```

Where `{config_env_entry}` is either empty or:

```xml
<key>LORE_CONFIG_PATH</key>
<string>{config_path}</string>
```

**XML escaping:** The token value and paths must be XML-escaped. Write a helper `fn xml_escape(s: &str) -> String` that replaces `&`, `<`, `>`, `"`, `'` with their XML entity equivalents. This is critical — tokens can contain `&` or `<`.
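
One possible shape of that helper; the ordering matters, since `&` must be replaced first so the later replacements are not double-escaped:

```rust
/// Escape a string for safe embedding in plist XML text nodes.
/// `&` is replaced first so already-produced entities are not mangled.
fn xml_escape(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
        .replace('\'', "&apos;")
}

fn main() {
    assert_eq!(xml_escape("a&b<c>"), "a&amp;b&lt;c&gt;");
    assert_eq!(xml_escape(r#"tok"en'"#), "tok&quot;en&apos;");
}
```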

**Install steps:**
1. `std::fs::create_dir_all(plist_path.parent())`
2. `std::fs::write(&plist_path, plist_content)`
3. Try `launchctl bootstrap gui/{uid} {plist_path}` via `std::process::Command`
4. If that fails (older macOS), fall back to `launchctl load {plist_path}`
5. Get UID via safe wrapper: `fn current_uid() -> u32 { unsafe { libc::getuid() } }` — isolated in a single-line function with `#[allow(unsafe_code)]` exemption since `getuid()` is trivially safe (no pointers, no mutation, always succeeds). Alternatively, use the `nix` crate's `nix::unistd::Uid::current()` if already a dependency.

**Uninstall steps:**
1. Try `launchctl bootout gui/{uid}/com.gitlore.sync.{service_id}`
2. If that fails, try `launchctl unload {plist_path}`
3. `std::fs::remove_file(&plist_path)` (ignore error if the file doesn't exist)

**State detection:**
- `is_installed(service_id)`: check if the plist file exists on disk
- `get_state(service_id)`: run `launchctl list com.gitlore.sync.{service_id}`, parse exit code (0 = loaded, non-0 = not loaded)
- `get_interval_seconds(service_id)`: read the plist file, find `<key>StartInterval</key>` then the next `<integer>` value via simple string search (no XML parser needed)
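
That string-search extraction can be sketched as follows (a simplified, free-standing version — `start_interval_seconds` is an illustrative name; the real function reads the file by `service_id` first):

```rust
/// Find the <integer> that follows <key>StartInterval</key> by plain
/// string search; returns None if the key or value is missing.
fn start_interval_seconds(plist: &str) -> Option<u64> {
    // Everything after the StartInterval key.
    let after_key = plist.split("<key>StartInterval</key>").nth(1)?;
    let start = after_key.find("<integer>")? + "<integer>".len();
    let end = after_key.find("</integer>")?;
    after_key.get(start..end)?.trim().parse().ok()
}

fn main() {
    let plist = "<key>StartInterval</key>\n<integer>1800</integer>";
    assert_eq!(start_interval_seconds(plist), Some(1800));
    assert_eq!(start_interval_seconds("<dict></dict>"), None);
}
```

This only works because we generate the plist ourselves and control its layout; against arbitrary plists a real parser would be required.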

**Platform prerequisites (`check_prerequisites`):**
- Verify running in a GUI login session: check that `launchctl print gui/{uid}` succeeds. In SSH-only or headless contexts, launchd user agents won't load — return `Warn` with action "Log in via GUI or use SSH with ForwardAgent".
- This is a warning, not a hard block — some macOS setups (like `launchctl asuser`) can work around it.

---

### Linux: systemd (`platform/systemd.rs`)

**Service files:**
- `~/.config/systemd/user/lore-sync-{service_id}.service`
- `~/.config/systemd/user/lore-sync-{service_id}.timer`

**Service unit (hardened):**
```ini
[Unit]
Description=Gitlore GitLab data sync ({service_id})

[Service]
Type=oneshot
ExecStart={binary_path} --robot service run --service-id {service_id}
WorkingDirectory={data_dir}
SuccessExitStatus=0
TimeoutStartSec=900
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths={data_dir}
{token_env_line}
{config_env_line}
```

Where `{token_env_line}` depends on `token_source`:
- **env-file:** `EnvironmentFile={data_dir}/service-env-{service_id}` (systemd natively supports this — true file-based loading, no embedding)
- **embedded:** `Environment={token_env_var}={token_value}`

Where `{config_env_line}` is either empty or `Environment=LORE_CONFIG_PATH={config_path}`.

**Hardening notes:**
- `TimeoutStartSec=900` — kills stuck syncs after 15 minutes (generous but bounded)
- `NoNewPrivileges=true` — prevents privilege escalation
- `PrivateTmp=true` — isolated /tmp
- `ProtectSystem=strict` — read-only filesystem except explicitly allowed paths
- `ProtectHome=read-only` — read-only home directory
- `ReadWritePaths={data_dir}` — allows writing to the lore data directory (status files, logs, DB)

**Timer unit:**
```ini
[Unit]
Description=Gitlore sync timer ({service_id})

[Timer]
OnBootSec=5min
OnUnitInactiveSec={interval_seconds}s
AccuracySec=1min
Persistent=true
RandomizedDelaySec=60

[Install]
WantedBy=timers.target
```

**Install steps:**
1. `std::fs::create_dir_all(unit_dir)`
2. Write both files
3. Run `systemctl --user daemon-reload`
4. Run `systemctl --user enable --now lore-sync-{service_id}.timer`

**Uninstall steps:**
1. Run `systemctl --user disable --now lore-sync-{service_id}.timer` (ignore errors)
2. Remove both files
3. Run `systemctl --user daemon-reload`

**State detection:**
- `is_installed(service_id)`: check if the timer file exists
- `get_state(service_id)`: run `systemctl --user is-active lore-sync-{service_id}.timer`, capture stdout ("active", "inactive", etc.)
- `get_interval_seconds(service_id)`: read the timer file, parse the `OnUnitInactiveSec` value
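
That parse can be sketched as follows, assuming the exact `OnUnitInactiveSec={interval_seconds}s` format written at install time (`timer_interval_seconds` is an illustrative name):

```rust
/// Pull the interval out of a timer unit we generated ourselves,
/// e.g. the line "OnUnitInactiveSec=1800s".
fn timer_interval_seconds(timer_unit: &str) -> Option<u64> {
    timer_unit
        .lines()
        .find_map(|l| l.trim().strip_prefix("OnUnitInactiveSec="))
        .and_then(|v| v.trim_end_matches('s').parse().ok())
}

fn main() {
    let unit = "[Timer]\nOnBootSec=5min\nOnUnitInactiveSec=1800s\n";
    assert_eq!(timer_interval_seconds(unit), Some(1800));
    assert_eq!(timer_interval_seconds("[Timer]\n"), None);
}
```

As with the plist case, this relies on controlling the generated format; it would not handle systemd's other time-span spellings (`30min`, `1h`).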

**Platform prerequisites (`check_prerequisites`):**
- **User manager running:** Check `systemctl --user status` exits 0. If not, return `Fail` with message "systemd user manager not running. Start a user session or contact your system administrator."
- **Linger enabled:** Check `loginctl show-user $(whoami) --property=Linger` returns `Linger=yes`. If not, return `Warn` with message "loginctl linger not enabled. Timer will not fire on reboot without an active login session." and action `loginctl enable-linger $(whoami)`. This is a warning, not a block — the timer works fine when the user is logged in.

---

### Windows: schtasks (`platform/schtasks.rs`)

**Task name:** `LoreSync-{service_id}`

**Install:**
```
schtasks /create /tn "LoreSync-{service_id}" /tr "\"{binary_path}\" --robot service run --service-id {service_id}" /sc minute /mo {interval_minutes} /f
```

Note: `/mo` requires minutes, so convert seconds to minutes (round up). The schtasks minimum is 1 minute (but we enforce 5 minutes at the parse level).
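
The rounding-up conversion is a ceiling division clamped to schtasks' 1-minute floor, so the task never fires more often than the requested interval (`interval_to_minutes` is an illustrative name):

```rust
/// Convert a stored interval in seconds to schtasks' /mo modifier (minutes),
/// rounding up and clamping to the 1-minute minimum.
fn interval_to_minutes(interval_seconds: u64) -> u64 {
    ((interval_seconds + 59) / 60).max(1)
}

fn main() {
    assert_eq!(interval_to_minutes(1800), 30); // 30m -> 30
    assert_eq!(interval_to_minutes(90), 2);    // rounds up, never shortens
    assert_eq!(interval_to_minutes(30), 1);    // clamped to schtasks minimum
}
```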

**Token handling on Windows:** The env var must be set system-wide via `setx` or be present in the user's environment. Neither the env-file nor the embedded strategy applies — Windows scheduled tasks inherit the user's environment. Set `token_source: "system_env"` in the result and document this as a requirement.

**Uninstall:**
```
schtasks /delete /tn "LoreSync-{service_id}" /f
```

**State detection:**
- `is_installed(service_id)`: run `schtasks /query /tn "LoreSync-{service_id}"`, check exit code (0 = exists)
- `get_state(service_id)`: parse the output of `schtasks /query /tn "LoreSync-{service_id}" /fo CSV /v`, extract the "Status" column
- `get_interval_seconds(service_id)`: parse "Repeat: Every" from the verbose output, or store the value ourselves
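
A naive sketch of pulling the "Status" column out of that CSV output (assumes fields contain no embedded commas; a real implementation should use a proper CSV parser, and the exact column set varies by Windows version):

```rust
/// Extract the "Status" field from the first data row of
/// `schtasks /query /fo CSV /v` output. Naive: splits on ','
/// and strips surrounding quotes, so embedded commas break it.
fn task_status(csv: &str) -> Option<String> {
    let mut lines = csv.lines();
    let header: Vec<&str> = lines.next()?.split(',').map(|f| f.trim_matches('"')).collect();
    let idx = header.iter().position(|h| *h == "Status")?;
    let row: Vec<&str> = lines.next()?.split(',').map(|f| f.trim_matches('"')).collect();
    row.get(idx).map(|s| s.to_string())
}

fn main() {
    let csv = "\"HostName\",\"TaskName\",\"Status\"\n\"PC\",\"\\LoreSync-default\",\"Ready\"";
    assert_eq!(task_status(csv).as_deref(), Some("Ready"));
    assert_eq!(task_status("\"HostName\"\n\"PC\""), None);
}
```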

**Platform prerequisites (`check_prerequisites`):**
- Verify `schtasks` is available: run `schtasks /?` and check the exit code. Return `Fail` if not found.

---

## Interval Parsing

```rust
/// Parse interval strings like "15m", "1h", "30m", "2h", "24h".
/// Only minutes (m) and hours (h) are accepted — seconds are not exposed
/// because the minimum interval is 5 minutes and sub-minute granularity
/// would be confusing for a scheduled sync.
pub fn parse_interval(input: &str) -> std::result::Result<u64, String> {
    let input = input.trim();

    let (num_str, multiplier) = if let Some(n) = input.strip_suffix('m') {
        (n, 60u64)
    } else if let Some(n) = input.strip_suffix('h') {
        (n, 3600u64)
    } else {
        return Err(format!(
            "Invalid interval '{input}'. Use format like 15m, 30m, 1h, 2h"
        ));
    };

    let num: u64 = num_str
        .parse()
        .map_err(|_| format!("Invalid number in interval: '{num_str}'"))?;

    if num == 0 {
        return Err("Interval must be greater than 0".to_string());
    }

    // checked_mul guards against overflow on absurdly large inputs
    let seconds = num
        .checked_mul(multiplier)
        .ok_or_else(|| format!("Interval too large: '{input}'"))?;

    if seconds < 300 {
        return Err(format!(
            "Minimum interval is 5m (got {input}, which is {seconds}s)"
        ));
    }
    if seconds > 86400 {
        return Err(format!(
            "Maximum interval is 24h (got {input}, which is {seconds}s)"
        ));
    }

    Ok(seconds)
}
```

---

## Error Types

### Additions to `src/core/error.rs`

**ErrorCode enum:**
```rust
ServiceError,         // Add after Ambiguous
ServiceCommandFailed, // OS command (launchctl/systemctl/schtasks) failed
ServiceCorruptState,  // Manifest or status file is corrupt/unparseable
```

**ErrorCode::exit_code():**
```rust
Self::ServiceError => 21,
Self::ServiceCommandFailed => 22,
Self::ServiceCorruptState => 23,
```

**ErrorCode::Display:**
```rust
Self::ServiceError => "SERVICE_ERROR",
Self::ServiceCommandFailed => "SERVICE_COMMAND_FAILED",
Self::ServiceCorruptState => "SERVICE_CORRUPT_STATE",
```

**LoreError enum:**
```rust
#[error("Service error: {message}")]
ServiceError { message: String },

#[error("Service management not supported on this platform. Requires macOS (launchd), Linux (systemd), or Windows (schtasks).")]
ServiceUnsupported,

#[error("Service command failed: {cmd} (exit {exit_code:?}): {stderr}")]
ServiceCommandFailed {
    cmd: String,
    exit_code: Option<i32>,
    stderr: String,
},

#[error("Service state file corrupt: {path}: {reason}")]
ServiceCorruptState {
    path: String,
    reason: String,
},
```

**LoreError::code():**
```rust
Self::ServiceError { .. } => ErrorCode::ServiceError,
Self::ServiceUnsupported => ErrorCode::ServiceError,
Self::ServiceCommandFailed { .. } => ErrorCode::ServiceCommandFailed,
Self::ServiceCorruptState { .. } => ErrorCode::ServiceCorruptState,
```

**LoreError::suggestion():**
```rust
Self::ServiceError { .. } => Some("Check service status: lore service status\nRun diagnostics: lore service doctor\nView logs: lore service logs"),
Self::ServiceUnsupported => Some("Requires macOS (launchd), Linux (systemd), or Windows (schtasks)"),
Self::ServiceCommandFailed { .. } => Some("Check service logs: lore service logs\nRun diagnostics: lore service doctor\nTry reinstalling: lore service install"),
Self::ServiceCorruptState { .. } => Some("Run: lore service repair\nThen reinstall if needed: lore service install"),
```

**LoreError::actions():**
```rust
Self::ServiceError { .. } => vec!["lore service status", "lore service doctor", "lore service logs"],
Self::ServiceUnsupported => vec![],
Self::ServiceCommandFailed { .. } => vec!["lore service logs", "lore service doctor", "lore service install"],
Self::ServiceCorruptState { .. } => vec!["lore service repair", "lore service install"],
```

---

## CLI Definition Changes

### `src/cli/mod.rs`

Add to the `Commands` enum (after `Who(WhoArgs)`, before the hidden commands):

```rust
/// Manage the OS-native scheduled sync service
Service {
    #[command(subcommand)]
    command: ServiceCommand,
},
```

Add the `ServiceCommand` enum (can be in the same file or re-exported from `service/mod.rs`):

```rust
#[derive(Subcommand)]
pub enum ServiceCommand {
    /// Install the scheduled sync service
    Install {
        /// Sync interval (e.g., 15m, 30m, 1h). Default: 30m. Min: 5m. Max: 24h.
        #[arg(long, default_value = "30m")]
        interval: String,
        /// Sync profile: fast (issues+MRs), balanced (+ docs), full (+ embeddings)
        #[arg(long, default_value = "balanced")]
        profile: String,
        /// Token storage: env-file (default, 0600 perms) or embedded (in service file)
        #[arg(long, default_value = "env-file")]
        token_source: String,
        /// Custom service name (default: derived from config path hash).
        /// Useful when managing multiple installations for readability.
        #[arg(long)]
        name: Option<String>,
        /// Validate and render service files without writing or executing anything
        #[arg(long)]
        dry_run: bool,
    },
    /// Remove the scheduled sync service
    Uninstall {
        /// Target a specific service by ID or name (default: current project)
        #[arg(long)]
        service: Option<String>,
        /// Uninstall all services
        #[arg(long)]
        all: bool,
    },
    /// List all installed services
    List,
    /// Show service status and last sync result
    Status {
        /// Target a specific service by ID or name (default: current project)
        #[arg(long)]
        service: Option<String>,
    },
    /// View service logs
    Logs {
        /// Show last N lines (default: 100)
        #[arg(long)]
        tail: Option<Option<usize>>,
        /// Stream new log lines as they arrive (like tail -f)
        #[arg(long)]
        follow: bool,
        /// Open log file in editor instead of printing to stdout
        #[arg(long)]
        open: bool,
        /// Target a specific service by ID or name
        #[arg(long)]
        service: Option<String>,
    },
    /// Clear paused state and reset failure counter
    Resume {
        /// Target a specific service by ID or name (default: current project)
        #[arg(long)]
        service: Option<String>,
    },
    /// Pause scheduled execution without uninstalling
    Pause {
        /// Reason for pausing (shown in status output)
        #[arg(long)]
        reason: Option<String>,
        /// Target a specific service by ID or name (default: current project)
        #[arg(long)]
        service: Option<String>,
    },
    /// Trigger an immediate one-off sync using the installed profile
    Trigger {
        /// Bypass the backoff window (still respects paused state)
        #[arg(long)]
        ignore_backoff: bool,
        /// Target a specific service by ID or name (default: current project)
        #[arg(long)]
        service: Option<String>,
    },
    /// Repair corrupt manifest or status files
    Repair {
        /// Target a specific service by ID or name (default: current project)
        #[arg(long)]
        service: Option<String>,
    },
    /// Validate service environment and prerequisites
    Doctor {
        /// Skip network checks (token validation)
        #[arg(long)]
        offline: bool,
        /// Attempt safe, non-destructive fixes for detected issues
        #[arg(long)]
        fix: bool,
    },
    /// Execute one scheduled sync attempt (called by OS scheduler, hidden from help)
    #[command(hide = true)]
    Run {
        /// Internal selector injected by the scheduler backend — identifies which
        /// service manifest and status file to use for this run.
        #[arg(long, hide = true)]
        service_id: String,
    },
}
```
|
|
|
|
### `src/cli/commands/mod.rs`
|
|
|
|
Add:
|
|
```rust
|
|
pub mod service;
|
|
```
|
|
|
|
No re-exports needed — the dispatch goes through `service::handle_install`, etc. directly.
|
|
|
|
### `src/main.rs` dispatch

Add import:

```rust
use lore::cli::ServiceCommand;
```

Add match arm (before the hidden commands):

```rust
Some(Commands::Service { command }) => {
    handle_service(cli.config.as_deref(), command, robot_mode)
}
```

Add handler function:

```rust
fn handle_service(
    config_override: Option<&str>,
    command: ServiceCommand,
    robot_mode: bool,
) -> Result<(), Box<dyn std::error::Error>> {
    let start = std::time::Instant::now();
    match command {
        ServiceCommand::Install { interval, profile, token_source, name, dry_run } => {
            lore::cli::commands::service::handle_install(
                config_override, &interval, &profile, &token_source, name.as_deref(),
                dry_run, robot_mode, start,
            )
        }
        ServiceCommand::Uninstall { service, all } => {
            lore::cli::commands::service::handle_uninstall(service.as_deref(), all, robot_mode, start)
        }
        ServiceCommand::List => {
            lore::cli::commands::service::handle_list(robot_mode, start)
        }
        ServiceCommand::Status { service } => {
            lore::cli::commands::service::handle_status(config_override, service.as_deref(), robot_mode, start)
        }
        ServiceCommand::Logs { tail, follow, open, service } => {
            lore::cli::commands::service::handle_logs(tail, follow, open, service.as_deref(), robot_mode, start)
        }
        ServiceCommand::Resume { service } => {
            lore::cli::commands::service::handle_resume(service.as_deref(), robot_mode, start)
        }
        ServiceCommand::Pause { reason, service } => {
            lore::cli::commands::service::handle_pause(service.as_deref(), reason.as_deref(), robot_mode, start)
        }
        ServiceCommand::Trigger { ignore_backoff, service } => {
            lore::cli::commands::service::handle_trigger(service.as_deref(), ignore_backoff, robot_mode, start)
        }
        ServiceCommand::Repair { service } => {
            lore::cli::commands::service::handle_repair(service.as_deref(), robot_mode, start)
        }
        ServiceCommand::Doctor { offline, fix } => {
            lore::cli::commands::service::handle_doctor(config_override, offline, fix, robot_mode, start)
        }
        ServiceCommand::Run { service_id } => {
            // Always robot mode for scheduled execution
            lore::cli::commands::service::handle_service_run(&service_id, start)
        }
    }
}
```

---

## Autocorrect Registry

### `src/cli/autocorrect.rs`

Add to `COMMAND_FLAGS` array (before the hidden commands):

```rust
("service", &["--interval", "--profile", "--token-source", "--name", "--dry-run", "--tail", "--follow", "--open", "--offline", "--fix", "--service", "--all", "--reason", "--ignore-backoff"]),
```

**Important:** The `registry_covers_command_flags` test in autocorrect.rs uses clap introspection to verify that all flags are registered. Since `service` is a nested subcommand, verify whether this test recurses into subcommands. If it does, the test will fail without this entry. If it doesn't recurse (it only checks top-level subcommands), the test passes, but we should still add the entry for correctness.

Looking at the test (lines 868-908): it iterates `cmd.get_subcommands()`, which yields only the top-level subcommands. The `Service` variant uses `#[command(subcommand)]`, which means clap exposes `service` as a subcommand with its own sub-subcommands. The test won't recurse into `install`'s flags, and `service` itself has no direct flags (only its subcommands do), so an empty entry or an omission would pass the test. Registering the full flag list under `"service"` is conservative and correct — flags like `--interval` live on the `install` sub-subcommand, but listing them here causes no issues.

However, `detect_subcommand` only finds the *first* positional arg. For `lore service install --intervl 30m`, it returns `"service"`, not `"install"`. So the `--interval` flag needs to be registered under `"service"` for fuzzy matching to work.

---

## robot-docs Manifest

### Addition to `handle_robot_docs` in `src/main.rs`

Add to the `commands` JSON object:

```rust
"service": {
    "description": "Manage OS-native scheduled sync service",
    "subcommands": {
        "install": {
            "description": "Install scheduled sync service",
            "flags": ["--interval <duration>", "--profile <fast|balanced|full>", "--token-source <env-file|embedded>", "--name <optional>", "--dry-run"],
            "defaults": { "interval": "30m", "profile": "balanced", "token_source": "env-file" },
            "example": "lore --robot service install --interval 15m --profile fast",
            "response_schema": {
                "ok": "bool",
                "data.platform": "string (launchd|systemd|schtasks)",
                "data.service_id": "string",
                "data.interval_seconds": "number",
                "data.profile": "string",
                "data.binary_path": "string",
                "data.service_files": "[string]",
                "data.token_source": "string (env_file|embedded|system_env)",
                "data.no_change": "bool"
            }
        },
        "uninstall": {
            "description": "Remove scheduled sync service",
            "flags": ["--service <service_id|name>", "--all"],
            "example": "lore --robot service uninstall",
            "response_schema": {
                "ok": "bool",
                "data.was_installed": "bool",
                "data.service_id": "string",
                "data.removed_files": "[string]"
            }
        },
        "list": {
            "description": "List all installed services",
            "example": "lore --robot service list",
            "response_schema": {
                "ok": "bool",
                "data.services": "[{service_id, platform, interval_seconds, profile, installed_at_iso, platform_state, drift}]"
            }
        },
        "status": {
            "description": "Show service status, scheduler state, and recent runs",
            "flags": ["--service <service_id|name>"],
            "example": "lore --robot service status",
            "response_schema": {
                "ok": "bool",
                "data.installed": "bool",
                "data.service_id": "string|null",
                "data.platform": "string",
                "data.interval_seconds": "number|null",
                "data.profile": "string|null",
                "data.scheduler_state": "string (idle|running|running_stale|degraded|backoff|half_open|paused|not_installed)",
                "data.last_sync": "SyncRunRecord|null",
                "data.recent_runs": "[SyncRunRecord]",
                "data.backoff": "object|null",
                "data.paused_reason": "string|null",
                "data.drift": "object|null {platform_drift: bool, spec_drift: bool, command_drift: bool}"
            }
        },
        "logs": {
            "description": "View service logs (human: editor/tail, robot: path + optional lines)",
            "flags": ["--tail <n>", "--follow", "--open", "--service <service_id|name>"],
            "example": "lore --robot service logs --tail 50",
            "response_schema": {
                "ok": "bool",
                "data.log_path": "string",
                "data.exists": "bool",
                "data.size_bytes": "number",
                "data.last_lines": "[string]|null"
            }
        },
        "resume": {
            "description": "Clear paused state and reset failure counter",
            "flags": ["--service <service_id|name>"],
            "example": "lore --robot service resume",
            "response_schema": {
                "ok": "bool",
                "data.was_paused": "bool",
                "data.previous_reason": "string|null",
                "data.consecutive_failures_cleared": "number"
            }
        },
        "pause": {
            "description": "Pause scheduled execution without uninstalling",
            "flags": ["--reason <text>", "--service <service_id|name>"],
            "example": "lore --robot service pause --reason 'maintenance'",
            "response_schema": {
                "ok": "bool",
                "data.service_id": "string",
                "data.paused": "bool",
                "data.reason": "string",
                "data.already_paused": "bool"
            }
        },
        "trigger": {
            "description": "Trigger immediate one-off sync using installed profile",
            "flags": ["--ignore-backoff", "--service <service_id|name>"],
            "example": "lore --robot service trigger",
            "response_schema": "Same as service run output"
        },
        "repair": {
            "description": "Repair corrupt manifest or status files",
            "flags": ["--service <service_id|name>"],
            "example": "lore --robot service repair",
            "response_schema": {
                "ok": "bool",
                "data.repaired": "bool",
                "data.actions": "[{file, action, backup?}]",
                "data.needs_reinstall": "bool"
            }
        },
        "doctor": {
            "description": "Validate service environment and prerequisites",
            "flags": ["--offline", "--fix"],
            "example": "lore --robot service doctor",
            "response_schema": {
                "ok": "bool",
                "data.checks": "[{name, status, message?, action?}]",
                "data.overall": "string (pass|warn|fail)"
            }
        }
    }
}
```

---

## Paths Module Additions

### `src/core/paths.rs`

```rust
pub fn get_service_status_path(service_id: &str) -> PathBuf {
    get_data_dir().join(format!("sync-status-{service_id}.json"))
}

pub fn get_service_manifest_path(service_id: &str) -> PathBuf {
    get_data_dir().join(format!("service-manifest-{service_id}.json"))
}

pub fn get_service_env_path(service_id: &str) -> PathBuf {
    get_data_dir().join(format!("service-env-{service_id}"))
}

pub fn get_service_wrapper_path(service_id: &str) -> PathBuf {
    get_data_dir().join(format!("service-run-{service_id}.sh"))
}

pub fn get_service_log_path(service_id: &str, stream: &str) -> PathBuf {
    get_data_dir().join("logs").join(format!("service-{service_id}-{stream}.log"))
}

// stream values: "stdout" or "stderr"
// Example: get_service_log_path("a1b2c3d4e5f6", "stderr")
// => ~/.local/share/lore/logs/service-a1b2c3d4e5f6-stderr.log

/// List all installed service IDs by scanning for manifest files.
pub fn list_service_ids() -> Vec<String> {
    let data_dir = get_data_dir();
    let Ok(entries) = std::fs::read_dir(&data_dir) else {
        return Vec::new(); // missing data dir means no services installed
    };
    entries
        .filter_map(|entry| {
            let name = entry.ok()?.file_name().to_string_lossy().to_string();
            name.strip_prefix("service-manifest-")
                .and_then(|s| s.strip_suffix(".json"))
                .map(String::from)
        })
        .collect()
}
```

Note: Status files are scoped by `service_id` — each installed service gets independent backoff/paused/circuit-breaker state. The pipeline lock remains global (`sync_pipeline`) to prevent overlapping writes to the shared database.

---

## Core Module Registration

### `src/core/mod.rs`

Add:

```rust
pub mod sync_status;
pub mod service_manifest;
```

---

## File-by-File Implementation Details

### `src/core/sync_status.rs` (NEW)

- `SyncRunRecord` struct with Serialize + Deserialize + Clone
- `StageResult` struct with Serialize + Deserialize + Clone
- `SyncStatusFile` struct with Serialize + Deserialize + Default (schema_version=1)
- `Clock` trait + `SystemClock` impl (for deterministic testing)
- `JitterRng` trait + `ThreadJitterRng` impl (for deterministic jitter testing)
- `parse_interval(input: &str) -> Result<u64, String>`
- `SyncStatusFile::read(path: &Path) -> Result<Option<Self>, LoreError>` — distinguishes missing from corrupt
- `SyncStatusFile::write_atomic(&self, path: &Path) -> std::io::Result<()>` — tmp+fsync+rename
- `SyncStatusFile::record_run(&mut self, run: SyncRunRecord)` — push to recent_runs (capped at 10)
- `SyncStatusFile::clear_paused(&mut self)` — reset paused_reason, errors, failures, next_retry_at_ms
- `SyncStatusFile::backoff_remaining(&self, clock: &dyn Clock) -> Option<u64>` — reads persisted next_retry_at_ms
- `SyncStatusFile::set_backoff(&mut self, base_interval_seconds, clock, rng, retry_after_hint_ms)` — compute and persist next_retry_at_ms, honoring an optional server-provided `Retry-After` deadline
- `fn is_permanent_error(code: &ErrorCode) -> bool`
- `fn is_permanent_stage_error(stage: &StageResult) -> bool` — primary: error_code, fallback: string matching
- `SyncStatusFile::is_circuit_breaker_half_open(&self, manifest: &ServiceManifest, clock: &dyn Clock) -> bool` — checks if cooldown has expired
- Unit tests for all of the above

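A minimal sketch of `parse_interval`, assuming the bounds exercised by the unit tests later in this plan (minutes and hours only, 5m minimum, 24h maximum, whitespace trimmed); the error messages here are illustrative:

```rust
/// Parse "30m" / "2h" into seconds. Sketch only — bounds assumed
/// from this plan's unit tests (5m minimum, 24h maximum, no seconds).
pub fn parse_interval(input: &str) -> Result<u64, String> {
    let s = input.trim();
    if !s.is_ascii() || s.len() < 2 {
        return Err(format!("invalid interval: {input:?}"));
    }
    // Split off the trailing unit character (safe: ASCII-only above).
    let (num, unit) = s.split_at(s.len() - 1);
    let mult: u64 = match unit {
        "m" => 60,
        "h" => 3600,
        _ => return Err(format!("interval must end in 'm' or 'h': {input:?}")),
    };
    let n: u64 = num
        .parse()
        .map_err(|_| format!("invalid number in interval: {input:?}"))?;
    let seconds = n.saturating_mul(mult);
    if !(300..=86_400).contains(&seconds) {
        return Err(format!("interval must be between 5m and 24h, got {input:?}"));
    }
    Ok(seconds)
}
```

Validating in seconds (rather than per-unit ranges) keeps a single bounds check covering both `"4m"` and `"25h"`.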
### `src/core/service_manifest.rs` (NEW)

- `ServiceManifest` struct with Serialize + Deserialize (schema_version=1), includes `workspace_root` and `spec_hash`
- `ServiceManifest::read(path: &Path) -> Result<Option<Self>, LoreError>` — distinguishes missing from corrupt
- `ServiceManifest::write_atomic(&self, path: &Path) -> std::io::Result<()>` — tmp+fsync+rename
- `ServiceManifest::profile_to_sync_args(&self) -> Vec<String>` — maps profile to sync CLI flags
- `compute_service_id(workspace_root: &Path, config_path: &Path, project_urls: &[&str]) -> String` — composite fingerprint (workspace root + config path + sorted project URLs), first 12 hex chars of SHA-256
- `sanitize_service_name(name: &str) -> Result<String, String>` — `[a-z0-9-]`, max 32 chars
- `DiagnosticCheck` struct, `DiagnosticStatus` enum (Pass/Warn/Fail)
- Unit tests for profile mapping, service_id computation, name sanitization

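The fingerprint and name rules above can be sketched as follows. The plan specifies SHA-256; to keep this example dependency-free, std's `DefaultHasher` stands in for it here, and the exact error strings are assumptions:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::Path;

/// Composite fingerprint sketch. The real implementation would use
/// SHA-256; DefaultHasher is a stand-in so the example has no crate deps.
fn compute_service_id(workspace_root: &Path, config_path: &Path, project_urls: &[&str]) -> String {
    let mut urls: Vec<&str> = project_urls.to_vec();
    urls.sort_unstable(); // order-independent fingerprint
    let mut h = DefaultHasher::new();
    workspace_root.hash(&mut h);
    config_path.hash(&mut h);
    urls.hash(&mut h);
    // First 12 hex chars, mirroring the truncated-SHA-256 scheme.
    format!("{:016x}", h.finish())[..12].to_string()
}

/// Enforce `[a-z0-9-]`, max 32 chars.
fn sanitize_service_name(name: &str) -> Result<String, String> {
    let n = name.trim();
    if n.is_empty() || n.len() > 32 {
        return Err("service name must be 1-32 characters".into());
    }
    if !n.chars().all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-') {
        return Err("service name may only contain [a-z0-9-]".into());
    }
    Ok(n.to_string())
}
```

Sorting the URLs before hashing means two installs of the same projects listed in different config order resolve to the same service.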
### `src/cli/commands/service/mod.rs` (NEW)

- Re-exports from submodules: `handle_install`, `handle_uninstall`, `handle_list`, `handle_status`, `handle_logs`, `handle_resume`, `handle_pause`, `handle_trigger`, `handle_repair`, `handle_doctor`, `handle_service_run`
- Shared `resolve_service_id(selector: Option<&str>, config_override: Option<&str>) -> Result<String>` helper: resolves `--service` flag, or derives from current config path. If multiple services exist and no selector provided, returns actionable error listing available services.
- Shared `acquire_admin_lock(service_id: &str) -> Result<AppLock>` helper: acquires `AppLock("service-admin-{service_id}")` for state mutation commands. Used by install, uninstall, pause, resume, and repair. NOT used by `service run` (which only acquires `sync_pipeline`).
- Imports from submodules

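The selector-resolution rule can be sketched as below. To stay self-contained, the installed IDs and the current-project ID are parameters here; the real helper would call `list_service_ids()` and derive the current-project ID from the active config, and would also match against the optional service name:

```rust
/// Resolution sketch: explicit selector first, then the current
/// project's service, then the only installed service; otherwise
/// fail with an actionable list.
fn resolve_service_id(
    selector: Option<&str>,
    installed: &[String],
    current_project_id: Option<&str>,
) -> Result<String, String> {
    if let Some(sel) = selector {
        return installed
            .iter()
            .find(|id| id.as_str() == sel)
            .cloned()
            .ok_or_else(|| format!("no installed service matches {sel:?}"));
    }
    if let Some(id) = current_project_id {
        if installed.iter().any(|i| i == id) {
            return Ok(id.to_string());
        }
    }
    match installed {
        [] => Err("no services installed".into()),
        [only] => Ok(only.clone()),
        many => Err(format!(
            "multiple services installed, pass --service: {}",
            many.join(", ")
        )),
    }
}
```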
### `src/cli/commands/service/install.rs` (NEW)

- `handle_install(config_override, interval_str, profile, token_source, name, dry_run, robot_mode, start) -> Result<()>`
- Validates profile is one of `fast|balanced|full`
- Validates token_source is one of `env-file|embedded`
- Computes or validates `service_id` from `--name` or composite fingerprint (workspace root + config path + project URLs). If `--name` is provided and collides with an existing service with a different identity hash, returns an actionable error.
- Acquires admin lock `AppLock("service-admin-{service_id}")` before mutating any files
- Runs `doctor` pre-flight checks; aborts on any `Fail` result
- Loads config, resolves token, resolves binary path
- Writes token to env file (if env-file strategy, scoped by service_id)
- On macOS with env-file: generates wrapper script at `{data_dir}/service-run-{service_id}.sh` (mode 0700)
- Calls `platform::install(service_id, ...)`
- **Transactional**: on enable success, writes install manifest atomically. On enable failure, removes generated service files and wrapper script, returns `ServiceCommandFailed`.
- Compares with existing manifest to detect no-change case
- Prints result (robot JSON or human-readable)

### `src/cli/commands/service/uninstall.rs` (NEW)

- `handle_uninstall(service_selector, all, robot_mode, start) -> Result<()>`
- Resolves target service via selector or current-project default
- With `--all`: iterates all discovered manifests
- Reads manifest to find service_id
- Calls `platform::uninstall(service_id)`
- Removes install manifest (`service-manifest-{service_id}.json`)
- Removes env file (`service-env-{service_id}`) if exists
- Removes wrapper script (`service-run-{service_id}.sh`) if exists (macOS)
- Does NOT remove the status file or log files (those are operational data, not config)
- Outputs confirmation

### `src/cli/commands/service/status.rs` (NEW)

- `handle_status(config_override, service_selector, robot_mode, start) -> Result<()>`
- Reads install manifest (primary source for config and service_id)
- Calls `platform::is_installed(service_id)`, `get_state(service_id)` to verify platform state
- Detects drift: platform drift (loaded/unloaded), spec drift (content hash vs `spec_hash`), command drift
- Reads `SyncStatusFile` for last sync and recent runs
- Detects stale runs via `current_run` metadata: checks if PID is alive and `started_at_ms` is within 30 minutes
- Computes scheduler state from status + manifest (including `degraded`, `running_stale`)
- Computes backoff info from persisted `next_retry_at_ms`
- Prints combined status

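The stale-run rule reduces to a small predicate. PID liveness probing is platform-specific, so it is a boolean parameter in this sketch; the constant and function names are illustrative:

```rust
/// A `current_run` entry is trusted only if its PID is still alive
/// and it started within the last 30 minutes; otherwise the run is
/// reported as `running_stale`.
const STALE_RUN_THRESHOLD_MS: i64 = 30 * 60 * 1000;

fn classify_current_run(started_at_ms: i64, pid_alive: bool, now_ms: i64) -> &'static str {
    if pid_alive && now_ms - started_at_ms <= STALE_RUN_THRESHOLD_MS {
        "running"
    } else {
        "running_stale"
    }
}
```

A stale entry typically means a scheduled run crashed before clearing `current_run`, so status should not report the pipeline as busy forever.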
### `src/cli/commands/service/logs.rs` (NEW)

- `handle_logs(tail, follow, open, service_selector, robot_mode, start) -> Result<()>`
- `--tail`: read last N lines, output directly to stdout (or as JSON array in robot mode)
- `--follow`: stream new lines (human mode only; robot mode returns error)
- Default (no flags): print last 100 lines to stdout (human) or return path metadata (robot)
- Robot mode with `--tail`: includes `last_lines` field (capped at 100)

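The slicing rule behind `--tail` can be sketched as below. The real command would read the file incrementally rather than loading it whole; this only shows the last-N selection:

```rust
/// Return the last `n` lines of the given log content.
fn tail_lines(content: &str, n: usize) -> Vec<String> {
    let lines: Vec<&str> = content.lines().collect();
    // saturating_sub keeps the start index valid when n > line count.
    let start = lines.len().saturating_sub(n);
    lines[start..].iter().map(|s| s.to_string()).collect()
}
```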
### `src/cli/commands/service/resume.rs` (NEW)

- `handle_resume(service_selector, robot_mode, start) -> Result<()>`
- Reads status file, clears paused state (including circuit breaker), writes back atomically
- Prints confirmation with previous reason

### `src/cli/commands/service/doctor.rs` (NEW)

- `handle_doctor(config_override, offline, fix, robot_mode, start) -> Result<()>`
- Runs diagnostic checks: config, token, binary, data dir, platform prerequisites, install state
- Skips network checks when `--offline`
- `--fix`: attempts safe, non-destructive remediations (create dirs, fix permissions, daemon-reload). Reports each applied fix.
- Reports pass/warn/fail per check
- Also used as pre-flight by `handle_install` (as an internal function call, without `--fix`)

### `src/cli/commands/service/run.rs` (NEW)

- `handle_service_run(service_id: &str, start) -> Result<()>`
- The hidden scheduled execution entrypoint; `service_id` is injected by the scheduler command line
- Reads manifest for the given `service_id` to get profile/interval/max_transient_failures/circuit_breaker_cooldown_seconds
- Checks paused state with half-open transition (cooldown check), backoff (via persisted next_retry_at_ms), pipeline lock
- Writes `current_run` metadata (started_at_ms, pid) to status file before sync for stale-run detection; clears it on completion
- Executes sync stage-by-stage, records per-stage outcomes with `error_code` propagation
- Classifies: success / degraded / failed
- Respects server-provided `Retry-After` hints when computing backoff (via `extract_retry_after_hint`)
- Circuit breaker check on transient failure count; records `circuit_breaker_paused_at_ms` for cooldown
- Half-open probe: if probe succeeds, auto-closes circuit breaker; if fails, returns to paused with new timestamp
- Performs log rotation check before executing sync
- Updates status atomically
- Always robot mode, always exit 0

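The backoff arithmetic exercised by the unit tests in this plan (exponential doubling from the configured interval, full jitter, 4-hour cap, base-interval floor, `Retry-After` override) reduces to the sketch below; the function and constant names are illustrative, not the final API:

```rust
/// Cap assumed from the plan: backoff never exceeds 4 hours.
const BACKOFF_CAP_SECONDS: u64 = 4 * 60 * 60;

/// Compute the absolute retry deadline. `jitter` comes from a
/// `JitterRng` in [0.0, 1.0]; `retry_after_hint_ms` is an absolute
/// deadline parsed from a server `Retry-After` header, if any.
fn compute_next_retry_ms(
    now_ms: i64,
    base_interval_seconds: u64,
    consecutive_failures: u32,
    jitter: f64,
    retry_after_hint_ms: Option<i64>,
) -> i64 {
    // Double per failure, capping the exponent to avoid overflow.
    let exp = consecutive_failures.saturating_sub(1).min(16);
    let doubled = base_interval_seconds
        .saturating_mul(1u64 << exp)
        .min(BACKOFF_CAP_SECONDS);
    let jittered = (doubled as f64 * jitter) as u64;
    // Never retry sooner than the regular interval.
    let backoff_seconds = jittered.max(base_interval_seconds);
    let computed = now_ms + backoff_seconds as i64 * 1000;
    // A server hint can only push the deadline later, never earlier.
    retry_after_hint_ms.map_or(computed, |hint| computed.max(hint))
}
```

Full jitter (multiplying the doubled window by a uniform value) spreads retries from many installs across the window instead of synchronizing them.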
### `src/cli/commands/service/list.rs` (NEW)

- `handle_list(robot_mode, start) -> Result<()>`
- Scans `{data_dir}` for `service-manifest-*.json` files
- Reads each manifest, verifies platform state, detects drift
- Outputs summary in robot JSON or human-readable table

### `src/cli/commands/service/pause.rs` (NEW)

- `handle_pause(service_selector, reason, robot_mode, start) -> Result<()>`
- Resolves service, writes `paused_reason` to status file
- Does NOT modify OS scheduler (service stays installed and scheduled — it just no-ops)
- Reports `already_paused: true` if already paused (updates reason)

### `src/cli/commands/service/trigger.rs` (NEW)

- `handle_trigger(service_selector, ignore_backoff, robot_mode, start) -> Result<()>`
- Resolves service, reads manifest for profile
- Delegates to `handle_service_run` logic with optional backoff bypass
- Still respects paused state (use `resume` first)

### `src/cli/commands/service/repair.rs` (NEW)

- `handle_repair(service_selector, robot_mode, start) -> Result<()>`
- Validates manifest and status files for JSON parseability
- Corrupt files: renamed to `{name}.corrupt.{timestamp}` (backup, never delete)
- Status file: reinitialized to default
- Manifest: cleared, advises reinstall
- Reports what was repaired

### `src/cli/commands/service/platform/mod.rs` (NEW)

- `#[cfg]`-gated imports and dispatch functions (all take `service_id`)
- `fn xml_escape(s: &str) -> String` helper (used by launchd)
- `fn run_cmd(program, args, timeout_secs) -> Result<String>` — shared command runner with kill+reap on timeout
- `fn wait_with_timeout_kill_and_reap(child, timeout_secs) -> Result<Output>` — timeout handler that kills and reaps child process
- `fn write_token_env_file(data_dir, service_id, token_env_var, token_value) -> Result<PathBuf>` — token storage
- `fn write_wrapper_script(data_dir, service_id, binary_path, token_env_var, config_path) -> Result<PathBuf>` — macOS wrapper script for runtime env loading (mode 0700)
- `fn check_prerequisites() -> Vec<DiagnosticCheck>` — platform-specific pre-flight
- `fn write_atomic(path: &Path, content: &str) -> std::io::Result<()>` — shared atomic write helper (tmp + fsync(file) + rename + fsync(parent_dir) for power-loss durability)

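The tmp + fsync + rename + parent-fsync sequence can be sketched as below. The temp-file naming and error handling are illustrative; the directory fsync is POSIX-specific and treated as best-effort:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

/// Write to a sibling temp file, fsync it, rename over the target,
/// then fsync the parent directory so the rename itself survives
/// power loss (POSIX semantics).
fn write_atomic(path: &Path, content: &str) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    {
        let mut f = File::create(&tmp)?;
        f.write_all(content.as_bytes())?;
        f.sync_all()?; // flush data before the rename makes it visible
    }
    fs::rename(&tmp, path)?; // atomic replace of the destination
    if let Some(parent) = path.parent() {
        // Best-effort: directory fsync may be unsupported on some platforms.
        if let Ok(dir) = File::open(parent) {
            let _ = dir.sync_all();
        }
    }
    Ok(())
}
```

Readers racing with a writer therefore see either the old file or the new one, never a truncated half-write, which is what lets `service status` read the status file without taking the pipeline lock.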
### `src/cli/commands/service/platform/launchd.rs` (NEW, `#[cfg(target_os = "macos")]`)

- `fn plist_path(service_id: &str) -> PathBuf` — `~/Library/LaunchAgents/com.gitlore.sync.{service_id}.plist`
- `fn generate_plist(service_id, binary_path, config_path, interval_seconds, token_env_var, token_value, token_source, log_dir, data_dir) -> String` — generates plist with wrapper script (env-file) or direct invocation (embedded)
- `fn generate_plist_with_wrapper(service_id, wrapper_path, interval_seconds, log_dir) -> String` — env-file variant: ProgramArguments points to wrapper script
- `fn generate_plist_with_embedded(service_id, binary_path, config_path, interval_seconds, token_env_var, token_value, log_dir) -> String` — embedded variant: token in EnvironmentVariables
- `fn install(service_id, ...) -> Result<InstallResult>`
- `fn uninstall(service_id) -> Result<UninstallResult>`
- `fn is_installed(service_id) -> bool`
- `fn get_state(service_id) -> Option<String>`
- `fn get_interval_seconds(service_id) -> u64`
- `fn check_prerequisites() -> Vec<DiagnosticCheck>` — GUI session check
- Unit tests: `test_generate_plist_with_wrapper()` — verify wrapper path in ProgramArguments, no token in plist
- Unit tests: `test_generate_plist_with_embedded()` — verify token in EnvironmentVariables
- Unit tests: XML escaping, service_id in label

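Since plist values (binary paths, config paths, token values in the embedded variant) are interpolated into XML, the `xml_escape` helper has to cover the five XML-significant characters, replacing `&` first so later replacements are not double-escaped:

```rust
/// Escape a string for inclusion in plist XML text or attribute
/// content. `&` must be replaced first to avoid double-escaping.
fn xml_escape(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
        .replace('\'', "&apos;")
}
```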
### `src/cli/commands/service/platform/systemd.rs` (NEW, `#[cfg(target_os = "linux")]`)

- `fn unit_dir() -> PathBuf` — `~/.config/systemd/user/`
- `fn generate_service(service_id, binary_path, config_path, token_env_var, token_value, token_source, data_dir) -> String` — includes hardening directives
- `fn generate_timer(service_id, interval_seconds) -> String`
- `fn install(service_id, ...) -> Result<InstallResult>`
- `fn uninstall(service_id) -> Result<UninstallResult>`
- Same query functions as launchd (all scoped by service_id)
- `fn check_prerequisites() -> Vec<DiagnosticCheck>` — user manager + linger checks
- Unit test: `test_generate_service()` (both env-file and embedded, verify hardening), `test_generate_timer()`

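A possible shape for `generate_timer`, assuming `OnUnitActiveSec`-based repetition (fires a fixed delay after each service activation) and no `Persistent=` catch-up run; the exact directives here are assumptions, not the final unit:

```rust
/// Sketch of the user timer unit. `OnBootSec` delays the first run
/// after login; `OnUnitActiveSec` re-fires the matching service unit
/// a fixed interval after each activation.
fn generate_timer(service_id: &str, interval_seconds: u64) -> String {
    format!(
        "[Unit]\n\
         Description=lore scheduled sync ({service_id})\n\
         \n\
         [Timer]\n\
         OnBootSec=2min\n\
         OnUnitActiveSec={interval_seconds}s\n\
         \n\
         [Install]\n\
         WantedBy=timers.target\n"
    )
}
```

Interval-after-activation semantics pair naturally with the backoff design: the timer keeps firing on schedule, and `service run` itself decides whether the backoff window allows an actual sync.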
### `src/cli/commands/service/platform/schtasks.rs` (NEW, `#[cfg(target_os = "windows")]`)

- `fn install(service_id, ...) -> Result<InstallResult>`
- `fn uninstall(service_id) -> Result<UninstallResult>`
- Same query functions (scoped by service_id)
- `fn check_prerequisites() -> Vec<DiagnosticCheck>` — schtasks availability
- Note: `token_source: "system_env"` — token must be in system environment

---

## Testing Strategy

### Test Infrastructure

**Fake clock for deterministic time-dependent tests:**

```rust
/// Test clock with controllable time
struct FakeClock {
    now_ms: i64,
}

impl Clock for FakeClock {
    fn now_ms(&self) -> i64 {
        self.now_ms
    }
}
```

**Fake RNG for deterministic jitter tests:**

```rust
/// Test RNG that returns a predetermined sequence of values
struct FakeJitterRng {
    values: Vec<f64>,
    index: usize,
}

impl FakeJitterRng {
    fn new(values: Vec<f64>) -> Self {
        Self { values, index: 0 }
    }
}

impl JitterRng for FakeJitterRng {
    fn next_f64(&mut self) -> f64 {
        let val = self.values[self.index % self.values.len()];
        self.index += 1;
        val
    }
}
```

This eliminates all time- and randomness-dependent flakiness. Every test sets an explicit "now" and jitter value, then asserts exact results.

### Unit Tests (in `src/core/sync_status.rs`)

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use tempfile::TempDir;

    struct FakeClock { now_ms: i64 }
    impl Clock for FakeClock {
        fn now_ms(&self) -> i64 { self.now_ms }
    }

    struct FakeJitterRng { value: f64 }
    impl FakeJitterRng {
        fn new(value: f64) -> Self {
            Self { value }
        }
    }

    impl JitterRng for FakeJitterRng {
        fn next_f64(&mut self) -> f64 {
            self.value
        }
    }

    // --- Interval parsing ---

    #[test]
    fn parse_interval_valid_minutes() {
        assert_eq!(parse_interval("5m").unwrap(), 300);
        assert_eq!(parse_interval("15m").unwrap(), 900);
        assert_eq!(parse_interval("30m").unwrap(), 1800);
    }

    #[test]
    fn parse_interval_valid_hours() {
        assert_eq!(parse_interval("1h").unwrap(), 3600);
        assert_eq!(parse_interval("2h").unwrap(), 7200);
        assert_eq!(parse_interval("24h").unwrap(), 86400);
    }

    #[test]
    fn parse_interval_too_short() {
        assert!(parse_interval("1m").is_err());
        assert!(parse_interval("4m").is_err());
    }

    #[test]
    fn parse_interval_too_long() {
        assert!(parse_interval("25h").is_err());
    }

    #[test]
    fn parse_interval_invalid() {
        assert!(parse_interval("0m").is_err());
        assert!(parse_interval("abc").is_err());
        assert!(parse_interval("").is_err());
        assert!(parse_interval("m").is_err());
        assert!(parse_interval("10x").is_err());
        assert!(parse_interval("30s").is_err()); // seconds not supported
    }

    #[test]
    fn parse_interval_trims_whitespace() {
        assert_eq!(parse_interval(" 30m ").unwrap(), 1800);
    }

    // --- Status file persistence ---

    #[test]
    fn status_file_round_trip() {
        let dir = TempDir::new().unwrap();
        let path = dir.path().join("sync-status-test1234.json");

        let mut status = SyncStatusFile::default();
        let run = SyncRunRecord {
            timestamp_iso: "2026-02-09T10:30:00Z".to_string(),
            timestamp_ms: 1_770_609_000_000,
            duration_seconds: 12.5,
            outcome: "success".to_string(),
            stage_results: vec![
                StageResult { stage: "issues".into(), success: true, items_updated: 5, error: None },
                StageResult { stage: "mrs".into(), success: true, items_updated: 3, error: None },
            ],
            error_message: None,
        };
        status.record_run(run);
        status.write_atomic(&path).unwrap();

        let loaded = SyncStatusFile::read(&path).unwrap().unwrap();
        assert_eq!(loaded.last_run.as_ref().unwrap().outcome, "success");
        assert_eq!(loaded.last_run.as_ref().unwrap().stage_results.len(), 2);
        assert_eq!(loaded.consecutive_failures, 0);
        assert_eq!(loaded.recent_runs.len(), 1);
        assert_eq!(loaded.schema_version, 1);
    }

    #[test]
    fn status_file_read_missing_returns_ok_none() {
        let dir = TempDir::new().unwrap();
        let path = dir.path().join("nonexistent.json");
        assert!(SyncStatusFile::read(&path).unwrap().is_none());
    }

    #[test]
    fn status_file_read_corrupt_returns_err() {
        let dir = TempDir::new().unwrap();
        let path = dir.path().join("corrupt.json");
        std::fs::write(&path, "not valid json{{{").unwrap();
        assert!(SyncStatusFile::read(&path).is_err());
    }

    #[test]
    fn status_file_atomic_write_survives_crash() {
        // Verify no partial writes by checking file is valid JSON after write
        let dir = TempDir::new().unwrap();
        let path = dir.path().join("sync-status-test1234.json");
        let status = SyncStatusFile::default();
        status.write_atomic(&path).unwrap();
        // Read back and verify
        let loaded = SyncStatusFile::read(&path).unwrap().unwrap();
        assert_eq!(loaded.schema_version, 1);
    }

    #[test]
    fn record_run_caps_at_10() {
        let mut status = SyncStatusFile::default();
        for i in 0..15 {
            status.record_run(make_run(i * 1000, "success"));
        }
        assert_eq!(status.recent_runs.len(), 10);
    }

    #[test]
    fn default_status_has_no_last_run() {
        let status = SyncStatusFile::default();
        assert!(status.last_run.is_none());
    }

    // --- Backoff (deterministic via FakeClock + persisted next_retry_at_ms) ---

    #[test]
    fn backoff_returns_none_when_zero_failures() {
        let status = make_status("success", 0, 100_000);
        let clock = FakeClock { now_ms: 200_000 };
        assert!(status.backoff_remaining(&clock).is_none());
    }

    #[test]
    fn backoff_returns_none_when_no_next_retry() {
        let mut status = make_status("failed", 1, 100_000_000);
        status.next_retry_at_ms = None;
        let clock = FakeClock { now_ms: 200_000_000 };
        assert!(status.backoff_remaining(&clock).is_none());
    }

    #[test]
    fn backoff_active_within_window() {
        let mut status = make_status("failed", 1, 100_000_000);
        status.next_retry_at_ms = Some(100_000_000 + 1_800_000); // 30 min from now
        let clock = FakeClock { now_ms: 100_000_000 + 1000 }; // 1s after failure
        let remaining = status.backoff_remaining(&clock);
        assert!(remaining.is_some());
        assert_eq!(remaining.unwrap(), 1799);
    }

    #[test]
    fn backoff_expired() {
        let mut status = make_status("failed", 1, 100_000_000);
        status.next_retry_at_ms = Some(100_000_000 + 1_800_000);
        let clock = FakeClock { now_ms: 100_000_000 + 2_000_000 }; // past retry time
        assert!(status.backoff_remaining(&clock).is_none());
    }

    #[test]
    fn set_backoff_persists_next_retry() {
        let mut status = make_status("failed", 1, 100_000_000);
        let clock = FakeClock { now_ms: 100_000_000 };
        let mut rng = FakeJitterRng::new(0.5); // 0.5 for deterministic
        status.set_backoff(1800, &clock, &mut rng, None);
        assert!(status.next_retry_at_ms.is_some());
        // With jitter=0.5, backoff = max(1800*0.5, 1800) = 1800s
        let expected_ms = 100_000_000 + 1_800_000;
        assert_eq!(status.next_retry_at_ms.unwrap(), expected_ms);
    }

    #[test]
    fn set_backoff_caps_at_4_hours() {
        let mut status = make_status("failed", 20, 100_000_000);
        let clock = FakeClock { now_ms: 100_000_000 };
        let mut rng = FakeJitterRng::new(1.0); // max jitter
        status.set_backoff(1800, &clock, &mut rng, None);
        // Cap: 4h = 14400s, jitter=1.0: max(14400*1.0, 1800) = 14400
        let max_ms = 100_000_000 + 14_400_000;
        assert!(status.next_retry_at_ms.unwrap() <= max_ms);
    }

    #[test]
    fn set_backoff_minimum_is_base_interval() {
        let mut status = make_status("failed", 1, 100_000_000);
        let clock = FakeClock { now_ms: 100_000_000 };
        let mut rng = FakeJitterRng::new(0.0); // min jitter
        status.set_backoff(1800, &clock, &mut rng, None);
        // jitter=0.0: max(1800*0.0, 1800) = 1800 (minimum enforced)
        let expected_ms = 100_000_000 + 1_800_000;
        assert_eq!(status.next_retry_at_ms.unwrap(), expected_ms);
    }

    #[test]
    fn set_backoff_respects_retry_after_hint() {
        let mut status = make_status("failed", 1, 100_000_000);
        let clock = FakeClock { now_ms: 100_000_000 };
        let mut rng = FakeJitterRng::new(0.0); // min jitter => computed backoff = 1800s
        let hint = 100_000_000 + 3_600_000; // server says retry after 1 hour
        status.set_backoff(1800, &clock, &mut rng, Some(hint));
        // Hint (1h) > computed backoff (30m), so hint wins
        assert_eq!(status.next_retry_at_ms.unwrap(), hint);
    }

    #[test]
    fn set_backoff_ignores_hint_when_computed_is_larger() {
        let mut status = make_status("failed", 1, 100_000_000);
        let clock = FakeClock { now_ms: 100_000_000 };
        let mut rng = FakeJitterRng::new(0.0);
        let hint = 100_000_000 + 60_000; // server says retry after 1 minute
        status.set_backoff(1800, &clock, &mut rng, Some(hint));
        // Computed (30m) > hint (1m), so computed wins
        let expected_ms = 100_000_000 + 1_800_000;
        assert_eq!(status.next_retry_at_ms.unwrap(), expected_ms);
    }

    #[test]
    fn set_backoff_uses_configured_interval_not_hardcoded() {
        let mut status1 = make_status("failed", 1, 100_000_000);
        let mut status2 = make_status("failed", 1, 100_000_000);
        let clock = FakeClock { now_ms: 100_000_000 };
        let mut rng = FakeJitterRng::new(0.5);

        status1.set_backoff(300, &clock, &mut rng, None); // 5m base
        rng.value = 0.5; // reset
        status2.set_backoff(3600, &clock, &mut rng, None); // 1h base

        // 5m base should produce shorter backoff than 1h base
        assert!(status1.next_retry_at_ms.unwrap() < status2.next_retry_at_ms.unwrap());
    }

    #[test]
    fn backoff_skips_when_paused() {
        let mut status = make_status("failed", 3, 100_000_000);
        status.paused_reason = Some("AUTH_FAILED".to_string());
        status.next_retry_at_ms = Some(100_000_000 + 999_999_999);
        let clock = FakeClock { now_ms: 100_000_000 + 1000 };
        // Paused state is checked separately, backoff_remaining returns None
        assert!(status.backoff_remaining(&clock).is_none());
    }

    // --- Error classification ---

    #[test]
    fn permanent_errors_classified_correctly() {
        assert!(is_permanent_error(&ErrorCode::TokenNotSet));
        assert!(is_permanent_error(&ErrorCode::AuthFailed));
        assert!(is_permanent_error(&ErrorCode::ConfigNotFound));
        assert!(is_permanent_error(&ErrorCode::ConfigInvalid));
        assert!(is_permanent_error(&ErrorCode::MigrationFailed));
    }

    #[test]
    fn transient_errors_classified_correctly() {
        assert!(!is_permanent_error(&ErrorCode::NetworkError));
        assert!(!is_permanent_error(&ErrorCode::RateLimited));
        assert!(!is_permanent_error(&ErrorCode::DbLocked));
        assert!(!is_permanent_error(&ErrorCode::DbError));
        assert!(!is_permanent_error(&ErrorCode::InternalError));
|
|
}
|
|
|
|
// --- Stage-aware outcomes ---
|
|
|
|
#[test]
|
|
fn degraded_outcome_does_not_count_as_failure() {
|
|
// When core stages succeed but optional stages fail, consecutive_failures should reset
|
|
let mut status = make_status("failed", 3, 100_000_000);
|
|
status.next_retry_at_ms = Some(200_000_000);
|
|
|
|
// Simulate degraded outcome clearing failure state
|
|
status.consecutive_failures = 0;
|
|
status.next_retry_at_ms = None;
|
|
assert_eq!(status.consecutive_failures, 0);
|
|
assert!(status.next_retry_at_ms.is_none());
|
|
}
|
|
|
|
// --- Backoff (service run only, NOT manual sync) ---
|
|
// (Test degraded state by running with --profile full when Ollama is down)
|
|
// Embeddings should fail, but issues/MRs should succeed
|
|
#[test]
|
|
fn service_run_degraded_outcome_clears_failures() {
|
|
let mut status = make_status("failed", 3, 100_000_000);
|
|
status.consecutive_failures = 3;
|
|
status.next_retry_at_ms = Some(200_000_000);
|
|
|
|
// Simulate degraded outcome clearing failure state
|
|
status.consecutive_failures = 0;
|
|
status.next_retry_at_ms = None;
|
|
assert_eq!(status.consecutive_failures, 0);
|
|
assert!(status.next_retry_at_ms.is_none());
|
|
}
|
|
|
|
// --- Circuit breaker ---
|
|
#[test]
|
|
fn circuit_breaker_trips_at_threshold() {
|
|
let mut status = make_status("failed", 9, 100_000_000);
|
|
// Incrementing to 10 should trigger circuit breaker
|
|
status.consecutive_failures = status.consecutive_failures.saturating_add(1);
|
|
assert_eq!(status.consecutive_failures, 10);
|
|
// Caller would set paused_reason = "CIRCUIT_BREAKER"
|
|
}
|
|
|
|
// --- Paused state (permanent error) ---
|
|
#[test]
|
|
fn clear_paused_resets_all_fields() {
|
|
let mut status = make_status("failed", 5, 100_000_000);
|
|
status.paused_reason = Some("AUTH_FAILED: 401 Unauthorized".to_string());
|
|
status.last_error_code = Some("AUTH_FAILED".to_string());
|
|
status.last_error_message = Some("401 Unauthorized".to_string());
|
|
status.next_retry_at_ms = Some(200_000_000);
|
|
status.circuit_breaker_paused_at_ms = Some(100_000_000);
|
|
status.clear_paused();
|
|
assert!(status.paused_reason.is_none());
|
|
assert!(status.circuit_breaker_paused_at_ms.is_none());
|
|
assert!(status.last_error_code.is_none());
|
|
assert!(status.last_error_message.is_none());
|
|
assert!(status.next_retry_at_ms.is_none());
|
|
assert_eq!(status.consecutive_failures, 0);
|
|
}
|
|
|
|
#[test]
|
|
fn clear_paused_also_clears_circuit_breaker() {
|
|
let mut status = make_status("failed", 10, 100_000_000);
|
|
status.paused_reason = Some("CIRCUIT_BREAKER: 10 consecutive transient failures".to_string());
|
|
status.clear_paused();
|
|
assert!(status.paused_reason.is_none());
|
|
assert_eq!(status.consecutive_failures, 0);
|
|
}
|
|
|
|
fn make_run(ts_ms: i64, outcome: &str) -> SyncRunRecord {
|
|
SyncRunRecord {
|
|
timestamp_iso: String::new(),
|
|
timestamp_ms: ts_ms,
|
|
duration_seconds: 1.0,
|
|
outcome: outcome.to_string(),
|
|
stage_results: vec![],
|
|
error_message: if outcome == "failed" {
|
|
Some("test error".into())
|
|
} else {
|
|
None
|
|
},
|
|
}
|
|
}
|
|
|
|
fn make_stage_result(stage: &str, success: bool, error_code: Option<&str>) -> StageResult {
|
|
StageResult {
|
|
stage: stage.to_string(),
|
|
success,
|
|
items_updated: if success { 5 } else { 0 },
|
|
error: if success { None } else { Some("test error".into()) },
|
|
error_code: error_code.map(|s| s.to_string()),
|
|
}
|
|
}
|
|
|
|
fn make_status(outcome: &str, failures: u32, ts_ms: i64) -> SyncStatusFile {
|
|
let run = make_run(ts_ms, outcome);
|
|
SyncStatusFile {
|
|
schema_version: 1,
|
|
updated_at_iso: String::new(),
|
|
last_run: Some(run.clone()),
|
|
recent_runs: vec![run],
|
|
consecutive_failures: failures,
|
|
next_retry_at_ms: None,
|
|
paused_reason: None,
|
|
circuit_breaker_paused_at_ms: None,
|
|
last_error_code: None,
|
|
last_error_message: None,
|
|
current_run: None,
|
|
}
|
|
}
|
|
}
|
|
```
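The backoff tests above jointly pin down a specific computation: exponential growth on consecutive failures, a 4-hour cap, jitter that can only shorten the raw delay down to (never below) the base interval, and a server `Retry-After` hint that only ever extends the wait. A minimal sketch of a `set_backoff` satisfying those constraints — field and trait names follow the plan, but this is an assumed illustration, not the shipped implementation:

```rust
// Assumed sketch of the backoff rule the tests above exercise.
trait Clock { fn now_ms(&self) -> i64; }
trait JitterRng { fn jitter(&mut self) -> f64; } // factor in [0.0, 1.0]

struct FakeClock { now_ms: i64 }
impl Clock for FakeClock { fn now_ms(&self) -> i64 { self.now_ms } }

struct FakeJitterRng { value: f64 }
impl JitterRng for FakeJitterRng { fn jitter(&mut self) -> f64 { self.value } }

struct SyncStatus {
    consecutive_failures: u32,
    next_retry_at_ms: Option<i64>,
}

impl SyncStatus {
    fn set_backoff(
        &mut self,
        base_interval_s: i64,
        clock: &dyn Clock,
        rng: &mut dyn JitterRng,
        retry_after_hint_ms: Option<i64>,
    ) {
        const CAP_S: i64 = 4 * 3600; // hard cap: 4 hours
        // Exponential in (failures - 1): 1x, 2x, 4x, ... capped at 4h.
        let exp = self.consecutive_failures.saturating_sub(1).min(16);
        let raw_s = base_interval_s.saturating_mul(1i64 << exp).min(CAP_S);
        // Jitter scales the delay down, but never below the base interval.
        let jittered_s = ((raw_s as f64) * rng.jitter()).max(base_interval_s as f64) as i64;
        let computed_ms = clock.now_ms() + jittered_s * 1000;
        // A server Retry-After hint only extends the wait, never shortens it.
        self.next_retry_at_ms = Some(match retry_after_hint_ms {
            Some(hint) if hint > computed_ms => hint,
            _ => computed_ms,
        });
    }
}

fn main() {
    let clock = FakeClock { now_ms: 100_000_000 };

    // Minimum enforced: with jitter 0.0, exactly the base interval after `now`.
    let mut rng = FakeJitterRng { value: 0.0 };
    let mut status = SyncStatus { consecutive_failures: 1, next_retry_at_ms: None };
    status.set_backoff(1800, &clock, &mut rng, None);
    assert_eq!(status.next_retry_at_ms, Some(100_000_000 + 1_800_000));

    // Cap at 4h even with many failures and max jitter.
    let mut rng = FakeJitterRng { value: 1.0 };
    let mut status = SyncStatus { consecutive_failures: 20, next_retry_at_ms: None };
    status.set_backoff(1800, &clock, &mut rng, None);
    assert!(status.next_retry_at_ms.unwrap() <= 100_000_000 + 14_400_000);
}
```

Keeping `Clock` and `JitterRng` as injected traits is what makes the timing tests deterministic without sleeping.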
### Service Manifest Tests (in `src/core/service_manifest.rs`)

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use tempfile::TempDir;

    #[test]
    fn manifest_round_trip() {
        let dir = TempDir::new().unwrap();
        let path = dir.path().join("manifest.json");
        let manifest = ServiceManifest {
            schema_version: 1,
            service_id: "a1b2c3d4e5f6".to_string(),
            workspace_root: "/Users/x/projects/my-project".to_string(),
            installed_at_iso: "2026-02-09T10:00:00Z".to_string(),
            updated_at_iso: "2026-02-09T10:00:00Z".to_string(),
            platform: "launchd".to_string(),
            interval_seconds: 900,
            profile: "fast".to_string(),
            binary_path: "/usr/local/bin/lore".to_string(),
            config_path: None,
            token_source: "env_file".to_string(),
            token_env_var: "GITLAB_TOKEN".to_string(),
            service_files: vec!["/Users/x/Library/LaunchAgents/com.gitlore.sync.a1b2c3d4e5f6.plist".to_string()],
            sync_command: "/usr/local/bin/lore --robot service run".to_string(),
            max_transient_failures: 10,
            circuit_breaker_cooldown_seconds: 1800,
            spec_hash: "abc123def456".to_string(),
        };
        manifest.write_atomic(&path).unwrap();
        let loaded = ServiceManifest::read(&path).unwrap().unwrap();
        assert_eq!(loaded.profile, "fast");
        assert_eq!(loaded.interval_seconds, 900);
        assert_eq!(loaded.service_id, "a1b2c3d4e5f6");
        assert_eq!(loaded.max_transient_failures, 10);
        assert_eq!(loaded.circuit_breaker_cooldown_seconds, 1800);
    }

    #[test]
    fn manifest_read_missing_returns_ok_none() {
        let dir = TempDir::new().unwrap();
        assert!(ServiceManifest::read(&dir.path().join("nope.json")).unwrap().is_none());
    }

    #[test]
    fn manifest_read_corrupt_returns_err() {
        let dir = TempDir::new().unwrap();
        let path = dir.path().join("bad.json");
        std::fs::write(&path, "{{{{").unwrap();
        assert!(ServiceManifest::read(&path).is_err());
    }

    #[test]
    fn profile_to_sync_args_fast() {
        let m = make_manifest("fast");
        assert_eq!(m.profile_to_sync_args(), vec!["--no-docs", "--no-embed"]);
    }

    #[test]
    fn profile_to_sync_args_balanced() {
        let m = make_manifest("balanced");
        assert_eq!(m.profile_to_sync_args(), vec!["--no-embed"]);
    }

    #[test]
    fn profile_to_sync_args_full() {
        let m = make_manifest("full");
        assert!(m.profile_to_sync_args().is_empty());
    }

    #[test]
    fn compute_service_id_deterministic() {
        let urls = ["https://gitlab.com/group/repo"];
        let id1 = compute_service_id(Path::new("/home/user/project"), Path::new("/home/user/.config/lore/config.json"), &urls);
        let id2 = compute_service_id(Path::new("/home/user/project"), Path::new("/home/user/.config/lore/config.json"), &urls);
        assert_eq!(id1, id2);
        assert_eq!(id1.len(), 12);
    }

    #[test]
    fn compute_service_id_different_workspaces() {
        let urls = ["https://gitlab.com/group/repo"];
        let config = Path::new("/home/user/.config/lore/config.json");
        let id1 = compute_service_id(Path::new("/home/user/project-a"), config, &urls);
        let id2 = compute_service_id(Path::new("/home/user/project-b"), config, &urls);
        assert_ne!(id1, id2); // Same config, different workspace => different IDs
    }

    #[test]
    fn compute_service_id_different_configs() {
        let urls = ["https://gitlab.com/group/repo"];
        let workspace = Path::new("/home/user/project");
        let id1 = compute_service_id(workspace, Path::new("/home/user1/config.json"), &urls);
        let id2 = compute_service_id(workspace, Path::new("/home/user2/config.json"), &urls);
        assert_ne!(id1, id2);
    }

    #[test]
    fn compute_service_id_different_projects_same_config() {
        let workspace = Path::new("/home/user/project");
        let config = Path::new("/home/user/.config/lore/config.json");
        let id1 = compute_service_id(workspace, config, &["https://gitlab.com/group/repo-a"]);
        let id2 = compute_service_id(workspace, config, &["https://gitlab.com/group/repo-b"]);
        assert_ne!(id1, id2); // Same config path, different projects => different IDs
    }

    #[test]
    fn compute_service_id_url_order_independent() {
        let workspace = Path::new("/home/user/project");
        let config = Path::new("/config.json");
        let id1 = compute_service_id(workspace, config, &["https://gitlab.com/a", "https://gitlab.com/b"]);
        let id2 = compute_service_id(workspace, config, &["https://gitlab.com/b", "https://gitlab.com/a"]);
        assert_eq!(id1, id2); // Order should not matter (sorted internally)
    }

    #[test]
    fn sanitize_service_name_valid() {
        assert_eq!(sanitize_service_name("my-project").unwrap(), "my-project");
        assert_eq!(sanitize_service_name("MyProject").unwrap(), "myproject");
    }

    #[test]
    fn sanitize_service_name_special_chars() {
        assert_eq!(sanitize_service_name("my project!").unwrap(), "my-project-");
    }

    #[test]
    fn sanitize_service_name_empty_rejects() {
        assert!(sanitize_service_name("---").is_err());
        assert!(sanitize_service_name("").is_err());
    }

    #[test]
    fn sanitize_service_name_too_long() {
        let long_name = "a".repeat(33);
        assert!(sanitize_service_name(&long_name).is_err());
    }

    fn make_manifest(profile: &str) -> ServiceManifest { /* ... */ }
}
```
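The `sanitize_service_name` tests above imply a simple rule: lowercase the input, map anything outside `[a-z0-9-]` to `-`, reject names that are empty or all dashes, and reject names longer than 32 characters. A hedged sketch consistent with those tests (the real signature and error type may differ):

```rust
// Assumed sketch of sanitize_service_name, matching the tests above.
fn sanitize_service_name(name: &str) -> Result<String, String> {
    if name.len() > 32 {
        return Err(format!("service name too long: {} chars (max 32)", name.len()));
    }
    // Lowercase, then replace every non [a-z0-9-] character with '-'.
    let sanitized: String = name
        .to_lowercase()
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '-' { c } else { '-' })
        .collect();
    // Reject names with no usable characters ("" and "---" both fail here).
    if sanitized.chars().all(|c| c == '-') {
        return Err("service name has no usable characters".to_string());
    }
    Ok(sanitized)
}

fn main() {
    assert_eq!(sanitize_service_name("my project!").unwrap(), "my-project-");
    assert!(sanitize_service_name("---").is_err());
}
```

Note that `"my project!"` keeps its trailing dash, matching `sanitize_service_name_special_chars` above; a stricter implementation could additionally trim leading/trailing dashes, but that would change the expected value in that test.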
### Platform-Specific Unit Tests

```rust
// In platform/launchd.rs
#[cfg(test)]
mod tests {
    use super::*;

    // --- Wrapper script variant (env-file, default) ---

    #[test]
    fn plist_wrapper_contains_scoped_label() {
        let plist = generate_plist_with_wrapper("abc123", Path::new("/data/service-run-abc123.sh"), 1800, Path::new("/tmp/logs"));
        assert!(plist.contains("<string>com.gitlore.sync.abc123</string>"));
    }

    #[test]
    fn plist_wrapper_invokes_wrapper_not_lore_directly() {
        let plist = generate_plist_with_wrapper("abc123", Path::new("/data/service-run-abc123.sh"), 1800, Path::new("/tmp/logs"));
        assert!(plist.contains("<string>/data/service-run-abc123.sh</string>"));
        // Should NOT contain direct lore invocation args
        assert!(!plist.contains("<string>--robot</string>"));
        assert!(!plist.contains("<string>service</string>"));
    }

    #[test]
    fn plist_wrapper_does_not_contain_token() {
        let plist = generate_plist_with_wrapper("abc123", Path::new("/data/service-run-abc123.sh"), 1800, Path::new("/tmp/logs"));
        assert!(!plist.contains("GITLAB_TOKEN"));
        assert!(!plist.contains("glpat"));
    }

    #[test]
    fn plist_wrapper_contains_interval() {
        let plist = generate_plist_with_wrapper("abc123", Path::new("/data/service-run-abc123.sh"), 900, Path::new("/tmp/logs"));
        assert!(plist.contains("<integer>900</integer>"));
    }

    // --- Embedded variant ---

    #[test]
    fn plist_embedded_contains_token() {
        let plist = generate_plist_with_embedded("abc123", "/usr/local/bin/lore", None, 1800, "GITLAB_TOKEN", "glpat-xxx", Path::new("/tmp/logs"));
        assert!(plist.contains("GITLAB_TOKEN"));
        assert!(plist.contains("glpat-xxx"));
    }

    #[test]
    fn plist_embedded_invokes_lore_directly() {
        let plist = generate_plist_with_embedded("abc123", "/usr/local/bin/lore", None, 1800, "GITLAB_TOKEN", "glpat-xxx", Path::new("/tmp/logs"));
        assert!(plist.contains("<string>--robot</string>"));
        assert!(plist.contains("<string>service</string>"));
        assert!(plist.contains("<string>run</string>"));
    }

    #[test]
    fn plist_embedded_xml_escapes_token() {
        let plist = generate_plist_with_embedded(
            "abc123", "/usr/local/bin/lore", None, 1800, "GITLAB_TOKEN", "tok&en<>", Path::new("/tmp/logs"),
        );
        assert!(plist.contains("tok&amp;en&lt;&gt;"));
        assert!(!plist.contains("tok&en<>"));
    }

    #[test]
    fn plist_xml_escapes_paths_with_special_chars() {
        let plist = generate_plist_with_embedded(
            "abc123", "/Users/O'Brien/bin/lore", None, 1800, "GITLAB_TOKEN", "glpat-xxx",
            Path::new("/tmp/logs"),
        );
        assert!(plist.contains("O'Brien"));
    }

    // --- Shared plist properties ---

    #[test]
    fn plist_has_background_process_type() {
        let plist = generate_plist_with_wrapper("abc123", Path::new("/data/service-run-abc123.sh"), 1800, Path::new("/tmp/logs"));
        assert!(plist.contains("<string>Background</string>"));
        assert!(plist.contains("<integer>10</integer>")); // Nice
    }

    #[test]
    fn plist_embedded_includes_config_path_when_provided() {
        let plist = generate_plist_with_embedded("abc123", "/usr/local/bin/lore", Some("/custom/config.json"), 1800, "GITLAB_TOKEN", "glpat-xxx", Path::new("/tmp/logs"));
        assert!(plist.contains("LORE_CONFIG_PATH"));
        assert!(plist.contains("/custom/config.json"));
    }
}

// In platform/systemd.rs
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn service_unit_contains_hardening() {
        let unit = generate_service("abc123", "/usr/local/bin/lore", None, "GITLAB_TOKEN", "glpat-xxx", "env-file", Path::new("/data"));
        assert!(unit.contains("NoNewPrivileges=true"));
        assert!(unit.contains("PrivateTmp=true"));
        assert!(unit.contains("ProtectSystem=strict"));
        assert!(unit.contains("ProtectHome=read-only"));
        assert!(unit.contains("TimeoutStartSec=900"));
        assert!(unit.contains("WorkingDirectory=/data"));
        assert!(unit.contains("SuccessExitStatus=0"));
    }

    #[test]
    fn service_unit_env_file_mode() {
        let unit = generate_service("abc123", "/usr/local/bin/lore", None, "GITLAB_TOKEN", "glpat-xxx", "env-file", Path::new("/data"));
        assert!(unit.contains("EnvironmentFile=/data/service-env-abc123"));
        assert!(!unit.contains("Environment=GITLAB_TOKEN="));
    }

    #[test]
    fn service_unit_embedded_mode() {
        let unit = generate_service("abc123", "/usr/local/bin/lore", None, "GITLAB_TOKEN", "glpat-xxx", "embedded", Path::new("/data"));
        assert!(unit.contains("Environment=GITLAB_TOKEN=glpat-xxx"));
        assert!(!unit.contains("EnvironmentFile="));
    }

    #[test]
    fn timer_unit_contains_scoped_description() {
        let timer = generate_timer("abc123", 900);
        assert!(timer.contains("abc123"));
        assert!(timer.contains("OnUnitInactiveSec=900s"));
    }
}
```
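The plist tests above rely on the shared `xml_escape` helper from `platform/mod.rs`. A minimal sketch of what that helper has to do — escape the five XML special characters so tokens and paths cannot break (or inject into) the generated plist (signature assumed):

```rust
// Assumed sketch of xml_escape: the five characters that must be escaped in
// XML text and attribute content ('>' is optional in text but harmless).
fn xml_escape(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&apos;"),
            _ => out.push(c),
        }
    }
    out
}

fn main() {
    assert_eq!(xml_escape("tok&en<>"), "tok&amp;en&lt;&gt;");
    assert_eq!(xml_escape("plain"), "plain");
}
```

Escaping `&` first is not required in this per-character formulation, but in replace-based implementations `&` must be handled before the other entities to avoid double-escaping.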
### Integration Tests (CLI parsing)

```rust
// In service/mod.rs
#[cfg(test)]
mod tests {
    use clap::Parser;
    use crate::cli::Cli;

    #[test]
    fn parse_service_install_default() {
        let cli = Cli::try_parse_from(["lore", "service", "install"]).unwrap();
        match cli.command {
            Some(Commands::Service { command: ServiceCommand::Install { interval, profile, token_source, name } }) => {
                assert_eq!(interval, "30m");
                assert_eq!(profile, "balanced");
                assert_eq!(token_source, "env-file");
                assert!(name.is_none());
            }
            _ => panic!("Expected Service Install"),
        }
    }

    #[test]
    fn parse_service_install_all_flags() {
        let cli = Cli::try_parse_from([
            "lore", "service", "install",
            "--interval", "1h",
            "--profile", "fast",
            "--token-source", "embedded",
            "--name", "my-project",
        ]).unwrap();
        match cli.command {
            Some(Commands::Service { command: ServiceCommand::Install { interval, profile, token_source, name } }) => {
                assert_eq!(interval, "1h");
                assert_eq!(profile, "fast");
                assert_eq!(token_source, "embedded");
                assert_eq!(name.as_deref(), Some("my-project"));
            }
            _ => panic!("Expected Service Install"),
        }
    }

    #[test]
    fn parse_service_uninstall() {
        let cli = Cli::try_parse_from(["lore", "service", "uninstall"]).unwrap();
        assert!(matches!(
            cli.command,
            Some(Commands::Service { command: ServiceCommand::Uninstall })
        ));
    }

    #[test]
    fn parse_service_status() {
        let cli = Cli::try_parse_from(["lore", "service", "status"]).unwrap();
        assert!(matches!(
            cli.command,
            Some(Commands::Service { command: ServiceCommand::Status })
        ));
    }

    #[test]
    fn parse_service_logs_default() {
        let cli = Cli::try_parse_from(["lore", "service", "logs"]).unwrap();
        assert!(matches!(
            cli.command,
            Some(Commands::Service { command: ServiceCommand::Logs { .. } })
        ));
    }

    #[test]
    fn parse_service_logs_with_tail() {
        let _cli = Cli::try_parse_from(["lore", "service", "logs", "--tail", "50"]).unwrap();
        // Placeholder: assert the parsed tail value once Logs fields are defined
    }

    #[test]
    fn parse_service_resume() {
        let cli = Cli::try_parse_from(["lore", "service", "resume"]).unwrap();
        assert!(matches!(
            cli.command,
            Some(Commands::Service { command: ServiceCommand::Resume })
        ));
    }

    #[test]
    fn parse_service_doctor() {
        let cli = Cli::try_parse_from(["lore", "service", "doctor"]).unwrap();
        assert!(matches!(
            cli.command,
            Some(Commands::Service { command: ServiceCommand::Doctor { .. } })
        ));
    }

    #[test]
    fn parse_service_doctor_offline() {
        let _cli = Cli::try_parse_from(["lore", "service", "doctor", "--offline"]).unwrap();
        // Placeholder: assert the parsed offline flag once Doctor fields are defined
    }

    #[test]
    fn parse_service_run_hidden() {
        let cli = Cli::try_parse_from(["lore", "service", "run"]).unwrap();
        assert!(matches!(
            cli.command,
            Some(Commands::Service { command: ServiceCommand::Run })
        ));
    }
}
```
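The `--interval` values parsed above ("30m", "1h") flow into the `parse_interval` helper that Phase 1 places in `src/core/sync_status.rs`. A hedged sketch of one plausible shape — suffix-based parsing to seconds; the real helper may accept more units or enforce minimum/maximum bounds:

```rust
// Assumed sketch of parse_interval: "900s" / "30m" / "1h" -> seconds.
fn parse_interval(s: &str) -> Result<u64, String> {
    let s = s.trim();
    // Split the trailing unit character from the numeric prefix.
    let (num, unit) = s.split_at(s.len().saturating_sub(1));
    let n: u64 = num.parse().map_err(|_| format!("invalid interval: {s}"))?;
    match unit {
        "s" => Ok(n),
        "m" => Ok(n * 60),
        "h" => Ok(n * 3600),
        _ => Err(format!("invalid interval unit: {s}")),
    }
}

fn main() {
    assert_eq!(parse_interval("30m").unwrap(), 1800);
    assert_eq!(parse_interval("1h").unwrap(), 3600);
    assert!(parse_interval("30").is_err()); // bare numbers rejected in this sketch
}
```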
### Behavioral Tests (service run isolation)

```rust
// Verify that the manual sync path is NOT affected by service state
#[test]
fn manual_sync_ignores_backoff_state() {
    // Create a status file with active backoff
    let dir = TempDir::new().unwrap();
    let status_path = dir.path().join("sync-status-test1234.json");
    let mut status = make_status("failed", 5, chrono::Utc::now().timestamp_millis());
    status.next_retry_at_ms = Some(chrono::Utc::now().timestamp_millis() + 999_999_999);
    status.write_atomic(&status_path).unwrap();

    // handle_sync_cmd should NOT read this file at all
    // (verified by the absence of any backoff check in handle_sync_cmd)
}

// Verify service run respects paused state
#[test]
fn service_run_respects_paused_state() {
    let mut status = SyncStatusFile::default();
    status.paused_reason = Some("AUTH_FAILED".to_string());
    // handle_service_run should check paused_reason BEFORE backoff
    // and exit with action: "paused"
}

// Verify degraded outcome clears the failure counter
#[test]
fn service_run_degraded_clears_failures() {
    let mut status = make_status("failed", 3, 100_000_000);
    status.next_retry_at_ms = Some(200_000_000);
    // After a degraded run (core OK, optional failed):
    status.consecutive_failures = 0;
    status.next_retry_at_ms = None;
    assert_eq!(status.consecutive_failures, 0);
}

// Verify circuit breaker trips at threshold
#[test]
fn service_run_circuit_breaker_trips() {
    let mut status = make_status("failed", 9, 100_000_000);
    status.consecutive_failures = status.consecutive_failures.saturating_add(1);
    // At 10 failures, should set paused_reason
    if status.consecutive_failures >= 10 {
        status.paused_reason = Some("CIRCUIT_BREAKER".to_string());
    }
    assert!(status.paused_reason.is_some());
}
```
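Phase 1 also lists an `is_circuit_breaker_half_open` helper that these behavioral tests gesture at: after the circuit breaker trips, a single probe run is allowed once the cooldown elapses. A hedged sketch of that check, assuming the `circuit_breaker_paused_at_ms` timestamp and the manifest's `circuit_breaker_cooldown_seconds` as inputs:

```rust
// Assumed sketch: true once the cooldown has elapsed since the breaker
// tripped, meaning the next service run may proceed as a half-open probe.
fn is_circuit_breaker_half_open(
    paused_at_ms: Option<i64>,
    cooldown_seconds: i64,
    now_ms: i64,
) -> bool {
    match paused_at_ms {
        Some(paused_at) => now_ms - paused_at >= cooldown_seconds * 1000,
        None => false, // breaker never tripped => nothing to probe
    }
}

fn main() {
    // Tripped at t=0 with a 1800s cooldown: half-open exactly at t=1800s.
    assert!(!is_circuit_breaker_half_open(Some(0), 1800, 1_799_999));
    assert!(is_circuit_breaker_half_open(Some(0), 1800, 1_800_000));
    assert!(!is_circuit_breaker_half_open(None, 1800, 1_800_000));
}
```

On a successful probe the caller would `clear_paused()`; on failure it would refresh `circuit_breaker_paused_at_ms` to restart the cooldown.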
---

## New Dependencies

**Two new crates:**

| Crate | Version | Purpose | Justification |
|-------|---------|---------|---------------|
| `sha2` | `0.10` | Compute `service_id` from workspace root, config path, and project URLs | Small, well-audited, no-std compatible. Used for exactly one hash computation. |
| `hex` | `0.4` | Encode hash bytes to a hex string | Tiny utility, widely used. |

> **Note on `rand`:** The `JitterRng` trait uses `rand::thread_rng()` in production. Check whether `rand` is already a transitive dependency (via other crates). If so, add it as a direct dependency. If not, consider a simpler PRNG or system randomness via `getrandom` to avoid pulling in the full `rand` crate for a single call site. The `JitterRng` trait abstracts this, so the implementation can change without affecting the API.

**Existing dependencies used:**

- `std::process::Command` — for launchctl, systemctl, schtasks
- `format!()` — for plist XML and systemd unit templates
- `std::env::current_exe()` — for binary path resolution
- `serde` + `serde_json` (existing) — for status/manifest files
- `chrono` (existing) — for timestamps
- `dirs` (existing) — for home directory
- `libc` (existing, unix only) — for `getuid()`
- `console` (existing) — for colored human output
- `tempfile` (existing, dev dep) — for test temp dirs
---

## Implementation Order

### Phase 1: Core types (standalone, fully testable)

1. `Cargo.toml` — add `sha2`, `hex` dependencies (and `rand` if not already transitive)
2. `src/core/sync_status.rs` — `SyncRunRecord`, `StageResult` (with `error_code`), `SyncStatusFile` (with `circuit_breaker_paused_at_ms`, `current_run`), `CurrentRunState`, `Clock` trait, `JitterRng` trait, `parse_interval`, `is_permanent_error`, `is_permanent_stage_error`, `is_circuit_breaker_half_open`, `extract_retry_after_hint`, atomic write helper, schema migration on read, all unit tests
3. `src/core/service_manifest.rs` — `ServiceManifest` (with `circuit_breaker_cooldown_seconds`, `workspace_root`, `spec_hash`), `DiagnosticCheck`, `DiagnosticStatus`, `compute_service_id(workspace_root, config_path, project_urls)`, `sanitize_service_name`, `compute_spec_hash(service_files_content)`, profile mapping, atomic write helper, schema migration on read, unit tests
4. `src/core/error.rs` — add `ServiceError`, `ServiceUnsupported`, `ServiceCommandFailed`, `ServiceCorruptState`
5. `src/core/paths.rs` — add `get_service_status_path(service_id)`, `get_service_manifest_path(service_id)`, `get_service_env_path(service_id)`, `get_service_wrapper_path(service_id)`, `get_service_log_path(service_id, stream)`, `list_service_ids()`
6. `src/core/mod.rs` — add `pub mod sync_status; pub mod service_manifest;`

### Phase 2: Platform backends (parallelizable across platforms)

7. `src/cli/commands/service/platform/mod.rs` — dispatch functions (with `service_id`), `run_cmd` (with kill+reap on timeout), `wait_with_timeout_kill_and_reap`, `xml_escape`, `write_token_env_file`, `write_wrapper_script`, `write_atomic`, `check_prerequisites`
8. `src/cli/commands/service/platform/launchd.rs` — macOS backend with wrapper script (env-file) and embedded variants, project-scoped label + prerequisite checks + tests
9. `src/cli/commands/service/platform/systemd.rs` — Linux backend with hardened unit (WorkingDirectory, SuccessExitStatus), project-scoped names, linger/user-manager checks + tests
10. `src/cli/commands/service/platform/schtasks.rs` — Windows backend with project-scoped task name

### Phase 3: Command handlers

11. `src/cli/commands/service/doctor.rs` — pre-flight diagnostic checks (used by install and standalone)
12. `src/cli/commands/service/install.rs` — install handler with transactional ordering (enable then manifest), wrapper script generation, doctor pre-flight, service_id
13. `src/cli/commands/service/uninstall.rs` — uninstall handler with `--service`/`--all` selectors (removes manifest + env file + wrapper script)
14. `src/cli/commands/service/list.rs` — list handler (scans data_dir for manifests, verifies platform state)
15. `src/cli/commands/service/status.rs` — status handler with scheduler state including `degraded` and `half_open`
16. `src/cli/commands/service/logs.rs` — logs handler with default tail output, `--open` for editor, `--follow`, log rotation check
17. `src/cli/commands/service/resume.rs` — resume handler (clears paused + circuit breaker)
18. `src/cli/commands/service/pause.rs` — pause handler (sets manual pause reason)
19. `src/cli/commands/service/trigger.rs` — trigger handler (immediate run with optional backoff bypass)
20. `src/cli/commands/service/repair.rs` — repair handler (backup corrupt files, reinitialize)
21. `src/cli/commands/service/run.rs` — hidden scheduled execution entrypoint with stage-aware execution, circuit breaker, half-open probe, log rotation
22. `src/cli/commands/service/mod.rs` — re-exports + `resolve_service_id` helper

### Phase 4: CLI wiring

23. `src/cli/mod.rs` — `ServiceCommand` in `Commands` enum (with all new subcommands and flags)
24. `src/cli/commands/mod.rs` — `pub mod service;`
25. `src/main.rs` — dispatch + pipeline lock in `handle_sync_cmd` + robot-docs manifest
26. `src/cli/autocorrect.rs` — add service entry with all flags

### Phase 5: Verification

27. `cargo check --all-targets && cargo clippy --all-targets -- -D warnings && cargo test && cargo fmt --check`

---

## Verification Checklist
|
|
|
|
```bash
|
|
# Build and lint
|
|
cargo check --all-targets
|
|
cargo clippy --all-targets -- -D warnings
|
|
cargo fmt --check
|
|
|
|
# Run all tests
|
|
cargo test
|
|
|
|
# --- Doctor (run first to verify prerequisites) ---
|
|
cargo run --release -- service doctor
|
|
cargo run --release -- -J service doctor | jq '.data.overall' # should show "pass" or "warn"
|
|
cargo run --release -- -J service doctor --offline | jq .
|
|
cargo run --release -- -J service doctor --fix | jq '.data.checks[] | select(.status == "fixed")'
|
|
|
|
# --- Dry-run install (should write nothing) ---
|
|
cargo run --release -- -J service install --interval 15m --profile fast --dry-run | jq '.data.dry_run' # true
|
|
launchctl list | grep gitlore # should NOT be present
|
|
|
|
# --- Install (macOS) ---
|
|
cargo run --release -- service install --interval 15m --profile fast
|
|
launchctl list | grep gitlore
|
|
cargo run --release -- -J service status | jq '.data.service_id' # should show hash
|
|
cargo run --release -- service logs --tail 5
|
|
cargo run --release -- service uninstall
|
|
launchctl list | grep gitlore # should be gone
|
|
|
|
# Verify install with custom name
|
|
cargo run --release -- service install --interval 30m --name my-project
|
|
launchctl list | grep gitlore # should show com.gitlore.sync.my-project
|
|
cargo run --release -- -J service status | jq '.data.service_id' # "my-project"
|
|
cargo run --release -- service uninstall
|
|
|
|
# Verify install idempotency
|
|
cargo run --release -- -J service install --interval 30m
|
|
cargo run --release -- -J service install --interval 30m # should report no_change: true
|
|
cargo run --release -- -J service install --interval 15m # should report changes
|
|
cargo run --release -- service uninstall
|
|
|
|
# --- Service run (use `service trigger` for manual testing, or provide --service-id) ---
|
|
cargo run --release -- -J service install --interval 30m
|
|
SVC_ID=$(cargo run --release -- -J service status | jq -r '.data.service_id')
|
|
cargo run --release -- -J service trigger # preferred way to manually invoke a service run
|
|
cargo run --release -- -J service status | jq '.data.recent_runs' # should show the run
|
|
cargo run --release -- -J service status | jq '.data.last_sync.stage_results' # per-stage outcomes
|
|
|
|
# --- Stage-aware outcomes ---
|
|
# (Test degraded state by running with --profile full when Ollama is down)
|
|
# Embeddings should fail, but issues/MRs should succeed
|
|
cargo run --release -- -J service install --profile full
|
|
# Stop Ollama, then:
|
|
cargo run --release -- -J service run --service-id $SVC_ID| jq '.data.outcome' # "degraded"
|
|
cargo run --release -- -J service status | jq '.data.scheduler_state' # "degraded"
|
|
|
|
# --- Backoff (service run only, NOT manual sync) ---
|
|
# 1. Create a status file simulating failures
|
|
cat > ~/.local/share/lore/sync-status-a1b2c3d4e5f6.json << 'EOF'
|
|
{
|
|
"schema_version": 1,
|
|
"updated_at_iso": "2026-02-09T10:00:00Z",
|
|
"last_run": {"timestamp_iso":"2026-02-09T10:00:00Z","timestamp_ms":TIMESTAMP,"duration_seconds":1.0,"outcome":"failed","stage_results":[],"error_message":"test"},
|
|
"recent_runs": [],
|
|
"consecutive_failures": 3,
|
|
"next_retry_at_ms": FUTURE_MS,
|
|
"paused_reason": null,
|
|
"last_error_code": null,
|
|
"last_error_message": null,
|
|
"circuit_breaker_paused_at_ms": null
|
|
}
|
|
EOF
|
|
# Replace timestamps: sed -i '' "s/TIMESTAMP/$(date +%s)000/;s/FUTURE_MS/$(($(date +%s)*1000 + 3600000))/" ~/.local/share/lore/sync-status-a1b2c3d4e5f6.json
|
|
|
|
# 2. Service run should skip (backoff)
|
|
cargo run --release -- -J service run --service-id $SVC_ID| jq '.data.action' # "skipped"
|
|
|
|
# 3. Manual sync should NOT be affected
|
|
cargo run --release -- sync # should proceed normally
|
|
|
|
# --- Paused state (permanent error) ---
|
|
cat > ~/.local/share/lore/sync-status-a1b2c3d4e5f6.json << 'EOF'
|
|
{
|
|
"schema_version": 1,
|
|
"updated_at_iso": "2026-02-09T10:00:00Z",
|
|
"last_run": {"timestamp_iso":"2026-02-09T10:00:00Z","timestamp_ms":0,"duration_seconds":1.0,"outcome":"failed","stage_results":[],"error_message":"401 Unauthorized"},
|
|
"recent_runs": [],
|
|
"consecutive_failures": 1,
|
|
"next_retry_at_ms": null,
|
|
"paused_reason": "AUTH_FAILED: 401 Unauthorized",
|
|
"last_error_code": "AUTH_FAILED",
|
|
"last_error_message": "401 Unauthorized",
|
|
"circuit_breaker_paused_at_ms": null
|
|
}
|
|
EOF
|
|
|
|
# Service run should report paused
|
|
cargo run --release -- -J service run --service-id $SVC_ID| jq '.data.action' # "paused"
|
|
cargo run --release -- -J service status | jq '.data.paused_reason' # "AUTH_FAILED"
|
|
|
|
# Resume clears the state
|
|
cargo run --release -- -J service resume | jq . # clears circuit breaker
|
|
|
|
# --- Circuit breaker ---
cat > ~/.local/share/lore/sync-status-a1b2c3d4e5f6.json << 'EOF'
{
"schema_version": 1,
"updated_at_iso": "2026-02-09T10:00:00Z",
"last_run": {"timestamp_iso":"2026-02-09T10:00:00Z","timestamp_ms":0,"duration_seconds":1.0,"outcome":"failed","stage_results":[],"error_message":"connection refused"},
"recent_runs": [],
"consecutive_failures": 10,
"next_retry_at_ms": null,
"paused_reason": "CIRCUIT_BREAKER: 10 consecutive transient failures",
"last_error_code": "TRANSIENT",
"last_error_message": "connection refused",
"circuit_breaker_paused_at_ms": 1770609000000
}
EOF
cargo run --release -- -J service run --service-id $SVC_ID | jq '.data.action'  # "paused"
cargo run --release -- -J service status | jq '.data.paused_reason'  # "CIRCUIT_BREAKER"
cargo run --release -- -J service resume | jq .  # clears circuit breaker

# --- Robot mode for all commands ---
cargo run --release -- -J service install --interval 30m | jq .
cargo run --release -- -J service list | jq .
cargo run --release -- -J service status | jq .
cargo run --release -- -J service logs --tail 10 | jq .
cargo run --release -- -J service doctor | jq .
cargo run --release -- -J service pause --reason "test" | jq .
cargo run --release -- -J service resume | jq .
cargo run --release -- -J service trigger | jq .
cargo run --release -- -J service repair | jq .
cargo run --release -- -J service uninstall | jq .

# --- New operational commands ---
cargo run --release -- -J service install --interval 30m
cargo run --release -- -J service pause --reason "maintenance"
cargo run --release -- -J service status | jq '.data.scheduler_state'  # "paused"
cargo run --release -- -J service run --service-id $SVC_ID | jq '.data.action'  # "paused"
cargo run --release -- -J service resume | jq .
cargo run --release -- -J service trigger | jq .  # immediate sync
cargo run --release -- -J service list | jq '.data.services'
cargo run --release -- service uninstall

# --- Token env file security (macOS/Linux) ---
cargo run --release -- service install --interval 30m
ls -la ~/.local/share/lore/service-env-*  # should show -rw------- permissions
# On macOS, verify wrapper script exists and token NOT in plist:
ls -la ~/.local/share/lore/service-run-*  # should show -rwx------ permissions
grep -c GITLAB_TOKEN ~/Library/LaunchAgents/com.gitlore.sync.*.plist  # should be 0 (env-file mode)
cargo run --release -- service uninstall
ls ~/.local/share/lore/service-env-*  # should be gone (uninstall removes it)
ls ~/.local/share/lore/service-run-*  # should be gone (uninstall removes wrapper)

# --- Manifest persistence ---
cargo run --release -- service install --interval 15m --profile full
cat ~/.local/share/lore/service-manifest-*.json | jq .  # should show manifest with service_id
cargo run --release -- service uninstall
ls ~/.local/share/lore/service-manifest-*  # should be gone

# --- Logs with tail/follow ---
cargo run --release -- service install --interval 30m
cargo run --release -- -J service run --service-id $SVC_ID  # generate some log output
cargo run --release -- service logs --tail 20  # show last 20 lines
# cargo run --release -- service logs --follow  # (interactive; Ctrl-C to stop)

# --- Uninstall cleanup ---
cargo run --release -- service install --interval 30m
cargo run --release -- -J service uninstall | jq '.data.removed_files'
# Verify status file and logs are kept
ls ~/.local/share/lore/sync-status-*.json  # should exist
ls ~/.local/share/lore/logs/  # should exist

# --- Repair command ---
# Corrupt a status file to test repair
echo "{{{" > ~/.local/share/lore/sync-status-test.json
cargo run --release -- -J service repair | jq .  # should back up and reinitialize

# --- Final cleanup ---
cargo run --release -- service uninstall 2>/dev/null
rm -f ~/.local/share/lore/sync-status-*.json
```
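
The three outcomes the script above checks for ("skipped" from backoff, "paused" from a permanent error, "paused" from the circuit breaker) can be sketched as a single decision function. This is a minimal sketch: the names, field shapes, and the threshold of 10 are assumptions read off the status-file fixtures above, not the plan's exact API.

```rust
// Sketch of the policy checks `lore service run` performs before delegating
// to the sync pipeline. Threshold of 10 matches the circuit-breaker fixture.
#[derive(Debug, PartialEq)]
enum Action {
    Run,     // proceed with the sync pipeline
    Skipped, // backoff window has not elapsed
    Paused,  // permanent error or circuit breaker
}

const CIRCUIT_BREAKER_THRESHOLD: u32 = 10;

fn decide(
    now_ms: u64,
    next_retry_at_ms: Option<u64>,
    paused_reason: Option<&str>,
    consecutive_transient_failures: u32,
) -> Action {
    // 1. An explicit pause (e.g. "AUTH_FAILED: 401 Unauthorized") wins.
    if paused_reason.is_some() {
        return Action::Paused;
    }
    // 2. Circuit breaker: too many consecutive transient failures.
    if consecutive_transient_failures >= CIRCUIT_BREAKER_THRESHOLD {
        return Action::Paused;
    }
    // 3. Exponential backoff: skip until the retry time passes.
    if let Some(retry_at) = next_retry_at_ms {
        if now_ms < retry_at {
            return Action::Skipped;
        }
    }
    Action::Run
}

fn main() {
    // Backoff fixture: 3 failures, retry time in the future -> "skipped".
    assert_eq!(decide(1_000, Some(2_000), None, 3), Action::Skipped);
    // Paused fixture: permanent AUTH_FAILED error -> "paused".
    assert_eq!(decide(1_000, None, Some("AUTH_FAILED: 401 Unauthorized"), 1), Action::Paused);
    // Circuit-breaker fixture: 10 consecutive transient failures -> "paused".
    assert_eq!(decide(1_000, None, None, 10), Action::Paused);
    // Healthy state, retry time already elapsed -> run.
    assert_eq!(decide(3_000, Some(2_000), None, 0), Action::Run);
    println!("ok");
}
```

Note that manual `lore sync` bypasses this function entirely, which is what step 3 of the backoff scenario verifies.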

---

## Rejected Recommendations

Recommendations from external reviewers that were considered and explicitly rejected. Kept here to prevent re-proposal.

- **Unified `SyncOrchestrator` for manual and scheduled sync** (feedback-4, rec 4) — rejected because manual and scheduled sync have fundamentally different policies (backoff/circuit-breaker vs. none). A shared orchestrator adds abstraction without clear benefit. The current approach (separate paths with shared pipeline lock) is simpler, correct, and avoids coupling the manual path to service-layer concerns. The two paths share the sync pipeline implementation itself; only the policy wrapper differs.
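
  The shape this rejection preserves can be sketched as two thin entrypoints over one shared pipeline; all names here are illustrative, not the plan's actual types:

  ```rust
  // Sketch (hypothetical names): one shared pipeline, two entrypoints.
  // Only the scheduled path wraps the pipeline in service policy.
  struct PipelineReport {
      ran: bool,
  }

  fn run_sync_pipeline() -> PipelineReport {
      // Stand-in for the shared 4-stage pipeline (issues, MRs, docs, embeddings).
      PipelineReport { ran: true }
  }

  // Manual path: `lore sync` calls the pipeline directly, with no policy checks.
  fn manual_sync() -> PipelineReport {
      run_sync_pipeline()
  }

  // Scheduled path: `lore service run` applies policy first, then delegates.
  fn service_run(paused: bool, in_backoff: bool) -> Option<PipelineReport> {
      if paused || in_backoff {
          return None; // reported as "paused" / "skipped" instead of running
      }
      Some(run_sync_pipeline())
  }

  fn main() {
      assert!(manual_sync().ran); // the manual path is never gated
      assert!(service_run(true, false).is_none()); // a paused service does not run
      assert!(service_run(false, false).unwrap().ran);
      println!("ok");
  }
  ```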

- **`auto` token strategy with secure-store (Keychain / libsecret / Credential Manager) as default** (feedback-2 rec 2, feedback-4 rec 7) — rejected because adding platform-specific secure store dependencies (`security-framework`, `libsecret`, `winapi`) is heavy for v1. The wrapper-script approach (already in the plan) keeps the token out of the plist safely on macOS. The plan notes secure-store as a future enhancement. The token validation fix (rejecting NUL/newline) from feedback-4 rec 7 was accepted separately.
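
  The separately accepted validation fix could look like the following sketch; the function name and error type are illustrative, not the plan's exact API:

  ```rust
  // Sketch: reject tokens containing NUL or newline bytes before writing
  // them into an env file or wrapper script, where they would corrupt the
  // KEY=VALUE line or the generated shell.
  fn validate_token(token: &str) -> Result<(), String> {
      if token.is_empty() {
          return Err("token is empty".into());
      }
      if token.contains('\0') || token.contains('\n') || token.contains('\r') {
          return Err("token contains NUL or newline bytes".into());
      }
      Ok(())
  }

  fn main() {
      assert!(validate_token("glpat-example").is_ok());
      assert!(validate_token("bad\ntoken").is_err()); // would break an env file
      assert!(validate_token("bad\0token").is_err());
      println!("ok");
  }
  ```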

- **Store service state in SQLite instead of JSON status file** (feedback-1, rec 2) — rejected because the status file is intentionally independent of the database. This avoids coupling service lifecycle to DB migrations, enables service operation when the DB is locked/corrupt, and keeps the service layer self-contained. The JSON file approach with atomic writes is adequate for single-writer status tracking.

- **`write_seq` and `content_sha256` integrity fields in manifest/status files** (feedback-4, rec 6 partial) — rejected because this is over-engineering for a status file that is written by a single process with atomic writes. The `service repair` command already handles corrupt files by backup+reinit. The fsync(parent_dir) improvement from rec 6 was accepted separately.
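
  The accepted part of this recommendation, an fsync of the parent directory after the atomic rename, might look like this Unix-only sketch (names are illustrative):

  ```rust
  // Sketch: atomic status-file write. Write to a temp file, fsync it, rename
  // over the target, then fsync the parent directory so the rename itself
  // survives a crash.
  use std::fs::{self, File};
  use std::io::Write;
  use std::path::Path;

  fn atomic_write(path: &Path, contents: &[u8]) -> std::io::Result<()> {
      let tmp = path.with_extension("tmp");
      let mut f = File::create(&tmp)?;
      f.write_all(contents)?;
      f.sync_all()?; // flush file data before the rename
      fs::rename(&tmp, path)?; // atomic replace on POSIX filesystems
      if let Some(parent) = path.parent() {
          // Durably record the new directory entry (Unix-only behavior).
          File::open(parent)?.sync_all()?;
      }
      Ok(())
  }

  fn main() -> std::io::Result<()> {
      let path = std::env::temp_dir().join("lore-status-demo.json");
      atomic_write(&path, b"{\"schema_version\":1}")?;
      assert_eq!(fs::read(&path)?, b"{\"schema_version\":1}".to_vec());
      fs::remove_file(&path)?;
      println!("ok");
      Ok(())
  }
  ```

  Readers either see the old complete file or the new complete file, which is why per-write checksums add little for a single-writer file.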

- **Use `nix` crate for safe UID access** (feedback-4, rec 8 partial) — rejected as a mandatory dependency because `getuid()` is trivially safe (no pointers, no mutation) and adding `nix` for a single call is disproportionate. A single-line safe wrapper with `#[allow(unsafe_code)]` is sufficient. If `nix` is already a dependency for other reasons, using it is fine.
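
  The single-line wrapper described here could look like this sketch, assuming `uid_t` is `u32` (true on Linux and macOS) and declaring the symbol directly to avoid pulling in `libc`:

  ```rust
  // Sketch: safe wrapper around getuid() without the `nix` or `libc` crates.
  #[cfg(unix)]
  extern "C" {
      fn getuid() -> u32; // uid_t is u32 on Linux and macOS
  }

  /// Safe wrapper: getuid() takes no arguments and cannot fail (POSIX).
  #[cfg(unix)]
  #[allow(unsafe_code)]
  fn current_uid() -> u32 {
      // SAFETY: no preconditions, no pointers, no mutation.
      unsafe { getuid() }
  }

  fn main() {
      #[cfg(unix)]
      {
          // Used, e.g., to build per-user scheduler paths like /run/user/<uid>/.
          let uid = current_uid();
          assert_eq!(uid, current_uid()); // stable within a process
          println!("uid = {}", uid);
      }
  }
  ```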

- **Mandatory dual-lock acquisition with strict ordering for uninstall/run races** (feedback-5, rec 2) — rejected because the existing plan already has admin lock for destructive ops and pipeline lock for runs. The race window (scheduler fires during uninstall) is tiny, the consequence is benign (service runs, finds no manifest, exits 0), and mandatory lock ordering with dual acquisition adds significant complexity. The plan's existing separation (admin lock for state mutations, pipeline lock for data writes) is sufficient.

- **Decoupled optional stage cadence from core sync interval** (feedback-5, rec 4) — rejected because separate freshness windows per stage (e.g., "docs every 60m, embeddings every 6h") add significant complexity: new config fields per stage, last-success tracking per stage, skip logic, and confusing profile semantics. The existing profile system already solves this more simply: use `fast` for frequent intervals (issues+MRs only), `balanced` or `full` for less frequent intervals that include heavier stages.

- **Windows env-file parity via wrapper script** (feedback-5, rec 5) — rejected because Windows Task Scheduler has fundamentally different environment handling than launchd/systemd. A wrapper `.cmd` or `.ps1` script introduces fragility (quoting, encoding, UAC edge cases, PowerShell execution policy) for marginal benefit. The current `system_env` approach is honest, works reliably, and Windows users are accustomed to system environment variables. Future Credential Manager integration (already noted as deferred) is the right long-term solution.

- **`--regenerate` flag on service repair** (feedback-5, rec 7 partial) — rejected because `lore service install` is already idempotent (detects existing manifest, overwrites if config differs). Regenerating scheduler artifacts is exactly what a re-install does. Adding `--regenerate` to repair creates a confusing second path to the same outcome. The `spec_hash` drift detection (accepted from this rec) gives users clear diagnostics; the remedy is simply `lore service install`.