Three implementation plans with iterative cross-model refinement:

- lore-service (5 iterations): HTTP service layer exposing lore's SQLite data via REST/SSE for integration with external tools (dashboards, IDE extensions, chat agents). Covers authentication, rate limiting, caching strategy, and webhook-driven sync triggers.
- work-item-status-graphql (7 iterations + TDD appendix): Detailed implementation plan for the GraphQL-based work item status enrichment feature (now implemented). Includes the TDD appendix with test-first development specifications covering GraphQL client, adaptive pagination, ingestion orchestration, CLI display, and robot mode output.
- time-decay-expert-scoring (iteration 5 feedback): Updates to the existing time-decay scoring plan incorporating feedback on decay curve parameterization, recency weighting for discussion contributions, and staleness detection thresholds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| plan | title | status | iteration | target_iterations | beads_revision | related_plans | created | updated |
|---|---|---|---|---|---|---|---|---|
| true | | iterating | 5 | 8 | 0 | | 2026-02-09 | 2026-02-11 |
Plan: lore service — OS-Native Scheduled Sync
Context
lore sync runs a 4-stage pipeline (issues, MRs, docs, embeddings) that takes 2-4 minutes. Today it must be invoked manually. We want lore service install to set up OS-native scheduled execution automatically, with exponential backoff on failures, a circuit breaker for persistent transient errors, stage-aware outcome tracking, and a status file for observability. This is the first nested subcommand in the project.
Key Design Principles
1. Separation of Manual and Scheduled Execution
lore sync remains the manual/operator command. It is never subject to backoff, pausing, or service-level policy. A separate hidden entrypoint — lore service run --service-id <id> — is what the OS scheduler actually invokes. This entrypoint applies service-specific policy (backoff, error classification, pipeline locking) before delegating to the sync pipeline. This separation ensures that a human running lore sync to debug or recover is never unexpectedly blocked by service state. The --service-id parameter ensures unambiguous manifest/status file selection when multiple services are installed.
2. Project-Scoped Service Identity
Each installed service gets a unique service_id derived from a canonical identity tuple: the workspace root, config file path, and sorted GitLab project URLs. This composite fingerprint prevents collisions even when multiple workspaces share a single global config file — the identity represents what is being synced and where, not just the config location. The hash uses 12 hex characters (48 bits) for collision safety. An optional --name flag allows explicit naming for human readability; if --name collides with an existing service that has a different identity hash (different workspace/config/projects), install fails with an actionable error listing the conflict.
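The identity derivation above can be sketched as follows. The function names and the canonical joining format are illustrative assumptions; only the inputs (workspace root, config path, sorted project URLs), the `[a-z0-9-]` charset for `--name`, and the 12-hex-char SHA-256 truncation come from the plan.

```rust
/// Build the canonical identity string that gets hashed (SHA-256, first 12
/// hex chars) into the service_id. Sorting the URLs makes the tuple stable
/// regardless of config ordering.
fn canonical_identity(workspace_root: &str, config_path: &str, mut project_urls: Vec<String>) -> String {
    project_urls.sort();
    format!("{workspace_root}\n{config_path}\n{}", project_urls.join("\n"))
}

/// Sanitize an explicit --name into the [a-z0-9-] charset.
fn sanitize_name(raw: &str) -> String {
    let mut out = String::new();
    for c in raw.chars() {
        let c = c.to_ascii_lowercase();
        if c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-' {
            out.push(c);
        } else if !out.is_empty() && !out.ends_with('-') {
            out.push('-'); // collapse runs of disallowed chars into one dash
        }
    }
    out.trim_end_matches('-').to_string()
}
```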
3. Stage-Aware Outcome Tracking
The sync pipeline has stages of differing criticality. Issues and MRs are core — their failure constitutes a hard failure. Docs and embeddings are optional — their failure produces a degraded outcome but does not trigger backoff or pause. This ensures data freshness for the most important entities even when peripheral stages have transient problems.
4. Resilient Failure Handling
Errors are classified as transient (retry with backoff) or permanent (pause until user intervention). A circuit breaker trips after a configurable number of consecutive transient failures (default: 10), transitioning to a half_open probe state after a cooldown period (default: 30 minutes). In half_open, one trial run is allowed — if it succeeds, the breaker closes automatically; if it fails, the breaker returns to paused state requiring manual lore service resume. This provides self-healing for systemic but recoverable failures (DNS outages, temporary GitLab maintenance) while still halting on truly persistent problems.
5. Transactional Install
The install process is two-phase: service files are generated and the platform-specific enable command is run first. Only on success is the install manifest written atomically. If the enable command fails, generated files are cleaned up and no manifest is persisted. This prevents a false "installed" state when the scheduler rejects the service configuration.
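The atomic manifest write sequence (tmp file + fsync(file) + rename + fsync(parent dir)) can be sketched with std only; the function name is illustrative:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

/// Sketch of the atomic manifest write described in the plan.
fn write_manifest_atomic(path: &Path, contents: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut f = File::create(&tmp)?;
    f.write_all(contents)?;
    f.sync_all()?; // fsync(file): contents durable before the rename
    fs::rename(&tmp, path)?; // atomic replace on the same filesystem
    if let Some(parent) = path.parent() {
        // fsync(parent dir) so the directory entry survives a crash.
        // Opening a directory read-only works on Unix; ignore failure elsewhere.
        if let Ok(dir) = File::open(parent) {
            let _ = dir.sync_all();
        }
    }
    Ok(())
}
```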
6. Serialized Admin Mutations
All commands that mutate service state (install, uninstall, pause, resume, repair) acquire an admin-level lock — AppLock("service-admin-{service_id}") — before reading or writing manifest/status files. This prevents races between concurrent admin commands (e.g., a user running service pause while an automated tool runs service resume). The admin lock is separate from the sync_pipeline lock, which guards the data pipeline. Legal state transitions:
- `idle -> running`
- `running -> success | degraded | backoff | paused`
- `backoff -> running | paused`
- `paused -> half_open | running` (via `resume`)
- `half_open -> running | paused`
Any transition not in this table is rejected with ServiceCorruptState. The service run entrypoint does NOT acquire the admin lock — it only acquires the sync_pipeline lock to avoid overlapping data writes.
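The transition table can be captured as a total check. The enum below is a sketch; the real implementation would return `ServiceCorruptState` on violation rather than a bool:

```rust
/// Scheduler states from the plan's legal-transition table.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum State { Idle, Running, Success, Degraded, Backoff, Paused, HalfOpen }

/// True iff (from, to) appears in the legal transition table.
fn transition_allowed(from: State, to: State) -> bool {
    use State::*;
    matches!(
        (from, to),
        (Idle, Running)
            | (Running, Success)
            | (Running, Degraded)
            | (Running, Backoff)
            | (Running, Paused)
            | (Backoff, Running)
            | (Backoff, Paused)
            | (Paused, HalfOpen)
            | (Paused, Running) // only via `lore service resume`
            | (HalfOpen, Running)
            | (HalfOpen, Paused)
    )
}
```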
Commands & User Journeys
lore service install [--interval 30m] [--profile balanced] [--token-source env-file] [--name <optional>] [--dry-run]
What it does: Generates and installs an OS-native scheduled task that runs lore --robot service run --service-id <service_id> at the specified interval, with the chosen sync profile, token storage strategy, and a project-scoped identity to avoid collisions across workspaces.
User journey:
- User runs `lore service install --interval 15m --profile fast`
- CLI loads config to read `gitlab.tokenEnvVar` (default: `GITLAB_TOKEN`)
- CLI resolves the token value from the current environment
- CLI computes or reads `service_id`:
  - If `--name` is provided, use it (sanitized to `[a-z0-9-]`)
  - Otherwise, derive from a composite fingerprint of (workspace root + config path + sorted project URLs) — first 12 hex chars of SHA-256
  - This becomes the suffix for all platform-specific identifiers (launchd label, systemd unit name, Windows task name)
- CLI resolves its own binary path via `std::env::current_exe()?.canonicalize()?`
- CLI writes the token to a user-private env file (`{data_dir}/service-env-{service_id}`, mode 0600) unless `--token-source embedded` is explicitly passed
- CLI generates the platform-specific service files (referencing `lore --robot service run --service-id <service_id>`, NOT `lore sync`)
- CLI writes service files to disk
- CLI runs the platform-specific enable command
- On success: CLI writes the install manifest atomically (tmp file + fsync(file) + rename + fsync(parent dir)) to `{data_dir}/service-manifest-{service_id}.json`
- On failure: CLI removes generated service files, env file, wrapper script, and temp manifest — returns `ServiceCommandFailed` with stderr context
- CLI outputs success with details of what was installed
Sync profiles:
| Profile | Sync flags | Use case |
|---|---|---|
| `fast` | `--no-docs --no-embed` | Minimal: issues + MRs only |
| `balanced` (default) | `--no-embed` | Issues + MRs + doc generation |
| `full` | (none) | Full pipeline including embeddings |
The profile determines what flags are passed to the underlying sync command. The scheduler invocation is always lore --robot service run --service-id <service_id>, which reads the profile from the install manifest and constructs the appropriate sync flags.
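The profile-to-flags mapping can be sketched as a small lookup; `profile_flags` is an illustrative name, not the real function:

```rust
/// Map a profile name from the install manifest to the sync flags it implies.
/// Returns None for unknown profiles (the real CLI would reject these at
/// install time).
fn profile_flags(profile: &str) -> Option<&'static [&'static str]> {
    match profile {
        "fast" => Some(&["--no-docs", "--no-embed"]),
        "balanced" => Some(&["--no-embed"]),
        "full" => Some(&[]), // full pipeline: no restricting flags
        _ => None,
    }
}
```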
Token storage strategies:
| Strategy | Behavior | Security | Platforms |
|---|---|---|---|
| `env-file` (default) | Token written to `{data_dir}/service-env-{service_id}` with 0600 permissions. On Linux/systemd, referenced via `EnvironmentFile=` (true file-based loading). On macOS/launchd, a wrapper shell script (mode 0700) sources the env file at runtime and execs `lore` — the token never appears in the plist. | Token file only readable by owner. Canonical source is the env file; `lore service install` re-reads it on regeneration. | macOS, Linux |
| `embedded` | Token embedded directly in service file. Requires explicit `--token-source embedded` flag. CLI prints a security warning. | Less secure: token visible in plist/unit file. | macOS, Linux |
On Windows, neither strategy applies — the token must be in the user's system environment (set via setx or system settings). token_source is reported as "system_env". This is documented as a requirement in lore service install output on Windows.
Note on macOS wrapper script approach: launchd cannot natively load environment files. Rather than embedding the token directly in the plist (which would persist it in a readable XML file), we generate a small wrapper shell script (`{data_dir}/service-run-{service_id}.sh`, mode 0700) that sources the env file and execs `lore`. The plist's `ProgramArguments` points to the wrapper script, keeping the token out of the plist entirely. On Linux/systemd, `EnvironmentFile=` provides native file-based loading without any wrapper needed.

Future enhancement: On macOS, Keychain integration could eliminate the env file entirely. On Windows, Credential Manager could replace the system environment requirement. These are deferred to a future iteration to avoid adding platform-specific secure store dependencies (`security-framework`, `winapi`) in v1.
Acceptance criteria:
- Parses interval strings: `5m`, `15m`, `30m`, `1h`, `2h`, `12h`, `24h`
- Rejects intervals < 5 minutes or > 24 hours
- Rejects non-numeric or malformed intervals with clear error messages
- Computes `service_id` from composite fingerprint (workspace root + config path + project URLs) or `--name` flag; sanitizes to `[a-z0-9-]`. If `--name` collides with an existing service with a different identity hash, returns an actionable error.
- If already installed (manifest exists for this `service_id`): reads existing manifest. If config matches, reports `no_change: true`. If config differs, overwrites and reports what changed.
- If `GITLAB_TOKEN` (or configured env var) is not set, fails with `TokenNotSet` error
- If `current_exe()` fails, returns `ServiceError`
- Creates parent directories for service files if they don't exist
- Writes install manifest atomically (tmp file + fsync(file) + rename + fsync(parent dir)) alongside service files
- Runs `service doctor` checks as a pre-flight: validates scheduler prerequisites (e.g., systemd user manager/linger on Linux, GUI session context on macOS) and surfaces warnings or errors before installing
- `--dry-run`: validates config/token/prereqs, renders service files and planned commands, but writes nothing and executes nothing. Robot output includes `"dry_run": true` and the rendered service file content for inspection.
- Robot mode outputs `{"ok":true,"data":{...},"meta":{"elapsed_ms":N}}`
- Human mode outputs a clear summary with file paths and next steps
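The interval-parsing rules above can be enforced with a small parser. This is a sketch (the function name and error strings are illustrative); only the `<n>m`/`<n>h` forms and the 5-minute/24-hour bounds come from the acceptance criteria:

```rust
/// Parse interval strings like "15m" or "2h" into seconds, enforcing the
/// 5-minute..24-hour bounds from the acceptance criteria.
fn parse_interval(s: &str) -> Result<u64, String> {
    let s = s.trim();
    // Split off the trailing unit character.
    let (num, unit) = s.split_at(s.len().saturating_sub(1));
    let mult = match unit {
        "m" => 60,
        "h" => 3600,
        _ => return Err(format!("malformed interval '{s}': expected <n>m or <n>h")),
    };
    let n: u64 = num.parse().map_err(|_| format!("malformed interval '{s}'"))?;
    let secs = n.checked_mul(mult).ok_or_else(|| format!("malformed interval '{s}'"))?;
    if secs < 300 {
        return Err("interval must be at least 5 minutes".to_string());
    }
    if secs > 86_400 {
        return Err("interval must be at most 24 hours".to_string());
    }
    Ok(secs)
}
```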
Robot output:
{
"ok": true,
"data": {
"platform": "launchd",
"service_id": "a1b2c3d4e5f6",
"interval_seconds": 900,
"profile": "fast",
"binary_path": "/usr/local/bin/lore",
"config_path": null,
"service_files": ["/Users/x/Library/LaunchAgents/com.gitlore.sync.a1b2c3d4e5f6.plist"],
"sync_command": "/usr/local/bin/lore --robot service run --service-id a1b2c3d4e5f6",
"token_env_var": "GITLAB_TOKEN",
"token_source": "env_file",
"no_change": false
},
"meta": { "elapsed_ms": 42 }
}
Human output:
Service installed:
Platform: launchd
Service ID: a1b2c3d4e5f6
Interval: 15m (900s)
Profile: fast (--no-docs --no-embed)
Binary: /usr/local/bin/lore
Service: ~/Library/LaunchAgents/com.gitlore.sync.a1b2c3d4e5f6.plist
Command: lore --robot service run --service-id a1b2c3d4e5f6
Token: stored in ~/.local/share/lore/service-env-a1b2c3d4e5f6 (0600)
To rotate your token: lore service install
lore service list
What it does: Lists all installed services discovered from {data_dir}/service-manifest-*.json files. Useful when managing multiple gitlore workspaces to see all active installations at a glance.
User journey:
- User runs `lore service list`
- CLI scans `{data_dir}` for files matching `service-manifest-*.json`
- Reads each manifest and verifies platform state
- Outputs summary of all installed services
Acceptance criteria:
- Returns empty list (not error) when no services installed
- Shows `service_id`, `platform`, `interval`, `profile`, `installed_at_iso` for each
- Verifies platform state matches manifest (flags drift)
- Robot and human output modes
Robot output:
{
"ok": true,
"data": {
"services": [
{
"service_id": "a1b2c3d4e5f6",
"platform": "launchd",
"interval_seconds": 900,
"profile": "fast",
"installed_at_iso": "2026-02-09T10:00:00Z",
"platform_state": "loaded",
"drift": false
}
]
},
"meta": { "elapsed_ms": 15 }
}
Human output:
Installed services:
a1b2c3d4e5f6 launchd 15m fast installed 2026-02-09 loaded
Or when none installed:
No services installed. Run: lore service install
lore service uninstall [--service <service_id|name>] [--all]
What it does: Disables and removes the scheduled task, its manifest, and its token env file.
User journey:
- User runs `lore service uninstall`
- CLI resolves target service: uses `--service` if provided, otherwise derives `service_id` from current project config. If multiple manifests exist and no selector is provided, returns an actionable error listing available services with `lore service list`.
- If manifest doesn't exist, checks platform directly; if not installed, exits cleanly with informational message (exit 0, not an error)
- Runs platform-specific disable command
- Removes service files from disk
- Removes install manifest (`service-manifest-{service_id}.json`)
- Removes token env file (`service-env-{service_id}`) if it exists
- Does NOT remove the status file or log files (those are operational data, not config)
- Outputs confirmation
Acceptance criteria:
- Idempotent: running when not installed is not an error
- Removes ALL service files (timer + service on systemd), the install manifest, and the token env file
- Does NOT remove the status file or log files (those are data, not config)
- If platform disable command fails (e.g., service was already unloaded), still removes files and succeeds
- Robot and human output modes
Robot output:
{
"ok": true,
"data": {
"was_installed": true,
"service_id": "a1b2c3d4e5f6",
"platform": "launchd",
"removed_files": [
"/Users/x/Library/LaunchAgents/com.gitlore.sync.a1b2c3d4e5f6.plist",
"/Users/x/.local/share/lore/service-manifest-a1b2c3d4e5f6.json",
"/Users/x/.local/share/lore/service-env-a1b2c3d4e5f6"
]
},
"meta": { "elapsed_ms": 15 }
}
Human output:
Service uninstalled (a1b2c3d4e5f6):
Removed: ~/Library/LaunchAgents/com.gitlore.sync.a1b2c3d4e5f6.plist
Removed: ~/.local/share/lore/service-manifest-a1b2c3d4e5f6.json
Removed: ~/.local/share/lore/service-env-a1b2c3d4e5f6
Kept: ~/.local/share/lore/sync-status-a1b2c3d4e5f6.json (run history)
Kept: ~/.local/share/lore/logs/ (service logs)
Or if not installed:
Service is not installed. Nothing to do.
lore service status [--service <service_id|name>]
What it does: Shows install state, scheduler state (running/backoff/paused/half_open/idle), last sync result, recent run history, and next run estimate. Resolves target service via --service flag or current-project-derived default.
User journey:
- User runs `lore service status`
- CLI resolves target service: uses `--service` if provided, otherwise derives `service_id` from current project config. If multiple manifests exist and no selector is provided, returns an actionable error listing available services with `lore service list`.
- CLI reads install manifest from `{data_dir}/service-manifest-{service_id}.json`
- If installed, verifies platform state matches manifest (detects drift)
- Reads `{data_dir}/sync-status-{service_id}.json` for last sync and recent run history
- Queries platform for service state and next run time
- Computes scheduler state from status file + backoff logic
- Outputs combined status
Scheduler states:
- `idle` — installed but no runs yet
- `running` — currently executing (`sync_pipeline` lock held, `current_run` metadata present with recent `started_at_ms`)
- `running_stale` — `current_run` metadata exists but the process (by PID) is no longer alive, or `started_at_ms` is older than 30 minutes. Indicates a crashed or killed previous run. `lore service status` reports this with the stale run's start time and PID for diagnostics.
- `degraded` — last run completed but one or more optional stages failed (docs/embeddings). Core data (issues/MRs) is fresh.
- `backoff` — transient failures, waiting to retry
- `half_open` — circuit breaker cooldown expired; one probe run is allowed. If it succeeds, the breaker closes automatically and state returns to normal. If it fails, state transitions to `paused`.
- `paused` — permanent error detected (bad token, config error) OR circuit breaker tripped and probe failed. Requires user intervention via `lore service resume`.
- `not_installed` — service not installed
Acceptance criteria:
- Works even if service is not installed (shows `installed: false`, `scheduler_state: "not_installed"`)
- Works even if status file doesn't exist (shows `last_sync: null`)
- Shows backoff state with remaining time if in backoff
- Shows paused reason if in paused state
- Includes recent runs summary (last 5 runs)
- Shows next scheduled run if determinable from platform
- Detects drift at multiple levels:
  - Platform drift: loaded/unloaded mismatch between manifest and OS scheduler
  - Spec drift: SHA-256 hash of service file content on disk doesn't match `spec_hash` in manifest (detects manual edits to plist/unit files)
  - Command drift: sync command in service file differs from manifest's `sync_command`
- Exit code 0 always (status is informational)
Robot output:
{
"ok": true,
"data": {
"installed": true,
"service_id": "a1b2c3d4e5f6",
"platform": "launchd",
"interval_seconds": 1800,
"profile": "balanced",
"service_state": "loaded",
"scheduler_state": "running",
"last_sync": {
"timestamp_iso": "2026-02-09T10:30:00.000Z",
"duration_seconds": 12.5,
"outcome": "success",
"stage_results": [
{ "stage": "issues", "success": true, "items_updated": 5 },
{ "stage": "mrs", "success": true, "items_updated": 3 },
{ "stage": "docs", "success": true, "items_updated": 12 }
],
"consecutive_failures": 0
},
"recent_runs": [
{ "timestamp_iso": "2026-02-09T10:30:00Z", "outcome": "success", "duration_seconds": 12.5 },
{ "timestamp_iso": "2026-02-09T10:00:00Z", "outcome": "success", "duration_seconds": 11.8 }
],
"backoff": null,
"paused_reason": null,
"drift": {
"platform_drift": false,
"spec_drift": false,
"command_drift": false
}
},
"meta": { "elapsed_ms": 15 }
}
When degraded (optional stages failed):
"scheduler_state": "degraded",
"last_sync": {
"outcome": "degraded",
"stage_results": [
{ "stage": "issues", "success": true, "items_updated": 5 },
{ "stage": "mrs", "success": true, "items_updated": 3 },
{ "stage": "docs", "success": false, "error": "I/O error writing documents" }
]
}
When in backoff:
"scheduler_state": "backoff",
"backoff": {
"consecutive_failures": 3,
"next_retry_iso": "2026-02-09T14:30:00.000Z",
"remaining_seconds": 7200
}
When paused (permanent error):
"scheduler_state": "paused",
"paused_reason": "AUTH_FAILED: GitLab returned 401 Unauthorized. Run: lore service resume"
When paused (circuit breaker):
"scheduler_state": "paused",
"paused_reason": "CIRCUIT_BREAKER: 10 consecutive transient failures (last: NetworkError). Run: lore service resume"
When in half-open (circuit breaker cooldown expired, probe pending):
"scheduler_state": "half_open",
"backoff": {
"consecutive_failures": 10,
"circuit_breaker_cooldown_expired": true,
"message": "Circuit breaker cooldown expired. Next run will be a probe attempt."
}
Human output:
Service status (a1b2c3d4e5f6):
Installed: yes
Platform: launchd
Interval: 30m (1800s)
Profile: balanced
State: loaded
Scheduler: running
Last sync:
Time: 2026-02-09 10:30:00 UTC
Duration: 12.5s
Outcome: success
Stages: issues (5), mrs (3), docs (12)
Failures: 0 consecutive
Recent runs (last 5):
10:30 UTC success 12.5s
10:00 UTC success 11.8s
When degraded:
Scheduler: DEGRADED
Core stages OK: issues (5), mrs (3)
Failed stages: docs (I/O error writing documents)
Core data is fresh. Optional stages will retry next run.
When paused (permanent error):
Scheduler: PAUSED - AUTH_FAILED
GitLab returned 401 Unauthorized
Fix: rotate token, then run: lore service resume
When paused (circuit breaker):
Scheduler: PAUSED - CIRCUIT_BREAKER
10 consecutive transient failures (last: NetworkError)
Fix: check network/GitLab availability, then run: lore service resume
When half-open (circuit breaker cooldown expired):
Scheduler: HALF_OPEN
Circuit breaker cooldown expired. Next run will probe.
If probe succeeds, scheduler returns to normal.
lore service logs [--tail <n>] [--follow] [--open] [--service <service_id|name>]
What it does: Displays or streams the service log file. By default, prints the last 100 lines to stdout. With --tail <n>, shows the last N lines. With --follow, streams new lines as they arrive (like tail -f). With --open, opens in the user's preferred editor.
User journey (default):
- User runs `lore service logs`
- CLI determines log path: `{data_dir}/logs/service-{service_id}-stderr.log`
- CLI checks if file exists; if not, outputs "No log file found yet" with the expected path
- Prints last 100 lines to stdout

User journey (--open):
- User runs `lore service logs --open`
- CLI determines editor: `$VISUAL` -> `$EDITOR` -> `less` (Unix) / `notepad` (Windows)
- Spawns editor as child process, waits for exit
- Exits with editor's exit code

User journey (--tail / --follow):
- User runs `lore service logs --tail 50` or `lore service logs --follow`
- CLI reads the last N lines or streams with follow
- Outputs directly to stdout
Log rotation: Rotate service-{service_id}-stdout.log and service-{service_id}-stderr.log at 10 MB, keeping 5 rotated files. Rotation is checked at 10 MB, not at every write. This avoids creating many small files and prevents log file explosion.
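The rotation scheme can be sketched as a size-triggered shift. This is an illustrative sketch: the function names are assumptions, and `max_bytes` is a parameter here (rather than the plan's fixed 10 MB) purely for testability:

```rust
use std::fs;
use std::path::{Path, PathBuf};

const KEEP: usize = 5; // number of rotated files to keep (from the plan)

/// "service-x-stderr.log" -> "service-x-stderr.log.3", etc.
fn rotated(log: &Path, i: usize) -> PathBuf {
    let mut s = log.as_os_str().to_os_string();
    s.push(format!(".{i}"));
    PathBuf::from(s)
}

/// When the active log reaches `max_bytes` (10 MB in the plan), shift
/// log -> log.1 -> ... -> log.5, dropping the oldest rotated file.
fn rotate_if_needed(log: &Path, max_bytes: u64) -> std::io::Result<()> {
    let size = fs::metadata(log).map(|m| m.len()).unwrap_or(0);
    if size < max_bytes {
        return Ok(()); // below threshold: nothing to do
    }
    let _ = fs::remove_file(rotated(log, KEEP)); // drop the oldest, if any
    for i in (1..KEEP).rev() {
        let _ = fs::rename(rotated(log, i), rotated(log, i + 1));
    }
    fs::rename(log, rotated(log, 1)) // a fresh log starts on the next write
}
```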
Acceptance criteria:
- Default (no flags): prints last 100 lines to stdout
- `--open`: falls back through `VISUAL` -> `EDITOR` -> `less` -> `notepad`. If no editor and no `less` available, returns `ServiceError` with suggestion.
- `--tail <n>` shows last N lines (default 100 if no value), exits immediately
- `--follow` streams new log lines until Ctrl-C (like `tail -f`); mutually exclusive with `--open`
- `--tail` and `--follow` can be combined: show last N lines then follow
- In robot mode, outputs the log file path and optionally last N lines as JSON (never opens editor)
Robot output (does not open editor):
{
"ok": true,
"data": {
"log_path": "/Users/x/.local/share/lore/logs/service-a1b2c3d4e5f6-stderr.log",
"exists": true,
"size_bytes": 4096,
"last_lines": ["2026-02-09T10:30:00Z sync completed in 12.5s", "..."]
},
"meta": { "elapsed_ms": 1 }
}
The last_lines field is included when --tail is specified in robot mode (capped at 100 lines to avoid bloated JSON). Without --tail, only path metadata is returned. --follow is not supported in robot mode (returns error: "follow mode requires interactive terminal").
lore service doctor
What it does: Validates that the service environment is healthy: scheduler prerequisites, token validity, file permissions, config accessibility, and platform-specific readiness.
User journey:
- User runs `lore service doctor` (or it runs automatically as a pre-flight during `service install`)
- CLI runs a series of diagnostic checks and reports pass/warn/fail for each
Diagnostic checks:
- Config accessible — can load and parse `config.json`
- Token present — configured env var is set and non-empty
- Token valid — quick auth test against GitLab API (optional, skipped with `--offline`)
- Binary path — `current_exe()` resolves and is executable
- Data directory — writable by current user
- Platform prerequisites:
  - macOS: running in a GUI login session (launchd bootstrap domain is `gui/{uid}`, not `system`)
  - Linux: `systemctl --user` is available; user manager is running; `loginctl enable-linger` is active (required for timers to fire when the user is not logged in)
  - Windows: `schtasks` is available
- Existing install — if manifest exists, verify platform state matches (drift detection)
Acceptance criteria:
- Each check reports: `pass`, `warn`, or `fail`
- Warnings are non-blocking (e.g., linger not enabled — timer works when logged in but not on reboot)
- Failures are blocking for `service install` (install aborts with actionable message)
- `--offline` skips network checks (token validation)
- `--fix` attempts safe, non-destructive remediations for fixable issues: create missing directories, correct file permissions on env/wrapper files (0600/0700), run `systemctl --user daemon-reload` when applicable. Reports each applied fix in the output. Does NOT attempt fixes that could cause data loss.
- Exit code: 0 if all pass/warn, non-zero if any fail
Robot output:
{
"ok": true,
"data": {
"checks": [
{ "name": "config", "status": "pass" },
{ "name": "token_present", "status": "pass" },
{ "name": "token_valid", "status": "pass" },
{ "name": "binary_path", "status": "pass" },
{ "name": "data_directory", "status": "pass" },
{ "name": "platform_prerequisites", "status": "warn", "message": "loginctl linger not enabled; timer will not fire on reboot without active session", "action": "loginctl enable-linger $(whoami)" },
{ "name": "install_state", "status": "pass" }
],
"overall": "warn"
},
"meta": { "elapsed_ms": 850 }
}
Human output:
Service doctor:
[PASS] Config loaded from ~/.config/lore/config.json
[PASS] GITLAB_TOKEN is set
[PASS] GitLab authentication successful
[PASS] Binary: /usr/local/bin/lore
[PASS] Data dir: ~/.local/share/lore/ (writable)
[WARN] loginctl linger not enabled
Timer will not fire on reboot without active session
Fix: loginctl enable-linger $(whoami)
[PASS] No existing install detected
Overall: WARN (1 warning)
lore service run (hidden/internal)
What it does: Executes one scheduled sync attempt with full service-level policy. This is the command the OS scheduler actually invokes — users should never need to call it directly.
Invocation by scheduler: lore --robot service run --service-id <service_id>
Execution flow:
1. Read install manifest for the given `service_id` to determine profile, interval, and circuit breaker config
2. Read status file (service-scoped)
3. If paused (not `half_open`): check if circuit breaker cooldown has expired. If cooldown expired, transition to `half_open` and allow probe (continue to step 5). If cooldown still active or paused for permanent error, log reason, write status, exit 0.
4. If in backoff window: log skip reason, write status, exit 0
5. Acquire `sync_pipeline` AppLock (prevents overlap with manual sync or another scheduled run)
6. If lock acquisition fails (another sync running): log, exit 0
7. Execute sync pipeline with flags derived from profile
8. On success: reset `consecutive_failures` to 0, write status, release lock
9. On transient failure: increment `consecutive_failures`, compute next backoff, write status, release lock
10. On permanent failure: set `paused_reason`, write status, release lock
Stage-aware execution:
The sync pipeline is executed stage-by-stage, with each stage's outcome recorded independently:
| Stage | Criticality | Failure behavior | In-run retry |
|---|---|---|---|
| `issues` | core | Hard failure — triggers backoff/pause | 1 retry on transient errors (1-5s jittered delay) |
| `mrs` | core | Hard failure — triggers backoff/pause | 1 retry on transient errors (1-5s jittered delay) |
| `docs` | optional | Degraded outcome — logged but does not trigger backoff | No retry (best-effort) |
| `embeddings` | optional | Degraded outcome — logged but does not trigger backoff | No retry (best-effort) |
In-run retries for core stages: Before counting a core stage failure toward backoff/circuit-breaker, the service runner retries the stage once with a jittered delay of 1-5 seconds. This absorbs transient network blips (DNS hiccups, momentary 5xx responses) without extending run duration significantly. Only transient errors are retried — permanent errors (bad token, config errors) are never retried. If the retry succeeds, the stage is recorded as successful. If both attempts fail, the final error is used for classification. This significantly reduces false backoff triggers from brief network interruptions.
If all core stages succeed (potentially after retry) but optional stages fail, the run outcome is "degraded" — consecutive failures are NOT incremented, and the scheduler state reflects degraded rather than backoff. This ensures data freshness for the most important entities even when peripheral stages have transient problems.
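The retry-once policy for core stages can be sketched as follows. The retry-once rule and the 1-5s jittered delay come from the plan; the clock-based jitter source is an illustrative stand-in for a proper RNG, and the function names are assumptions:

```rust
use std::thread;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// 1-5s jittered delay (clock-derived jitter is a stand-in for a real RNG).
fn jitter_delay() -> Duration {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .subsec_nanos() as u64;
    Duration::from_millis(1_000 + nanos % 4_001) // 1000..=5000 ms
}

/// One extra attempt on transient errors; none on permanent ones.
fn run_core_stage_with_retry<E>(
    stage: &mut dyn FnMut() -> Result<(), E>,
    is_transient: fn(&E) -> bool,
) -> Result<(), E> {
    match stage() {
        Ok(()) => Ok(()),
        Err(e) if is_transient(&e) => {
            thread::sleep(jitter_delay()); // absorb brief network blips
            stage() // second and final attempt; its error drives classification
        }
        Err(e) => Err(e), // permanent: never retried
    }
}
```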
Transient vs permanent error classification:
| Error type | Classification | Examples |
|---|---|---|
| Transient | Retry with backoff | Network timeout, DB locked, 5xx from GitLab |
| Transient (hinted) | Respect server retry hint | Rate limited with Retry-After or X-RateLimit-Reset header |
| Permanent | Pause until user action | 401 Unauthorized (bad token), config not found, config invalid, migration failed |
The classification is determined by the ErrorCode of the underlying LoreError:
- Permanent: `TokenNotSet`, `AuthFailed`, `ConfigNotFound`, `ConfigInvalid`, `MigrationFailed`
- Transient: everything else (`NetworkError`, `RateLimited`, `DbLocked`, `DbError`, `InternalError`, etc.)
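The classification rule reduces to a single match on the error code. Matching on strings here is a sketch; the real code would match on the `LoreError`/`ErrorCode` enum directly:

```rust
/// Transient/permanent split keyed on the ErrorCode names listed above.
#[derive(Debug, PartialEq, Eq)]
enum ErrorClass {
    Permanent, // pause until user intervention
    Transient, // retry with backoff
}

fn classify(error_code: &str) -> ErrorClass {
    match error_code {
        "TokenNotSet" | "AuthFailed" | "ConfigNotFound" | "ConfigInvalid" | "MigrationFailed" => {
            ErrorClass::Permanent
        }
        // NetworkError, RateLimited, DbLocked, DbError, InternalError, ...
        _ => ErrorClass::Transient,
    }
}
```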
Key design decisions:
- `next_retry_at_ms` is computed once on failure and persisted — `service status` simply reads it for stable, consistent display
- Retry-After awareness: if a transient error includes a server-provided retry hint (e.g., `Retry-After` header on 429 responses, `X-RateLimit-Reset` on GitLab rate limits), the backoff is set to `max(computed_backoff, hinted_retry_at)`. This prevents useless retries during rate-limit windows and respects GitLab's guidance. The `backoff_reason` field (if present) indicates whether the backoff was server-hinted.
- Backoff base is the configured interval, not a hardcoded 1800s — a user with `--interval 5m` gets shorter backoffs than one with `--interval 1h`
- Optional stage failures produce `degraded` outcome without triggering backoff
- Respects backoff window from previous failures (reads `next_retry_at_ms` from status file)
- Pauses on permanent errors instead of burning retries
- Trips circuit breaker after 10 consecutive transient failures
- Exit code is always 0 (the scheduler should not interpret exit codes as retry signals — lore manages its own retry logic)
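The backoff computation can be sketched as below. From the plan: the base is the configured interval (not a hardcoded 1800s) and a server hint wins via max(). The exponential doubling, the exponent cap, and the 24-hour ceiling are illustrative assumptions:

```rust
/// Compute the persisted next_retry_at_ms after a transient failure.
fn next_retry_at_ms(
    now_ms: u64,
    interval_seconds: u64,
    consecutive_failures: u32,
    hinted_retry_at_ms: Option<u64>, // from Retry-After / X-RateLimit-Reset
) -> u64 {
    let exponent = consecutive_failures.saturating_sub(1).min(6); // assumed cap
    let delay_s = (interval_seconds << exponent).min(24 * 3600); // assumed 24h ceiling
    let computed = now_ms + delay_s * 1_000;
    match hinted_retry_at_ms {
        Some(hint) => computed.max(hint), // max(computed_backoff, hinted_retry_at)
        None => computed,
    }
}
```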
Circuit breaker (with half-open recovery):
After max_transient_failures consecutive transient failures (default: 10), the service transitions to paused state with reason CIRCUIT_BREAKER. However, instead of requiring manual intervention forever, the circuit breaker enters a half_open state after a cooldown period (circuit_breaker_cooldown_seconds, default: 1800 = 30 minutes).
In half_open, the next service run invocation is allowed to proceed as a probe:
- If the probe succeeds or returns `degraded`, the circuit breaker closes automatically: `consecutive_failures` resets to 0, `paused_reason` is cleared, and normal operation resumes.
- If the probe fails, the circuit breaker returns to `paused` state with an updated `circuit_breaker_paused_at_ms` timestamp, starting another cooldown period.
This provides self-healing for recoverable systemic failures (DNS outages, GitLab maintenance windows) without requiring manual lore service resume for every transient hiccup. Truly persistent problems (bad token, config corruption) are caught by the permanent error classifier and go directly to paused without the half-open mechanism.
The circuit_breaker_cooldown_seconds is stored in the manifest alongside max_transient_failures. Both are hardcoded defaults for v1 (10 failures, 30-minute cooldown) but can be made configurable in a future iteration.
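The trip/cooldown/probe flow can be sketched as a small state machine using the v1 defaults (10 failures, 30-minute cooldown). Field and method names are illustrative:

```rust
const MAX_TRANSIENT_FAILURES: u32 = 10; // v1 default
const COOLDOWN_MS: u64 = 30 * 60 * 1000; // v1 default: 30 minutes

/// Sketch of the circuit breaker's paused -> half_open -> closed/paused flow.
struct Breaker {
    consecutive_failures: u32,
    paused_at_ms: Option<u64>, // set when the breaker trips
    half_open: bool,
}

impl Breaker {
    /// True if this run may proceed (normal run, or a half-open probe).
    fn allow_run(&mut self, now_ms: u64) -> bool {
        match self.paused_at_ms {
            None => true,
            Some(t) if now_ms >= t + COOLDOWN_MS => {
                self.half_open = true; // cooldown expired: allow one probe
                true
            }
            Some(_) => false, // still cooling down
        }
    }

    /// Probe (or normal run) succeeded or was degraded: breaker closes.
    fn on_success(&mut self) {
        self.consecutive_failures = 0;
        self.paused_at_ms = None;
        self.half_open = false;
    }

    /// Transient failure: count it; trip on threshold or on a failed probe.
    fn on_transient_failure(&mut self, now_ms: u64) {
        self.consecutive_failures += 1;
        if self.half_open || self.consecutive_failures >= MAX_TRANSIENT_FAILURES {
            self.paused_at_ms = Some(now_ms); // (re-)trip: new cooldown window
            self.half_open = false;
        }
    }
}
```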
Acceptance criteria:
- Hidden from `--help` (use `#[command(hide = true)]`)
- Always runs in robot mode regardless of `--robot` flag
- Acquires pipeline-level lock before executing sync
- Executes stages independently and records per-stage outcomes
- Retries transient core stage failures once (1-5s jittered delay) before counting as failed
- Permanent core stage errors are never retried — immediate pause
- Classifies core stage errors as transient or permanent
- Optional stage failures produce `degraded` outcome without triggering backoff
- Respects backoff window from previous failures (reads `next_retry_at_ms` from status file)
- Pauses on permanent errors instead of burning retries
- Trips circuit breaker after 10 consecutive transient failures
- Exit code is always 0 (the scheduler should not interpret exit codes as retry signals — lore manages its own retry logic)
Robot output (success):
{
"ok": true,
"data": {
"action": "sync_completed",
"outcome": "success",
"profile": "balanced",
"duration_seconds": 45.2,
"stage_results": [
{ "stage": "issues", "success": true, "items_updated": 12 },
{ "stage": "mrs", "success": true, "items_updated": 4 },
{ "stage": "docs", "success": true, "items_updated": 28 }
],
"consecutive_failures": 0
},
"meta": { "elapsed_ms": 45200 }
}
Robot output (degraded — optional stages failed):
{
"ok": true,
"data": {
"action": "sync_completed",
"outcome": "degraded",
"profile": "full",
"duration_seconds": 38.1,
"stage_results": [
{ "stage": "issues", "success": true, "items_updated": 12 },
{ "stage": "mrs", "success": true, "items_updated": 4 },
{ "stage": "docs", "success": true, "items_updated": 28 },
{ "stage": "embeddings", "success": false, "error": "Ollama unavailable" }
],
"consecutive_failures": 0
},
"meta": { "elapsed_ms": 38100 }
}
Robot output (skipped — backoff):
{
"ok": true,
"data": {
"action": "skipped",
"reason": "backoff",
"consecutive_failures": 3,
"next_retry_iso": "2026-02-09T14:30:00.000Z",
"remaining_seconds": 1842
},
"meta": { "elapsed_ms": 1 }
}
Robot output (paused — permanent error):
{
"ok": true,
"data": {
"action": "paused",
"reason": "AUTH_FAILED",
"message": "GitLab returned 401 Unauthorized",
"suggestion": "Rotate token, then run: lore service resume"
},
"meta": { "elapsed_ms": 1200 }
}
Robot output (paused — circuit breaker):
{
"ok": true,
"data": {
"action": "paused",
"reason": "CIRCUIT_BREAKER",
"message": "10 consecutive transient failures (last: NetworkError: connection refused)",
"consecutive_failures": 10,
"suggestion": "Check network/GitLab availability, then run: lore service resume"
},
"meta": { "elapsed_ms": 1200 }
}
lore service resume [--service <service_id|name>]
What it does: Clears the paused state (including half-open circuit breaker) and resets consecutive failures, allowing the scheduler to retry on the next interval.
User journey:
- User sees `lore service status` reports `scheduler: PAUSED`
- User fixes the underlying issue (rotates token, fixes config, etc.)
- User runs `lore service resume`
- CLI resets `consecutive_failures` to 0, clears `paused_reason` and `last_error_*` fields
- Next scheduled `service run` will attempt sync normally
Acceptance criteria:
- If not paused, exits cleanly with informational message ("Service is not paused")
- If not installed, exits cleanly with informational message ("Service is not installed")
- Does NOT trigger an immediate sync (just clears state — scheduler handles the next run)
- Robot and human output modes
Robot output:
{
"ok": true,
"data": {
"was_paused": true,
"previous_reason": "AUTH_FAILED",
"consecutive_failures_cleared": 5
},
"meta": { "elapsed_ms": 2 }
}
Human output:
Service resumed:
Previous state: PAUSED (AUTH_FAILED)
Failures cleared: 5
Next sync will run at the scheduled interval.
Or for circuit breaker:
Service resumed:
Previous state: PAUSED (CIRCUIT_BREAKER, 10 transient failures)
Failures cleared: 10
Next sync will run at the scheduled interval.
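The resume path is a pure state reset. A minimal sketch, using a trimmed stand-in for `SyncStatusFile` (field names follow the schema below; `resume` and its return tuple are illustrative, shaped to feed the robot output):

```rust
/// Trimmed stand-in for SyncStatusFile — just the fields resume touches.
#[derive(Default)]
struct Status {
    consecutive_failures: u32,
    next_retry_at_ms: Option<i64>,
    paused_reason: Option<String>,
    circuit_breaker_paused_at_ms: Option<i64>,
    last_error_code: Option<String>,
    last_error_message: Option<String>,
}

/// Clear pause/backoff state without triggering a sync.
/// Returns (was_paused, previous_reason, failures_cleared).
fn resume(status: &mut Status) -> (bool, Option<String>, u32) {
    let was_paused = status.paused_reason.is_some();
    let previous_reason = status.paused_reason.take();
    let cleared = status.consecutive_failures;
    status.consecutive_failures = 0;
    status.next_retry_at_ms = None;
    status.circuit_breaker_paused_at_ms = None;
    status.last_error_code = None;
    status.last_error_message = None;
    (was_paused, previous_reason, cleared)
}

fn main() {
    let mut s = Status {
        consecutive_failures: 5,
        paused_reason: Some("AUTH_FAILED".into()),
        ..Default::default()
    };
    let (was_paused, reason, cleared) = resume(&mut s);
    assert!(was_paused);
    assert_eq!(reason.as_deref(), Some("AUTH_FAILED"));
    assert_eq!(cleared, 5);
    assert_eq!(s.consecutive_failures, 0);
    assert!(s.paused_reason.is_none());
}
```

Because resume only mutates the status file, the next run happens on the scheduler's cadence — exactly the "does NOT trigger an immediate sync" criterion.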
lore service pause [--reason <text>] [--service <service_id|name>]
What it does: Pauses scheduled execution without uninstalling the service. Useful for maintenance windows, debugging, or temporarily stopping syncs while the underlying infrastructure is being modified.
User journey:
- User runs `lore service pause --reason "GitLab maintenance window"`
- CLI writes `paused_reason` to the status file with the provided reason (or "Manually paused" if no reason given)
- Next `service run` will see the paused state and exit immediately
Acceptance criteria:
- Sets `paused_reason` in the status file
- Does NOT modify the OS scheduler (service remains installed and scheduled — it just no-ops)
- If already paused, updates the reason and reports `already_paused: true`
- `lore service resume` clears the pause (same as for other paused states)
- Robot and human output modes
Robot output:
{
"ok": true,
"data": {
"service_id": "a1b2c3d4e5f6",
"paused": true,
"reason": "GitLab maintenance window",
"already_paused": false
},
"meta": { "elapsed_ms": 2 }
}
Human output:
Service paused (a1b2c3d4e5f6):
Reason: GitLab maintenance window
Resume with: lore service resume
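The pause handler is the mirror image of resume: a single status-file mutation. A sketch with an illustrative trimmed `Status` type (the `(effective_reason, already_paused)` return shape matches the robot output fields above):

```rust
/// Trimmed stand-in for SyncStatusFile — only the field pause touches.
struct Status {
    paused_reason: Option<String>,
}

/// Record the pause reason; report whether the service was already paused.
/// "Manually paused" is the documented default when no --reason is given.
fn pause(status: &mut Status, reason: Option<&str>) -> (String, bool) {
    let already_paused = status.paused_reason.is_some();
    let effective = reason.unwrap_or("Manually paused").to_string();
    status.paused_reason = Some(effective.clone());
    (effective, already_paused)
}

fn main() {
    let mut s = Status { paused_reason: None };
    let (r1, already1) = pause(&mut s, Some("GitLab maintenance window"));
    assert_eq!(r1, "GitLab maintenance window");
    assert!(!already1);
    // Pausing again updates the reason and reports already_paused: true.
    let (r2, already2) = pause(&mut s, None);
    assert_eq!(r2, "Manually paused");
    assert!(already2);
}
```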
lore service trigger [--ignore-backoff] [--service <service_id|name>]
What it does: Triggers an immediate one-off sync using the installed service profile and policy. Unlike running lore sync manually, this goes through the service policy layer (status file, stage-aware outcomes, error classification) — giving you the same behavior the scheduler would produce, but on-demand.
User journey:
- User runs `lore service trigger`
- CLI reads the manifest to determine profile
- By default, respects current backoff/paused state (reports skip reason if blocked)
- With `--ignore-backoff`, bypasses backoff window (but NOT paused state — use `resume` for that)
- Executes `handle_service_run` logic
- Updates status file with the run result
Acceptance criteria:
- Uses the installed profile from the manifest
- Default: respects backoff and paused states
- `--ignore-backoff`: bypasses backoff window, still respects paused
- If not installed, returns actionable error
- Robot and human output modes (same format as `service run` output)
lore service repair [--service <service_id|name>]
What it does: Repairs corrupt manifest or status files by backing them up and reinitializing. This is a safe alternative to manually deleting files and reinstalling.
User journey:
- User runs `lore service repair` (typically after seeing `ServiceCorruptState` errors)
- CLI checks manifest and status files for JSON parseability
- If corrupt: renames the corrupt file to `{name}.corrupt.{timestamp}` (backup, not delete)
- Reinitializes the status file to default state
- If manifest is corrupt, reports that reinstallation is needed
- Outputs what was repaired
Acceptance criteria:
- Never deletes files — backs up corrupt files with `.corrupt.{timestamp}` suffix
- If both files are valid, reports "No repair needed" (exit 0)
- If manifest is corrupt, clears it and advises `lore service install`
- If status file is corrupt, reinitializes to default
- Robot and human output modes
Robot output:
{
"ok": true,
"data": {
"repaired": true,
"actions": [
{ "file": "sync-status-a1b2c3d4e5f6.json", "action": "reinitialized", "backup": "sync-status-a1b2c3d4e5f6.json.corrupt.1707480000" }
],
"needs_reinstall": false
},
"meta": { "elapsed_ms": 5 }
}
Human output:
Service repaired (a1b2c3d4e5f6):
Reinitialized: sync-status-a1b2c3d4e5f6.json
Backed up: sync-status-a1b2c3d4e5f6.json.corrupt.1707480000
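The backup-not-delete step can be sketched with std alone. This is a simplified illustration of the rename (`backup_corrupt` is a hypothetical helper name; the real command's error handling and reporting are richer):

```rust
use std::fs;
use std::path::{Path, PathBuf};
use std::time::{SystemTime, UNIX_EPOCH};

/// Rename a corrupt file to `{name}.corrupt.{timestamp}` so the caller can
/// reinitialize the original path. Nothing is ever deleted.
fn backup_corrupt(path: &Path) -> std::io::Result<PathBuf> {
    let ts = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();
    let mut backup = path.as_os_str().to_owned();
    backup.push(format!(".corrupt.{ts}"));
    let backup = PathBuf::from(backup);
    fs::rename(path, &backup)?; // atomic within the same filesystem
    Ok(backup)
}

fn main() {
    let status = std::env::temp_dir().join("sync-status-demo.json");
    fs::write(&status, "{ truncated json").unwrap(); // simulate corruption
    let backup = backup_corrupt(&status).unwrap();
    assert!(!status.exists()); // original path freed for reinitialization
    assert!(backup.exists()); // corrupt content preserved for forensics
    fs::remove_file(backup).unwrap();
}
```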
Install Manifest
Location
{get_data_dir()}/service-manifest-{service_id}.json — e.g., ~/.local/share/lore/service-manifest-a1b2c3d4e5f6.json
Purpose
Avoids brittle parsing of platform-specific files (plist XML, systemd units) to recover install configuration. service status reads the manifest first, then verifies platform state matches. The service_id suffix enables multiple coexisting installations for different workspaces.
Schema
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ServiceManifest {
/// Schema version for forward compatibility (start at 1)
pub schema_version: u32,
/// Stable identity for this service installation
pub service_id: String,
/// Canonical workspace root used in identity derivation
pub workspace_root: String,
/// When the service was first installed
pub installed_at_iso: String,
/// When the manifest was last written
pub updated_at_iso: String,
/// Platform backend
pub platform: String,
/// Configured interval in seconds
pub interval_seconds: u64,
/// Sync profile (fast/balanced/full)
pub profile: String,
/// Absolute path to the lore binary
pub binary_path: String,
/// Optional config path override
#[serde(skip_serializing_if = "Option::is_none")]
pub config_path: Option<String>,
/// How the token is stored
pub token_source: String,
/// Token environment variable name
pub token_env_var: String,
/// Paths to generated service files
pub service_files: Vec<String>,
/// The exact command the scheduler runs
pub sync_command: String,
/// Circuit breaker threshold (consecutive transient failures before pause)
pub max_transient_failures: u32,
/// Cooldown period before circuit breaker enters half-open probe state (seconds)
pub circuit_breaker_cooldown_seconds: u64,
/// SHA-256 hash of generated scheduler artifacts (plist/unit/wrapper content).
/// Used for spec-level drift detection: if file content on disk doesn't match
/// this hash, something external modified the service files.
pub spec_hash: String,
}
service_id derivation
/// Compute a stable service ID from a canonical identity tuple:
/// (workspace_root + config_path + sorted project URLs).
///
/// This avoids collisions when multiple workspaces share one global config
/// by incorporating what is being synced (project URLs) and where the workspace
/// lives alongside the config location.
/// Returns first 12 hex chars of SHA-256 (48 bits — collision-safe for local use).
pub fn compute_service_id(workspace_root: &Path, config_path: &Path, project_urls: &[&str]) -> String {
use sha2::{Sha256, Digest};
let canonical_config = config_path.canonicalize()
.unwrap_or_else(|_| config_path.to_path_buf());
let canonical_workspace = workspace_root.canonicalize()
.unwrap_or_else(|_| workspace_root.to_path_buf());
let mut hasher = Sha256::new();
hasher.update(canonical_workspace.to_string_lossy().as_bytes());
hasher.update(b"\0");
hasher.update(canonical_config.to_string_lossy().as_bytes());
// Sort URLs for determinism regardless of config ordering
let mut urls: Vec<&str> = project_urls.to_vec();
urls.sort_unstable();
for url in &urls {
hasher.update(b"\0"); // separator to prevent concatenation collisions
hasher.update(url.as_bytes());
}
let hash = hasher.finalize();
hex::encode(&hash[..6]) // 12 hex chars
}
/// Sanitize a user-provided name to [a-z0-9-], max 32 chars.
pub fn sanitize_service_name(name: &str) -> Result<String, String> {
let sanitized: String = name.to_lowercase()
.chars()
.map(|c| if c.is_ascii_alphanumeric() || c == '-' { c } else { '-' })
.collect();
let trimmed = sanitized.trim_matches('-').to_string();
if trimmed.is_empty() {
return Err("Service name must contain at least one alphanumeric character".into());
}
if trimmed.len() > 32 {
return Err("Service name must be 32 characters or fewer".into());
}
Ok(trimmed)
}
Read/Write
- `ServiceManifest::read(path: &Path) -> Result<Option<Self>, LoreError>` — returns `Ok(None)` if the file doesn't exist, `Err` if the file exists but is corrupt/unparseable (distinguishes missing from corrupt). Schema migration: if the file has `schema_version < CURRENT_VERSION`, the read method migrates the in-memory model to the current version (adding default values for new fields) and atomically rewrites the file. If the file has an unknown future `schema_version` (higher than current), it returns `Err(ServiceCorruptState)` with an actionable message to update lore.
- `ServiceManifest::write_atomic(&self, path: &Path) -> std::io::Result<()>` — writes to a tmp file in the same directory, fsyncs, then renames over the target. Creates parent dirs if needed.
- Written by `service install`, read by `service status`, `service run`, `service uninstall`
- `service uninstall` removes the manifest file
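The three-way version gate in the read path can be sketched as a pure function. `ReadAction` and `read_action` are illustrative names (the real read method performs the migration inline), and `CURRENT_VERSION = 2` is a hypothetical future bump used so both branches are exercisable — the schema starts at 1:

```rust
/// What to do with a manifest/status file based on its schema_version.
#[derive(Debug, PartialEq)]
enum ReadAction {
    UseAsIs,
    MigrateAndRewrite, // older file: fill defaults, atomically rewrite
    RejectFuture,      // newer file: ServiceCorruptState, advise updating lore
}

const CURRENT_VERSION: u32 = 2; // hypothetical: plan starts schemas at 1

fn read_action(file_version: u32) -> ReadAction {
    if file_version < CURRENT_VERSION {
        ReadAction::MigrateAndRewrite
    } else if file_version == CURRENT_VERSION {
        ReadAction::UseAsIs
    } else {
        ReadAction::RejectFuture
    }
}

fn main() {
    assert_eq!(read_action(1), ReadAction::MigrateAndRewrite);
    assert_eq!(read_action(2), ReadAction::UseAsIs);
    assert_eq!(read_action(3), ReadAction::RejectFuture);
}
```

Rejecting future versions (rather than best-effort parsing) keeps an old binary from silently dropping fields a newer binary wrote.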
Status File
Location
{get_data_dir()}/sync-status-{service_id}.json — e.g., ~/.local/share/lore/sync-status-a1b2c3d4e5f6.json
Add get_service_status_path(service_id: &str) to src/core/paths.rs.
Service-scoped status: Each installed service gets its own status file, keyed by service_id. This prevents cross-service contamination — a fast profile service pausing due to transient errors should not affect a full profile service's state. The pipeline lock remains global (sync_pipeline) to prevent overlapping writes to the shared database.
Schema
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SyncStatusFile {
/// Schema version for forward compatibility (start at 1)
pub schema_version: u32,
/// When this status file was last written
pub updated_at_iso: String,
/// Most recent run result (None if no runs yet — matches idle state)
#[serde(skip_serializing_if = "Option::is_none")]
pub last_run: Option<SyncRunRecord>,
/// Rolling window of recent runs (last 10, newest first)
#[serde(default)]
pub recent_runs: Vec<SyncRunRecord>,
/// Count of consecutive failures (resets to 0 on success or degraded outcome)
pub consecutive_failures: u32,
/// Persisted next retry time (set on failure, cleared on success/resume).
/// Computed once at failure time with jitter, then read-only comparison afterward.
/// This avoids recomputing jitter on every status check.
#[serde(skip_serializing_if = "Option::is_none")]
pub next_retry_at_ms: Option<i64>,
/// If set, service is paused due to a permanent error or circuit breaker
#[serde(skip_serializing_if = "Option::is_none")]
pub paused_reason: Option<String>,
/// Timestamp when circuit breaker entered paused state (for cooldown calculation)
#[serde(skip_serializing_if = "Option::is_none")]
pub circuit_breaker_paused_at_ms: Option<i64>,
/// Error code that caused the pause (for machine consumption)
#[serde(skip_serializing_if = "Option::is_none")]
pub last_error_code: Option<String>,
/// Error message from last failure
#[serde(skip_serializing_if = "Option::is_none")]
pub last_error_message: Option<String>,
/// In-flight run metadata for crash/stale detection. Written to the status file at run start,
/// cleared on completion (success or failure). If present when a new run starts, the previous
/// run crashed or was killed.
#[serde(skip_serializing_if = "Option::is_none")]
pub current_run: Option<CurrentRunState>,
}
/// Metadata for an in-flight sync run. Used to detect stale/crashed runs.
/// Written to the status file at run start, cleared on completion (success or failure).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CurrentRunState {
/// Unix timestamp (ms) when this run started
pub started_at_ms: i64,
/// PID of the process executing this run
pub pid: u32,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SyncRunRecord {
/// ISO-8601 timestamp of this sync run
pub timestamp_iso: String,
/// Unix timestamp in milliseconds
pub timestamp_ms: i64,
/// How long the sync took
pub duration_seconds: f64,
/// Run outcome: "success", "degraded", or "failed"
pub outcome: String,
/// Per-stage results (only present in detailed records, not in recent_runs summary)
#[serde(default, skip_serializing_if = "Vec::is_empty")]
pub stage_results: Vec<StageResult>,
/// Error message if sync failed (None on success/degraded)
#[serde(skip_serializing_if = "Option::is_none")]
pub error_message: Option<String>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StageResult {
/// Stage name: "issues", "mrs", "docs", "embeddings"
pub stage: String,
/// Whether this stage completed successfully
pub success: bool,
/// Number of items created/updated (0 on failure)
#[serde(default)]
pub items_updated: usize,
/// Error message if stage failed
#[serde(skip_serializing_if = "Option::is_none")]
pub error: Option<String>,
/// Machine-readable error code from the underlying LoreError (e.g., "AUTH_FAILED", "NETWORK_ERROR").
/// Propagated through the stage execution layer for reliable error classification.
/// Falls back to string matching on `error` field when not available.
#[serde(skip_serializing_if = "Option::is_none")]
pub error_code: Option<String>,
}
Read/Write
- `SyncStatusFile::read(path: &Path) -> Result<Option<Self>, LoreError>` — returns `Ok(None)` if the file doesn't exist, `Err` if the file exists but is corrupt/unparseable (distinguishes missing from corrupt — a corrupt status file is a warning, not fatal). Schema migration: same behavior as `ServiceManifest::read` — migrates older versions to current, rejects unknown future versions.
- `SyncStatusFile::write_atomic(&self, path: &Path) -> std::io::Result<()>` — writes to a tmp file in the same directory, fsyncs, then renames over the target. Creates parent dirs if needed. Atomic writes prevent truncated JSON from crashes during write.
- `SyncStatusFile::record_run(&mut self, run: SyncRunRecord)` — pushes to `recent_runs` (capped at 10), updates `last_run`
- `SyncStatusFile::clear_paused(&mut self)` — clears `paused_reason`, `circuit_breaker_paused_at_ms`, `last_error_*`, `next_retry_at_ms`, resets `consecutive_failures`
- The file is NOT a fatal error source — if a write fails, log a warning and continue (the sync result matters more than recording it)
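The tmp-fsync-rename pattern behind `write_atomic` can be shown with std alone. A simplified sketch — the real method would use a unique tmp name rather than a fixed `.tmp` extension, and handle errors more carefully:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

/// Write contents atomically: tmp file in the SAME directory, fsync, rename.
/// Same-directory matters — rename is only atomic within one filesystem.
fn write_atomic(path: &Path, contents: &[u8]) -> std::io::Result<()> {
    if let Some(parent) = path.parent() {
        fs::create_dir_all(parent)?;
    }
    let tmp = path.with_extension("tmp");
    let mut f = File::create(&tmp)?;
    f.write_all(contents)?;
    f.sync_all()?; // fsync so the rename never exposes a partial file
    fs::rename(&tmp, path)?;
    Ok(())
}

fn main() {
    let target = std::env::temp_dir().join("lore-atomic-demo.json");
    write_atomic(&target, b"{\"schema_version\":1}").unwrap();
    assert_eq!(fs::read(&target).unwrap(), b"{\"schema_version\":1}");
    // Overwrite: readers see either old or new content, never a truncated mix.
    write_atomic(&target, b"{\"schema_version\":2}").unwrap();
    assert_eq!(fs::read(&target).unwrap(), b"{\"schema_version\":2}");
    fs::remove_file(&target).unwrap();
}
```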
Backoff Logic
Backoff applies only to transient errors and only within service run. Manual lore sync is never subject to backoff. Permanent errors bypass backoff entirely and enter the paused state.
Key design change (from feedback): Instead of recomputing jitter on every service status / service run check, we compute next_retry_at_ms once at failure time and persist it. This makes status output stable, avoids predictable jitter from timestamp-seeded determinism, and simplifies the read path to a single comparison.
/// Injectable time source for deterministic testing.
pub trait Clock: Send + Sync {
fn now_ms(&self) -> i64;
}
/// Production clock using chrono.
pub struct SystemClock;
impl Clock for SystemClock {
fn now_ms(&self) -> i64 {
chrono::Utc::now().timestamp_millis()
}
}
/// Injectable RNG for deterministic jitter tests.
pub trait JitterRng: Send + Sync {
/// Returns a value in [0.0, 1.0)
fn next_f64(&mut self) -> f64;
}
/// Production RNG using thread_rng.
pub struct ThreadJitterRng;
impl JitterRng for ThreadJitterRng {
fn next_f64(&mut self) -> f64 {
use rand::Rng;
rand::thread_rng().gen()
}
}
impl SyncStatusFile {
/// Check if we're still in a backoff window.
/// Returns None if sync should proceed.
/// Returns Some(remaining_seconds) if within backoff window.
/// Reads the persisted `next_retry_at_ms` — no jitter computation on the read path.
pub fn backoff_remaining(&self, clock: &dyn Clock) -> Option<u64> {
// Paused state is handled separately (not via backoff)
if self.paused_reason.is_some() {
return None; // caller checks paused_reason directly
}
if self.consecutive_failures == 0 {
return None;
}
let next_retry = self.next_retry_at_ms?;
let now_ms = clock.now_ms();
if now_ms < next_retry {
Some(((next_retry - now_ms) / 1000) as u64)
} else {
None // backoff expired, proceed
}
}
/// Compute and set next_retry_at_ms after a transient failure.
/// Called once at failure time — jitter is applied here, not on reads.
/// Uses the *configured* interval as the backoff base (not a hardcoded value).
/// If the server provided a retry hint (e.g., Retry-After header), it is
/// respected as a floor: next_retry_at_ms = max(computed_backoff, hint).
pub fn set_backoff(
&mut self,
base_interval_seconds: u64,
clock: &dyn Clock,
rng: &mut dyn JitterRng,
retry_after_ms: Option<i64>,
) {
let exponent = self.consecutive_failures.saturating_sub(1).min(20); // prevent shift overflow (and underflow if called with 0 failures)
let base_backoff = (base_interval_seconds as u128)
.saturating_mul(1u128 << exponent)
.min(4 * 3600) as u64; // cap at 4 hours
// Full jitter: uniform random in [base_interval..cap]
// This decorrelates retries across multiple installations while ensuring
// the minimum backoff is always at least the configured interval.
let jitter_factor = rng.next_f64(); // 0.0..1.0
let min_backoff = base_interval_seconds;
let span = base_backoff.saturating_sub(min_backoff);
let backoff_secs = min_backoff + ((span as f64) * jitter_factor) as u64;
let computed_retry_at = clock.now_ms() + (backoff_secs as i64 * 1000);
// Respect server-provided retry hint as a floor
self.next_retry_at_ms = Some(match retry_after_ms {
Some(hint) => computed_retry_at.max(hint),
None => computed_retry_at,
});
}
}
Key design decisions:
- `next_retry_at_ms` is computed once on failure and persisted — `service status` simply reads it for stable, consistent display
- Backoff base is the configured interval, not a hardcoded 1800s — a user with `--interval 5m` gets shorter backoffs than one with `--interval 1h`
- Full jitter (random in `[base_interval..cap]`) decorrelates retries across multiple installations, avoiding thundering herd
- Injectable `JitterRng` trait enables deterministic testing without seeding from timestamps
- Paused state is checked separately from backoff — they are orthogonal concerns
- `next_retry_at_ms` is cleared on success and on `service resume`
Backoff examples with 30m (1800s) base interval:
| consecutive_failures | max_backoff_seconds | human-readable range |
|---|---|---|
| 1 | 1800 | 30 min (jittered within [30m, 30m]) |
| 2 | 3600 | up to 1 hour (min 30m) |
| 3 | 7200 | up to 2 hours (min 30m) |
| 4 | 14400 | up to 4 hours (capped, min 30m) |
| 5-9 | 14400 | up to 4 hours (capped, min 30m) |
| 10 | — | circuit breaker trips → paused |
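The table rows above fall out of `set_backoff`'s arithmetic when it is factored as a pure function of (failures, interval, jitter draw). A deterministic sketch — `backoff_seconds` is an illustrative refactoring, with the jitter value injected directly instead of via `JitterRng`:

```rust
/// Exponential backoff with a 4-hour cap and full jitter in
/// [base_interval .. base_interval * 2^(failures-1)].
fn backoff_seconds(consecutive_failures: u32, base_interval: u64, jitter: f64) -> u64 {
    let exponent = consecutive_failures.saturating_sub(1).min(20); // prevent shift overflow
    let cap: u128 = 4 * 3600;
    let base_backoff = (base_interval as u128)
        .saturating_mul(1u128 << exponent)
        .min(cap) as u64;
    let span = base_backoff.saturating_sub(base_interval);
    base_interval + ((span as f64) * jitter) as u64
}

fn main() {
    // jitter = 0.0 → the floor is always the configured interval
    assert_eq!(backoff_seconds(1, 1800, 0.0), 1800);
    assert_eq!(backoff_seconds(3, 1800, 0.0), 1800);
    // jitter = 1.0 → the table's max_backoff_seconds column
    assert_eq!(backoff_seconds(2, 1800, 1.0), 3600);
    assert_eq!(backoff_seconds(3, 1800, 1.0), 7200);
    assert_eq!(backoff_seconds(4, 1800, 1.0), 14400); // capped at 4h
    assert_eq!(backoff_seconds(9, 1800, 1.0), 14400); // still capped
    // A shorter configured interval yields proportionally shorter backoffs.
    assert_eq!(backoff_seconds(2, 300, 1.0), 600);
}
```

(The real RNG draws from [0.0, 1.0), so 1.0 here marks the exclusive upper bound of the jitter range.)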
Service Run Implementation (handle_service_run)
Critical: Backoff, error classification, circuit breaker, stage-aware execution, and status file management live only in handle_service_run. The manual handle_sync_cmd is NOT modified — it does not read or write the service status file.
Location: src/cli/commands/service/run.rs
pub fn handle_service_run(service_id: &str, start: std::time::Instant) -> Result<(), Box<dyn std::error::Error>> {
let clock = SystemClock;
let mut rng = ThreadJitterRng;
// 1. Read manifest for the given service_id
let manifest_path = lore::core::paths::get_service_manifest_path(service_id);
let manifest = ServiceManifest::read(&manifest_path)?
.ok_or_else(|| LoreError::ServiceError {
message: format!("Service manifest not found for service_id '{service_id}'. Is the service installed?"),
})?;
// 2. Read status file (service-scoped)
let status_path = lore::core::paths::get_service_status_path(&manifest.service_id);
let mut status = match SyncStatusFile::read(&status_path) {
Ok(Some(s)) => s,
Ok(None) => SyncStatusFile::default(),
Err(e) => {
tracing::warn!(error = %e, "Corrupt status file, starting fresh");
SyncStatusFile::default()
}
};
// 3. Check paused state (permanent error or circuit breaker)
if let Some(reason) = &status.paused_reason {
// Check for circuit breaker half-open transition
let is_circuit_breaker = reason.starts_with("CIRCUIT_BREAKER");
let half_open = is_circuit_breaker
&& status.circuit_breaker_paused_at_ms.map_or(false, |paused_at| {
let cooldown_ms = (manifest.circuit_breaker_cooldown_seconds as i64) * 1000;
clock.now_ms() >= paused_at + cooldown_ms
});
if half_open {
// Cooldown expired — allow probe run (continue to step 5)
tracing::info!("Circuit breaker half-open: allowing probe run");
} else {
print_robot_json(json!({
"ok": true,
"data": {
"action": "paused",
"reason": reason,
"suggestion": if is_circuit_breaker {
format!("Waiting for cooldown ({}s). Or run: lore service resume",
manifest.circuit_breaker_cooldown_seconds)
} else {
"Fix the issue, then run: lore service resume".to_string()
}
},
"meta": { "elapsed_ms": start.elapsed().as_millis() }
}));
return Ok(());
}
}
// 4. Check backoff (reads persisted next_retry_at_ms — no jitter computation)
if let Some(remaining) = status.backoff_remaining(&clock) {
print_robot_json(json!({
"ok": true,
"data": {
"action": "skipped",
"reason": "backoff",
"consecutive_failures": status.consecutive_failures,
"next_retry_iso": status.next_retry_at_ms.map(|ms| {
chrono::DateTime::from_timestamp_millis(ms)
.map(|dt| dt.to_rfc3339())
}),
"remaining_seconds": remaining,
},
"meta": { "elapsed_ms": start.elapsed().as_millis() }
}));
return Ok(());
}
// 5. Acquire pipeline lock
let lock = match AppLock::try_acquire("sync_pipeline", stale_minutes) {
Ok(lock) => lock,
Err(_) => {
print_robot_json(json!({
"ok": true,
"data": { "action": "skipped", "reason": "locked" },
"meta": { "elapsed_ms": start.elapsed().as_millis() }
}));
return Ok(());
}
};
// 6. Write current_run metadata for stale-run detection
status.current_run = Some(CurrentRunState {
started_at_ms: clock.now_ms(),
pid: std::process::id(),
});
let _ = status.write_atomic(&status_path); // best-effort
// 7. Build sync args from profile
let sync_args = manifest.profile_to_sync_args();
// 8. Execute sync pipeline stage-by-stage
let stage_results = execute_sync_stages(&sync_args);
// 9. Classify outcome
let core_failed = stage_results.iter()
.any(|s| (s.stage == "issues" || s.stage == "mrs") && !s.success);
let optional_failed = stage_results.iter()
.any(|s| (s.stage == "docs" || s.stage == "embeddings") && !s.success);
let all_success = stage_results.iter().all(|s| s.success);
let outcome = if all_success {
"success"
} else if !core_failed && optional_failed {
"degraded"
} else {
"failed"
};
let run = SyncRunRecord {
timestamp_iso: chrono::Utc::now().to_rfc3339(),
timestamp_ms: clock.now_ms(),
duration_seconds: start.elapsed().as_secs_f64(),
outcome: outcome.to_string(),
stage_results: stage_results.clone(),
error_message: if outcome == "failed" {
stage_results.iter()
.find(|s| !s.success)
.and_then(|s| s.error.clone())
} else {
None
},
};
status.record_run(run);
match outcome {
"success" | "degraded" => {
// Degraded does NOT count as a failure — core data is fresh
status.consecutive_failures = 0;
status.next_retry_at_ms = None;
status.paused_reason = None;
status.last_error_code = None;
status.last_error_message = None;
}
"failed" => {
let core_error = stage_results.iter()
.find(|s| (s.stage == "issues" || s.stage == "mrs") && !s.success);
// Check if the underlying error is permanent
if let Some(stage) = core_error {
if is_permanent_stage_error(stage) {
status.paused_reason = Some(format!(
"{}: {}",
stage.stage,
stage.error.as_deref().unwrap_or("unknown error")
));
status.last_error_code = Some("PERMANENT".to_string());
status.last_error_message = stage.error.clone();
// Don't increment consecutive_failures — we're pausing
} else {
status.consecutive_failures = status.consecutive_failures.saturating_add(1);
status.last_error_code = Some("TRANSIENT".to_string());
status.last_error_message = stage.error.clone();
// Circuit breaker check
if status.consecutive_failures >= manifest.max_transient_failures {
status.paused_reason = Some(format!(
"CIRCUIT_BREAKER: {} consecutive transient failures (last: {})",
status.consecutive_failures,
stage.error.as_deref().unwrap_or("unknown")
));
status.circuit_breaker_paused_at_ms = Some(clock.now_ms());
status.next_retry_at_ms = None; // paused, not backing off
} else {
// Extract retry hint from stage error if available (e.g., Retry-After header)
let retry_hint = extract_retry_after_hint(stage);
status.set_backoff(manifest.interval_seconds, &clock, &mut rng, retry_hint);
}
}
}
}
_ => unreachable!(),
}
// 10. Clear current_run (run is complete)
status.current_run = None;
// 11. Write status atomically (best-effort)
if let Err(e) = status.write_atomic(&status_path) {
tracing::warn!(error = %e, "Failed to write sync status file");
}
// 12. Release lock (drop)
drop(lock);
// 13. Print result
print_robot_json(json!({
"ok": true,
"data": {
"action": if outcome == "failed" && status.paused_reason.is_some() { "paused" } else { "sync_completed" },
"outcome": outcome,
"profile": manifest.profile,
"duration_seconds": start.elapsed().as_secs_f64(),
"stage_results": stage_results,
"consecutive_failures": status.consecutive_failures,
},
"meta": { "elapsed_ms": start.elapsed().as_millis() }
}));
Ok(())
}
Error classification helpers
/// Classify by ErrorCode (used when we have the LoreError directly)
fn is_permanent_error(e: &LoreError) -> bool {
matches!(
e.code(),
ErrorCode::TokenNotSet
| ErrorCode::AuthFailed
| ErrorCode::ConfigNotFound
| ErrorCode::ConfigInvalid
| ErrorCode::MigrationFailed
)
}
/// Classify from error_code string (primary) or error message string (fallback).
/// The error_code field is propagated through stage execution and is the
/// preferred classification mechanism. String matching on the error message
/// is a fallback for stages that don't yet propagate error_code.
fn is_permanent_stage_error(stage: &StageResult) -> bool {
// Primary: classify by machine-readable error code
if let Some(code) = &stage.error_code {
return matches!(
code.as_str(),
"TOKEN_NOT_SET" | "AUTH_FAILED" | "CONFIG_NOT_FOUND"
| "CONFIG_INVALID" | "MIGRATION_FAILED"
);
}
// Fallback: string matching (for stages that don't yet propagate error_code)
stage.error.as_deref().map_or(false, |m| {
m.contains("401 Unauthorized")
|| m.contains("TokenNotSet")
|| m.contains("ConfigNotFound")
|| m.contains("ConfigInvalid")
|| m.contains("MigrationFailed")
})
}
Implementation note: The `error_code` field on `StageResult` is the primary classification mechanism. Each stage's execution wrapper should catch `LoreError`, extract its `ErrorCode` via `.code().to_string()`, and populate the `error_code` field. The string-matching fallback exists for robustness but should not be the primary path.
Pipeline lock
The sync_pipeline lock uses the existing AppLock mechanism (same as the ingest lock). It prevents:
- Two `service run` invocations overlapping (if the scheduler fires before the previous run completes)
- A `service run` overlapping with a manual `lore sync` (the manual sync should also acquire this lock)
Change to handle_sync_cmd: Add sync_pipeline lock acquisition at the top of handle_sync_cmd as well. This is the only change to the manual sync path — no backoff, no status file writes. If the lock is already held by a service run, manual sync waits briefly then fails with a clear message ("A scheduled sync is in progress. Wait for it to complete or use --force to override.").
// In handle_sync_cmd, after config load:
let _pipeline_lock = AppLock::try_acquire("sync_pipeline", stale_lock_minutes)
.map_err(|_| LoreError::ServiceError {
message: "Another sync is in progress. Wait for it to complete or use --force.".into(),
})?;
Platform Backends
Architecture
src/cli/commands/service/platform/mod.rs exports free functions that dispatch via #[cfg(target_os)]. All functions take service_id to construct platform-specific identifiers:
pub fn install(service_id: &str, ...) -> Result<InstallResult> {
#[cfg(target_os = "macos")]
return launchd::install(service_id, ...);
#[cfg(target_os = "linux")]
return systemd::install(service_id, ...);
#[cfg(target_os = "windows")]
return schtasks::install(service_id, ...);
#[cfg(not(any(target_os = "macos", target_os = "linux", target_os = "windows")))]
return Err(LoreError::ServiceUnsupported);
}
Same pattern for uninstall(), is_installed(), get_state(), service_file_paths(), platform_name().
Architecture note: A `SchedulerBackend` trait is the target architecture for deterministic integration testing with a `FakeBackend` that simulates install/uninstall/state without touching the OS. For v1, the `#[cfg]` dispatch + `run_cmd` helper provides adequate testability — unit tests validate template generation (string output, no OS calls) and `run_cmd` captures all OS interactions with kill+reap timeout handling. The function signatures already mirror the trait shape (`install`, `uninstall`, `is_installed`, `get_state`, `service_file_paths`, `check_prerequisites`), making the trait extraction a low-risk refactoring target for v2. When extracted, the trait should be parameterized by `service_id` and return `Result<T>` for all operations.
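A minimal sketch of that v2 trait extraction, with an in-memory `FakeBackend`. Signatures are simplified stand-ins (a reduced method set, `String` errors instead of `LoreError`) — the point is the shape: every operation is keyed by `service_id`, and the fake never touches the OS:

```rust
use std::collections::HashSet;

/// Sketch of the v2 backend trait; the real one would also expose
/// get_state, service_file_paths, and check_prerequisites.
trait SchedulerBackend {
    fn install(&mut self, service_id: &str) -> Result<(), String>;
    fn uninstall(&mut self, service_id: &str) -> Result<(), String>;
    fn is_installed(&self, service_id: &str) -> bool;
    fn platform_name(&self) -> &'static str;
}

/// Test double: records installs in memory for deterministic tests.
#[derive(Default)]
struct FakeBackend {
    installed: HashSet<String>,
}

impl SchedulerBackend for FakeBackend {
    fn install(&mut self, service_id: &str) -> Result<(), String> {
        self.installed.insert(service_id.to_string());
        Ok(())
    }
    fn uninstall(&mut self, service_id: &str) -> Result<(), String> {
        if self.installed.remove(service_id) {
            Ok(())
        } else {
            Err(format!("service '{service_id}' is not installed"))
        }
    }
    fn is_installed(&self, service_id: &str) -> bool {
        self.installed.contains(service_id)
    }
    fn platform_name(&self) -> &'static str {
        "fake"
    }
}

fn main() {
    let mut backend = FakeBackend::default();
    assert!(!backend.is_installed("a1b2c3d4e5f6"));
    backend.install("a1b2c3d4e5f6").unwrap();
    assert!(backend.is_installed("a1b2c3d4e5f6"));
    backend.uninstall("a1b2c3d4e5f6").unwrap();
    // Double-uninstall surfaces as an error, mirroring real backends.
    assert!(backend.uninstall("a1b2c3d4e5f6").is_err());
}
```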
Command Runner Helper
All platform backends use a shared run_cmd helper for consistent error handling:
/// Execute a system command with timeout and stderr capture.
/// Returns stdout on success, ServiceCommandFailed on failure.
/// On timeout, kills the child process and waits to reap it (prevents zombie processes).
fn run_cmd(program: &str, args: &[&str], timeout_secs: u64) -> Result<String> {
let mut child = std::process::Command::new(program)
.args(args)
.stdout(std::process::Stdio::piped())
.stderr(std::process::Stdio::piped())
.spawn()
.map_err(|e| LoreError::ServiceCommandFailed {
cmd: format!("{} {}", program, args.join(" ")),
exit_code: None,
stderr: e.to_string(),
})?;
// Wait with timeout; on timeout kill and reap
// This prevents process leaks that can wedge repeated runs.
let output = wait_with_timeout_kill_and_reap(&mut child, timeout_secs)?;
if output.status.success() {
Ok(String::from_utf8_lossy(&output.stdout).to_string())
} else {
Err(LoreError::ServiceCommandFailed {
cmd: format!("{} {}", program, args.join(" ")),
exit_code: output.status.code(),
stderr: String::from_utf8_lossy(&output.stderr).to_string(),
})
}
}
/// Wait for child process with timeout. On timeout, sends SIGKILL and waits
/// for the process to be reaped (prevents zombie processes on Unix).
///
/// NOTE: stdout/stderr are read after exit. This is safe for scheduler commands
/// (launchctl, systemctl, schtasks) which produce small output. For commands
/// that could produce large output (>64KB), concurrent draining via threads or
/// `child.wait_with_output()` would be needed to prevent pipe backpressure deadlock.
fn wait_with_timeout_kill_and_reap(
child: &mut std::process::Child,
timeout_secs: u64,
) -> Result<std::process::Output> {
use std::time::{Duration, Instant};
let deadline = Instant::now() + Duration::from_secs(timeout_secs);
loop {
match child.try_wait() {
Ok(Some(status)) => {
let stdout = child.stdout.take().map_or(Vec::new(), |mut s| {
let mut buf = Vec::new();
std::io::Read::read_to_end(&mut s, &mut buf).unwrap_or(0);
buf
});
let stderr = child.stderr.take().map_or(Vec::new(), |mut s| {
let mut buf = Vec::new();
std::io::Read::read_to_end(&mut s, &mut buf).unwrap_or(0);
buf
});
return Ok(std::process::Output { status, stdout, stderr });
}
Ok(None) => {
if Instant::now() >= deadline {
// Timeout: kill and reap
let _ = child.kill();
let _ = child.wait(); // reap to prevent zombie
return Err(LoreError::ServiceCommandFailed {
cmd: "(timeout)".into(),
exit_code: None,
stderr: format!("Process timed out after {timeout_secs}s"),
});
}
std::thread::sleep(Duration::from_millis(100));
}
Err(e) => return Err(LoreError::ServiceCommandFailed {
cmd: "(wait)".into(),
exit_code: None,
stderr: e.to_string(),
}),
}
}
}
This ensures all launchctl, systemctl, and schtasks failures produce consistent, machine-readable errors with the exact command, exit code, and stderr captured.
Token Storage Helper
/// Write token to a user-private env file, scoped by service_id.
/// Returns the path to the env file.
///
/// Rejects tokens containing NUL bytes or newlines to prevent env-file injection.
/// The token is written as a raw value (not shell-quoted) and read via `cat` in
/// the wrapper script, never `source`d or `eval`d.
fn write_token_env_file(
data_dir: &Path,
service_id: &str,
token_env_var: &str,
token_value: &str,
) -> Result<PathBuf> {
// Validate token content — reject values that could break env-file format
if token_value.contains('\0') || token_value.contains('\n') || token_value.contains('\r') {
return Err(LoreError::ServiceError {
message: "Token contains NUL or newline characters, which are not safe for env-file storage. \
Use --token-source embedded instead.".into(),
});
}
let env_path = data_dir.join(format!("service-env-{service_id}"));
let content = format!("{}={}\n", token_env_var, token_value);
    // Write atomically: tmp file + fsync + rename
    let tmp_path = env_path.with_extension("tmp");
    {
        use std::io::Write;
        let mut f = std::fs::File::create(&tmp_path)?;
        f.write_all(content.as_bytes())?;
        f.sync_all()?; // fsync before rename so a crash never leaves a partial env file
    }
// Set permissions to 0600 (owner read/write only) BEFORE rename
#[cfg(unix)]
{
use std::os::unix::fs::PermissionsExt;
std::fs::set_permissions(&tmp_path, std::fs::Permissions::from_mode(0o600))?;
}
std::fs::rename(&tmp_path, &env_path)?;
Ok(env_path)
}
Function signatures
pub struct InstallResult {
pub platform: String,
pub service_id: String,
pub interval_seconds: u64,
pub profile: String,
pub binary_path: String,
pub config_path: Option<String>,
pub service_files: Vec<String>,
pub sync_command: String,
pub token_env_var: String,
pub token_source: String, // "env_file", "embedded", or "system_env"
}
pub struct UninstallResult {
pub was_installed: bool,
pub service_id: String,
pub platform: String,
pub removed_files: Vec<String>,
}
pub fn install(
service_id: &str,
binary_path: &str,
config_path: Option<&str>,
interval_seconds: u64,
profile: &str,
token_env_var: &str,
token_value: &str,
token_source: &str,
log_dir: &Path,
data_dir: &Path,
) -> Result<InstallResult>;
pub fn uninstall(service_id: &str) -> Result<UninstallResult>;
pub fn is_installed(service_id: &str) -> bool;
pub fn get_state(service_id: &str) -> Option<String>; // "loaded", "running", etc.
pub fn service_file_paths(service_id: &str) -> Vec<PathBuf>;
pub fn platform_name() -> &'static str;
/// Pre-flight check for platform-specific prerequisites.
/// Returns a list of diagnostic results.
pub fn check_prerequisites() -> Vec<DiagnosticCheck>;
pub struct DiagnosticCheck {
pub name: String,
pub status: DiagnosticStatus, // Pass, Warn, Fail
pub message: Option<String>,
pub action: Option<String>, // Suggested fix command
}
macOS: launchd (platform/launchd.rs)
Service file: ~/Library/LaunchAgents/com.gitlore.sync.{service_id}.plist
Label: com.gitlore.sync.{service_id}
Wrapper script approach: launchd cannot natively load environment files. Instead of embedding the token directly in the plist (which would persist it in a readable XML file), we generate a small wrapper shell script that reads the env file at runtime and execs lore. This keeps the token out of the plist entirely for the env-file strategy.
Wrapper script ({data_dir}/service-run-{service_id}.sh, mode 0700):
#!/bin/sh
# Generated by lore service install — do not edit
set -e
# Read token from env file (KEY=VALUE format) — never source/eval untrusted content
{token_env_var}="$(sed -n 's/^{token_env_var}=//p' "{data_dir}/service-env-{service_id}")"
export {token_env_var}
{config_export_line}
exec "{binary_path}" --robot service run --service-id "{service_id}"
Where {config_export_line} is either empty or export LORE_CONFIG_PATH="{config_path}".
Plist template (generated via format!(), no crate needed):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.gitlore.sync.{service_id}</string>
<key>ProgramArguments</key>
<array>
{program_arguments}
</array>
{env_dict}
<key>StartInterval</key>
<integer>{interval_seconds}</integer>
<key>RunAtLoad</key>
<true/>
<key>ProcessType</key>
<string>Background</string>
<key>Nice</key>
<integer>10</integer>
<key>LowPriorityIO</key>
<true/>
<key>StandardOutPath</key>
<string>{log_dir}/service-{service_id}-stdout.log</string>
<key>StandardErrorPath</key>
<string>{log_dir}/service-{service_id}-stderr.log</string>
<key>TimeOut</key>
<integer>600</integer>
</dict>
</plist>
Where {program_arguments} and {env_dict} depend on token_source:
- env-file (default): The plist invokes the wrapper script instead of lore directly. No token appears in the plist. {program_arguments} is:

  <string>{data_dir}/service-run-{service_id}.sh</string>

  {env_dict} is empty (the wrapper script handles environment setup).

- embedded: The plist invokes lore directly with the token embedded in EnvironmentVariables. {program_arguments} is:

  <string>{binary_path}</string>
  <string>--robot</string>
  <string>service</string>
  <string>run</string>
  <string>--service-id</string>
  <string>{service_id}</string>

  {env_dict} is:

  <key>EnvironmentVariables</key>
  <dict>
    <key>{token_env_var}</key>
    <string>{token_value}</string>
    {config_env_entry}
  </dict>
Where {config_env_entry} is either empty or:
<key>LORE_CONFIG_PATH</key>
<string>{config_path}</string>
XML escaping: The token value and paths must be XML-escaped. Write a helper fn xml_escape(s: &str) -> String that replaces &, <, >, ", ' with their XML entity equivalents. This is critical — tokens can contain & or <.
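A minimal sketch of that helper. Iterating per character (rather than chained `str::replace` calls) sidesteps double-escaping: each input character is classified exactly once, so an `&` produced by an earlier substitution can never be re-escaped.

```rust
// Escape the five XML special characters for safe embedding in plist
// <string> elements. Each input char is examined exactly once.
fn xml_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&apos;"),
            _ => out.push(c),
        }
    }
    out
}
```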
Install steps:
- std::fs::create_dir_all(plist_path.parent())
- std::fs::write(&plist_path, plist_content)
- Try launchctl bootstrap gui/{uid} {plist_path} via std::process::Command
- If that fails (older macOS), fall back to launchctl load {plist_path}
- Get UID via safe wrapper: fn current_uid() -> u32 { unsafe { libc::getuid() } } — isolated in a single-line function with an #[allow(unsafe_code)] exemption, since getuid() is trivially safe (no pointers, no mutation, always succeeds). Alternatively, use the nix crate's nix::unistd::Uid::current() if it is already a dependency.
Uninstall steps:
- Try launchctl bootout gui/{uid}/com.gitlore.sync.{service_id}
- If that fails, try launchctl unload {plist_path}
- std::fs::remove_file(&plist_path) (ignore the error if the file doesn't exist)
State detection:
- is_installed(service_id): check if the plist file exists on disk
- get_state(service_id): run launchctl list com.gitlore.sync.{service_id}, parse the exit code (0 = loaded, non-0 = not loaded)
- get_interval_seconds(service_id): read the plist file, find <key>StartInterval</key> then the next <integer> value via simple string search (no XML parser needed)
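The string-search extraction for get_interval_seconds could look like the sketch below. parse_start_interval is a hypothetical helper name; it assumes the <integer> element immediately follows the StartInterval key, which holds for our generated template but is not valid for arbitrary plists.

```rust
// Extract StartInterval from a generated plist via plain string search.
// Returns None if the key or a following <integer> element is missing.
fn parse_start_interval(plist: &str) -> Option<u64> {
    let key_pos = plist.find("<key>StartInterval</key>")?;
    let rest = &plist[key_pos..];
    // Locate the integer element that follows the key.
    let open = rest.find("<integer>")? + "<integer>".len();
    let close = rest[open..].find("</integer>")? + open;
    rest[open..close].trim().parse().ok()
}
```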
Platform prerequisites (check_prerequisites):
- Verify running in a GUI login session: check that launchctl print gui/{uid} succeeds. In SSH-only or headless contexts, launchd user agents won't load — return Warn with action "Log in via GUI or use SSH with ForwardAgent".
- This is a warning, not a hard block — some macOS setups (like launchctl asuser) can work around it.
Linux: systemd (platform/systemd.rs)
Service files:
- ~/.config/systemd/user/lore-sync-{service_id}.service
- ~/.config/systemd/user/lore-sync-{service_id}.timer
Service unit (hardened):
[Unit]
Description=Gitlore GitLab data sync ({service_id})
[Service]
Type=oneshot
ExecStart={binary_path} --robot service run --service-id {service_id}
WorkingDirectory={data_dir}
SuccessExitStatus=0
TimeoutStartSec=900
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths={data_dir}
{token_env_line}
{config_env_line}
Where {token_env_line} depends on token_source:
- env-file: EnvironmentFile={data_dir}/service-env-{service_id} (systemd natively supports this — true file-based loading, no embedding)
- embedded: Environment={token_env_var}={token_value}
Where {config_env_line} is either empty or Environment=LORE_CONFIG_PATH={config_path}.
Hardening notes:
- TimeoutStartSec=900 — kills stuck syncs after 15 minutes (generous but bounded)
- NoNewPrivileges=true — prevents privilege escalation
- PrivateTmp=true — isolated /tmp
- ProtectSystem=strict — read-only filesystem except explicitly allowed paths
- ProtectHome=read-only — read-only home directory
- ReadWritePaths={data_dir} — allows writing to the lore data directory (status files, logs, DB)
Timer unit:
[Unit]
Description=Gitlore sync timer ({service_id})
[Timer]
OnBootSec=5min
OnUnitInactiveSec={interval_seconds}s
AccuracySec=1min
Persistent=true
RandomizedDelaySec=60
[Install]
WantedBy=timers.target
Install steps:
- std::fs::create_dir_all(unit_dir)
- Write both files
- Run systemctl --user daemon-reload
- Run systemctl --user enable --now lore-sync-{service_id}.timer
Uninstall steps:
- Run systemctl --user disable --now lore-sync-{service_id}.timer (ignore error)
- Remove both files
- Run systemctl --user daemon-reload
State detection:
- is_installed(service_id): check if the timer file exists
- get_state(service_id): run systemctl --user is-active lore-sync-{service_id}.timer, capture stdout ("active", "inactive", etc.)
- get_interval_seconds(service_id): read the timer file, parse the OnUnitInactiveSec value
Platform prerequisites (check_prerequisites):
- User manager running: check that systemctl --user status exits 0. If not, return Fail with message "systemd user manager not running. Start a user session or contact your system administrator."
- Linger enabled: check that loginctl show-user $(whoami) --property=Linger returns Linger=yes. If not, return Warn with message "loginctl linger not enabled. Timer will not fire on reboot without an active login session." and action loginctl enable-linger $(whoami). This is a warning, not a block — the timer works fine when the user is logged in.
Windows: schtasks (platform/schtasks.rs)
Task name: LoreSync-{service_id}
Install:
schtasks /create /tn "LoreSync-{service_id}" /tr "\"{binary_path}\" --robot service run --service-id {service_id}" /sc minute /mo {interval_minutes} /f
Note: /mo requires minutes, so convert seconds to minutes (round up). Minimum is 1 minute (but we enforce 5 minutes at the parse level).
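The round-up conversion could be sketched as follows. interval_to_minutes is a hypothetical helper name; the clamp to 1 reflects the schtasks minimum, while the 5-minute floor is enforced earlier by parse_interval.

```rust
// Convert an interval in seconds to whole schtasks minutes, rounding up.
// schtasks /mo accepts a minimum of 1 minute, so clamp the result.
fn interval_to_minutes(seconds: u64) -> u64 {
    std::cmp::max(1, (seconds + 59) / 60)
}
```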
Token handling on Windows: The env var must be set system-wide via setx or be present in the user's environment. Neither env-file nor embedded strategies apply — Windows scheduled tasks inherit the user's environment. Set token_source: "system_env" in the result and document this as a requirement.
Uninstall:
schtasks /delete /tn "LoreSync-{service_id}" /f
State detection:
- is_installed(service_id): run schtasks /query /tn "LoreSync-{service_id}", check the exit code (0 = exists)
- get_state(service_id): parse the output of schtasks /query /tn "LoreSync-{service_id}" /fo CSV /v, extract the "Status" column
- get_interval_seconds(service_id): parse "Repeat: Every" from the verbose output, or store the value ourselves
Platform prerequisites (check_prerequisites):
- Verify schtasks is available: run schtasks /? and check the exit code. Return Fail if not found.
Interval Parsing
/// Parse interval strings like "15m", "1h", "30m", "2h", "24h"
/// Only minutes (m) and hours (h) are accepted — seconds are not exposed
/// because the minimum interval is 5 minutes and sub-minute granularity
/// would be confusing for a scheduled sync.
pub fn parse_interval(input: &str) -> std::result::Result<u64, String> {
let input = input.trim();
let (num_str, multiplier) = if let Some(n) = input.strip_suffix('m') {
(n, 60u64)
} else if let Some(n) = input.strip_suffix('h') {
(n, 3600u64)
} else {
return Err(format!(
"Invalid interval '{input}'. Use format like 15m, 30m, 1h, 2h"
));
};
let num: u64 = num_str
.parse()
.map_err(|_| format!("Invalid number in interval: '{num_str}'"))?;
if num == 0 {
return Err("Interval must be greater than 0".to_string());
}
let seconds = num * multiplier;
if seconds < 300 {
return Err(format!(
"Minimum interval is 5m (got {input}, which is {seconds}s)"
));
}
if seconds > 86400 {
return Err(format!(
"Maximum interval is 24h (got {input}, which is {seconds}s)"
));
}
Ok(seconds)
}
Error Types
Additions to src/core/error.rs
ErrorCode enum:
ServiceError, // Add after Ambiguous
ServiceCommandFailed, // OS command (launchctl/systemctl/schtasks) failed
ServiceCorruptState, // Manifest or status file is corrupt/unparseable
ErrorCode::exit_code():
Self::ServiceError => 21,
Self::ServiceCommandFailed => 22,
Self::ServiceCorruptState => 23,
ErrorCode::Display:
Self::ServiceError => "SERVICE_ERROR",
Self::ServiceCommandFailed => "SERVICE_COMMAND_FAILED",
Self::ServiceCorruptState => "SERVICE_CORRUPT_STATE",
LoreError enum:
#[error("Service error: {message}")]
ServiceError { message: String },
#[error("Service management not supported on this platform. Requires macOS (launchd), Linux (systemd), or Windows (schtasks).")]
ServiceUnsupported,
#[error("Service command failed: {cmd} (exit {exit_code:?}): {stderr}")]
ServiceCommandFailed {
cmd: String,
exit_code: Option<i32>,
stderr: String,
},
#[error("Service state file corrupt: {path}: {reason}")]
ServiceCorruptState {
path: String,
reason: String,
},
LoreError::code():
Self::ServiceError { .. } => ErrorCode::ServiceError,
Self::ServiceUnsupported => ErrorCode::ServiceError,
Self::ServiceCommandFailed { .. } => ErrorCode::ServiceCommandFailed,
Self::ServiceCorruptState { .. } => ErrorCode::ServiceCorruptState,
LoreError::suggestion():
Self::ServiceError { .. } => Some("Check service status: lore service status\nRun diagnostics: lore service doctor\nView logs: lore service logs"),
Self::ServiceUnsupported => Some("Requires macOS (launchd), Linux (systemd), or Windows (schtasks)"),
Self::ServiceCommandFailed { .. } => Some("Check service logs: lore service logs\nRun diagnostics: lore service doctor\nTry reinstalling: lore service install"),
Self::ServiceCorruptState { .. } => Some("Run: lore service repair\nThen reinstall if needed: lore service install"),
LoreError::actions():
Self::ServiceError { .. } => vec!["lore service status", "lore service doctor", "lore service logs"],
Self::ServiceUnsupported => vec![],
Self::ServiceCommandFailed { .. } => vec!["lore service logs", "lore service doctor", "lore service install"],
Self::ServiceCorruptState { .. } => vec!["lore service repair", "lore service install"],
CLI Definition Changes
src/cli/mod.rs
Add to Commands enum (after Who(WhoArgs), before the hidden commands):
/// Manage the OS-native scheduled sync service
Service {
#[command(subcommand)]
command: ServiceCommand,
},
Add the ServiceCommand enum (can be in the same file or re-exported from service/mod.rs):
#[derive(Subcommand)]
pub enum ServiceCommand {
/// Install the scheduled sync service
Install {
/// Sync interval (e.g., 15m, 30m, 1h). Default: 30m. Min: 5m. Max: 24h.
#[arg(long, default_value = "30m")]
interval: String,
/// Sync profile: fast (issues+MRs), balanced (+ docs), full (+ embeddings)
#[arg(long, default_value = "balanced")]
profile: String,
/// Token storage: env-file (default, 0600 perms) or embedded (in service file)
#[arg(long, default_value = "env-file")]
token_source: String,
/// Custom service name (default: derived from config path hash).
/// Useful when managing multiple installations for readability.
#[arg(long)]
name: Option<String>,
/// Validate and render service files without writing or executing anything
#[arg(long)]
dry_run: bool,
},
/// Remove the scheduled sync service
Uninstall {
/// Target a specific service by ID or name (default: current project)
#[arg(long)]
service: Option<String>,
/// Uninstall all services
#[arg(long)]
all: bool,
},
/// List all installed services
List,
/// Show service status and last sync result
Status {
/// Target a specific service by ID or name (default: current project)
#[arg(long)]
service: Option<String>,
},
/// View service logs
Logs {
/// Show last N lines (default: 100)
#[arg(long)]
tail: Option<Option<usize>>,
/// Stream new log lines as they arrive (like tail -f)
#[arg(long)]
follow: bool,
/// Open log file in editor instead of printing to stdout
#[arg(long)]
open: bool,
/// Target a specific service by ID or name
#[arg(long)]
service: Option<String>,
},
/// Clear paused state and reset failure counter
Resume {
/// Target a specific service by ID or name (default: current project)
#[arg(long)]
service: Option<String>,
},
/// Pause scheduled execution without uninstalling
Pause {
/// Reason for pausing (shown in status output)
#[arg(long)]
reason: Option<String>,
/// Target a specific service by ID or name (default: current project)
#[arg(long)]
service: Option<String>,
},
/// Trigger an immediate one-off sync using installed profile
Trigger {
/// Bypass backoff window (still respects paused state)
#[arg(long)]
ignore_backoff: bool,
/// Target a specific service by ID or name (default: current project)
#[arg(long)]
service: Option<String>,
},
/// Repair corrupt manifest or status files
Repair {
/// Target a specific service by ID or name (default: current project)
#[arg(long)]
service: Option<String>,
},
/// Validate service environment and prerequisites
Doctor {
/// Skip network checks (token validation)
#[arg(long)]
offline: bool,
/// Attempt safe, non-destructive fixes for detected issues
#[arg(long)]
fix: bool,
},
/// Execute one scheduled sync attempt (called by OS scheduler, hidden from help)
#[command(hide = true)]
Run {
/// Internal selector injected by scheduler backend — identifies which
/// service manifest and status file to use for this run.
#[arg(long, hide = true)]
service_id: String,
},
}
src/cli/commands/mod.rs
Add:
pub mod service;
No re-exports needed — the dispatch goes through service::handle_install, etc. directly.
src/main.rs dispatch
Add import:
use lore::cli::ServiceCommand;
Add match arm (before the hidden commands):
Some(Commands::Service { command }) => {
handle_service(cli.config.as_deref(), command, robot_mode)
}
Add handler function:
fn handle_service(
config_override: Option<&str>,
command: ServiceCommand,
robot_mode: bool,
) -> Result<(), Box<dyn std::error::Error>> {
let start = std::time::Instant::now();
match command {
ServiceCommand::Install { interval, profile, token_source, name, dry_run } => {
lore::cli::commands::service::handle_install(
config_override, &interval, &profile, &token_source, name.as_deref(),
dry_run, robot_mode, start,
)
}
ServiceCommand::Uninstall { service, all } => {
lore::cli::commands::service::handle_uninstall(service.as_deref(), all, robot_mode, start)
}
ServiceCommand::List => {
lore::cli::commands::service::handle_list(robot_mode, start)
}
ServiceCommand::Status { service } => {
lore::cli::commands::service::handle_status(config_override, service.as_deref(), robot_mode, start)
}
ServiceCommand::Logs { tail, follow, open, service } => {
lore::cli::commands::service::handle_logs(tail, follow, open, service.as_deref(), robot_mode, start)
}
ServiceCommand::Resume { service } => {
lore::cli::commands::service::handle_resume(service.as_deref(), robot_mode, start)
}
ServiceCommand::Pause { reason, service } => {
lore::cli::commands::service::handle_pause(service.as_deref(), reason.as_deref(), robot_mode, start)
}
ServiceCommand::Trigger { ignore_backoff, service } => {
lore::cli::commands::service::handle_trigger(service.as_deref(), ignore_backoff, robot_mode, start)
}
ServiceCommand::Repair { service } => {
lore::cli::commands::service::handle_repair(service.as_deref(), robot_mode, start)
}
ServiceCommand::Doctor { offline, fix } => {
lore::cli::commands::service::handle_doctor(config_override, offline, fix, robot_mode, start)
}
ServiceCommand::Run { service_id } => {
// Always robot mode for scheduled execution
lore::cli::commands::service::handle_service_run(&service_id, start)
}
}
}
Autocorrect Registry
src/cli/autocorrect.rs
Add to COMMAND_FLAGS array (before the hidden commands):
("service", &["--interval", "--profile", "--token-source", "--name", "--dry-run", "--tail", "--follow", "--open", "--offline", "--fix", "--service", "--all", "--reason", "--ignore-backoff"]),
Important: The registry_covers_command_flags test in autocorrect.rs uses clap introspection to verify all flags are registered. Since service is a nested subcommand, verify whether this test recurses into subcommands. If it does, the test will fail without this entry. If it doesn't recurse (only checks top-level subcommands), the test passes but we should still add the entry for correctness.
Looking at the test (lines 868-908): it iterates cmd.get_subcommands() which gets the top-level subcommands. The Service variant uses #[command(subcommand)] which means clap will show service as a subcommand with its own sub-subcommands. The test won't recurse into install's flags, but service itself has no direct flags (only subcommands do), so an empty entry or omission would pass the test. Adding ("service", &["--interval"]) is conservative and correct — the --interval flag lives on the install sub-subcommand but won't cause issues.
However, detect_subcommand only finds the first positional arg. For lore service install --intervl 30m, it returns "service", not "install". So the --interval flag needs to be registered under "service" for fuzzy matching.
robot-docs Manifest
Addition to handle_robot_docs in src/main.rs
Add to the commands JSON object:
"service": {
"description": "Manage OS-native scheduled sync service",
"subcommands": {
"install": {
"description": "Install scheduled sync service",
"flags": ["--interval <duration>", "--profile <fast|balanced|full>", "--token-source <env-file|embedded>", "--name <optional>", "--dry-run"],
"defaults": { "interval": "30m", "profile": "balanced", "token_source": "env-file" },
"example": "lore --robot service install --interval 15m --profile fast",
"response_schema": {
"ok": "bool",
"data.platform": "string (launchd|systemd|schtasks)",
"data.service_id": "string",
"data.interval_seconds": "number",
"data.profile": "string",
"data.binary_path": "string",
"data.service_files": "[string]",
"data.token_source": "string (env_file|embedded|system_env)",
"data.no_change": "bool"
}
},
"uninstall": {
"description": "Remove scheduled sync service",
"flags": ["--service <service_id|name>", "--all"],
"example": "lore --robot service uninstall",
"response_schema": {
"ok": "bool",
"data.was_installed": "bool",
"data.service_id": "string",
"data.removed_files": "[string]"
}
},
"list": {
"description": "List all installed services",
"example": "lore --robot service list",
"response_schema": {
"ok": "bool",
"data.services": "[{service_id, platform, interval_seconds, profile, installed_at_iso, platform_state, drift}]"
}
},
"status": {
"description": "Show service status, scheduler state, and recent runs",
"flags": ["--service <service_id|name>"],
"example": "lore --robot service status",
"response_schema": {
"ok": "bool",
"data.installed": "bool",
"data.service_id": "string|null",
"data.platform": "string",
"data.interval_seconds": "number|null",
"data.profile": "string|null",
"data.scheduler_state": "string (idle|running|running_stale|degraded|backoff|half_open|paused|not_installed)",
"data.last_sync": "SyncRunRecord|null",
"data.recent_runs": "[SyncRunRecord]",
"data.backoff": "object|null",
"data.paused_reason": "string|null",
"data.drift": "object|null {platform_drift: bool, spec_drift: bool, command_drift: bool}"
}
},
"logs": {
"description": "View service logs (human: editor/tail, robot: path + optional lines)",
"flags": ["--tail <n>", "--follow"],
"example": "lore --robot service logs --tail 50",
"response_schema": {
"ok": "bool",
"data.log_path": "string",
"data.exists": "bool",
"data.size_bytes": "number",
"data.last_lines": "[string]|null"
}
},
"resume": {
"description": "Clear paused state and reset failure counter",
"example": "lore --robot service resume",
"response_schema": {
"ok": "bool",
"data.was_paused": "bool",
"data.previous_reason": "string|null",
"data.consecutive_failures_cleared": "number"
}
},
"pause": {
"description": "Pause scheduled execution without uninstalling",
"flags": ["--reason <text>", "--service <service_id|name>"],
"example": "lore --robot service pause --reason 'maintenance'",
"response_schema": {
"ok": "bool",
"data.service_id": "string",
"data.paused": "bool",
"data.reason": "string",
"data.already_paused": "bool"
}
},
"trigger": {
"description": "Trigger immediate one-off sync using installed profile",
"flags": ["--ignore-backoff", "--service <service_id|name>"],
"example": "lore --robot service trigger",
"response_schema": "Same as service run output"
},
"repair": {
"description": "Repair corrupt manifest or status files",
"flags": ["--service <service_id|name>"],
"example": "lore --robot service repair",
"response_schema": {
"ok": "bool",
"data.repaired": "bool",
"data.actions": "[{file, action, backup?}]",
"data.needs_reinstall": "bool"
}
},
"doctor": {
"description": "Validate service environment and prerequisites",
"flags": ["--offline", "--fix"],
"example": "lore --robot service doctor",
"response_schema": {
"ok": "bool",
"data.checks": "[{name, status, message?, action?}]",
"data.overall": "string (pass|warn|fail)"
}
}
}
}
Paths Module Additions
src/core/paths.rs
pub fn get_service_status_path(service_id: &str) -> PathBuf {
get_data_dir().join(format!("sync-status-{service_id}.json"))
}
pub fn get_service_manifest_path(service_id: &str) -> PathBuf {
get_data_dir().join(format!("service-manifest-{service_id}.json"))
}
pub fn get_service_env_path(service_id: &str) -> PathBuf {
get_data_dir().join(format!("service-env-{service_id}"))
}
pub fn get_service_wrapper_path(service_id: &str) -> PathBuf {
get_data_dir().join(format!("service-run-{service_id}.sh"))
}
pub fn get_service_log_path(service_id: &str, stream: &str) -> PathBuf {
get_data_dir().join("logs").join(format!("service-{service_id}-{stream}.log"))
}
// stream values: "stdout" or "stderr"
// Example: get_service_log_path("a1b2c3d4e5f6", "stderr")
// => ~/.local/share/lore/logs/service-a1b2c3d4e5f6-stderr.log
/// List all installed service IDs by scanning for manifest files.
pub fn list_service_ids() -> Vec<String> {
    let data_dir = get_data_dir();
    // No data directory yet means no services are installed.
    let Ok(entries) = std::fs::read_dir(&data_dir) else {
        return Vec::new();
    };
    entries
        .filter_map(|entry| {
            let name = entry.ok()?.file_name().to_string_lossy().to_string();
            name.strip_prefix("service-manifest-")
                .and_then(|s| s.strip_suffix(".json"))
                .map(String::from)
        })
        .collect()
}
Note: Status files are scoped by service_id — each installed service gets independent backoff/paused/circuit-breaker state. The pipeline lock remains global (sync_pipeline) to prevent overlapping writes to the shared database.
Core Module Registration
src/core/mod.rs
Add:
pub mod sync_status;
pub mod service_manifest;
File-by-File Implementation Details
src/core/sync_status.rs (NEW)
- SyncRunRecord struct with Serialize + Deserialize + Clone
- StageResult struct with Serialize + Deserialize + Clone
- SyncStatusFile struct with Serialize + Deserialize + Default (schema_version=1)
- Clock trait + SystemClock impl (for deterministic testing)
- JitterRng trait + ThreadJitterRng impl (for deterministic jitter testing)
- parse_interval(input: &str) -> Result<u64, String>
- SyncStatusFile::read(path: &Path) -> Result<Option<Self>, LoreError> — distinguishes missing from corrupt
- SyncStatusFile::write_atomic(&self, path: &Path) -> std::io::Result<()> — tmp+fsync+rename
- SyncStatusFile::record_run(&mut self, run: SyncRunRecord) — push to recent_runs (capped at 10)
- SyncStatusFile::clear_paused(&mut self) — reset paused_reason, errors, failures, next_retry_at_ms
- SyncStatusFile::backoff_remaining(&self, clock: &dyn Clock) -> Option<u64> — reads persisted next_retry_at_ms
- SyncStatusFile::set_backoff(&mut self, base_interval_seconds, clock, rng) — compute and persist next_retry_at_ms
- fn is_permanent_error(code: &ErrorCode) -> bool
- fn is_permanent_stage_error(stage: &StageResult) -> bool — primary: error_code, fallback: string matching
- SyncStatusFile::is_circuit_breaker_half_open(&self, manifest: &ServiceManifest, clock: &dyn Clock) -> bool — checks if the cooldown has expired
- Unit tests for all of the above
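The exponential-backoff arithmetic inside set_backoff can be sketched as a pure function, which keeps it trivially unit-testable with the Clock and JitterRng fakes. This is a minimal sketch: the function name, the exponent cap of 6, the 24h ceiling, and the jitter fraction are illustrative assumptions, not decided values.

```rust
// Sketch of the delay computation behind set_backoff:
// delay = base_interval * 2^failures, capped, plus optional jitter.
fn backoff_delay_secs(base_interval_secs: u64, consecutive_failures: u32, jitter_frac: f64) -> u64 {
    // Cap the exponent so repeated failures can't overflow or produce
    // absurd waits; 2^6 = 64x the base interval is already generous.
    let exp = consecutive_failures.min(6);
    let raw = base_interval_secs.saturating_mul(1u64 << exp);
    // Never schedule the next retry more than 24h out.
    let capped = raw.min(24 * 3600);
    // Jitter spreads retries from multiple installs; the caller supplies
    // a fraction in [0, jitter_frac) from the JitterRng trait.
    let jitter = (capped as f64 * jitter_frac) as u64;
    capped + jitter
}
```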
src/core/service_manifest.rs (NEW)
- ServiceManifest struct with Serialize + Deserialize (schema_version=1), includes workspace_root and spec_hash
- ServiceManifest::read(path: &Path) -> Result<Option<Self>, LoreError> — distinguishes missing from corrupt
- ServiceManifest::write_atomic(&self, path: &Path) -> std::io::Result<()> — tmp+fsync+rename
- ServiceManifest::profile_to_sync_args(&self) -> Vec<String> — maps profile to sync CLI flags
- compute_service_id(workspace_root: &Path, config_path: &Path, project_urls: &[&str]) -> String — composite fingerprint (workspace root + config path + sorted project URLs), first 12 hex chars of SHA-256
- sanitize_service_name(name: &str) -> Result<String, String> — [a-z0-9-], max 32 chars
- DiagnosticCheck struct, DiagnosticStatus enum (Pass/Warn/Fail)
- Unit tests for profile mapping, service_id computation, name sanitization
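A sketch of sanitize_service_name under the constraints above. Lowercasing input (rather than rejecting uppercase outright) is an assumption made here for illustration; the final implementation may prefer to reject.

```rust
// Validate a user-supplied --name: lowercase, [a-z0-9-] only, 1-32 chars.
// Returns the normalized name or a human-readable error string.
fn sanitize_service_name(name: &str) -> Result<String, String> {
    let name = name.trim().to_lowercase();
    if name.is_empty() || name.len() > 32 {
        return Err("Service name must be 1-32 characters".to_string());
    }
    if !name
        .chars()
        .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-')
    {
        return Err(format!("Invalid service name '{name}': only [a-z0-9-] allowed"));
    }
    Ok(name)
}
```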
src/cli/commands/service/mod.rs (NEW)
- Re-exports from submodules: handle_install, handle_uninstall, handle_list, handle_status, handle_logs, handle_resume, handle_pause, handle_trigger, handle_repair, handle_doctor, handle_service_run
- Shared resolve_service_id(selector: Option<&str>, config_override: Option<&str>) -> Result<String> helper: resolves the --service flag, or derives the ID from the current config path. If multiple services exist and no selector is provided, returns an actionable error listing the available services.
- Shared acquire_admin_lock(service_id: &str) -> Result<AppLock> helper: acquires AppLock("service-admin-{service_id}") for state mutation commands. Used by install, uninstall, pause, resume, and repair. NOT used by service run (which only acquires sync_pipeline).
- Imports from submodules
src/cli/commands/service/install.rs (NEW)
- handle_install(config_override, interval_str, profile, token_source, name, dry_run, robot_mode, start) -> Result<()>
- Validates profile is one of fast|balanced|full
- Validates token_source is one of env-file|embedded
- Computes or validates service_id from --name or the composite fingerprint (workspace root + config path + project URLs). If --name is provided and collides with an existing service with a different identity hash, returns an actionable error.
- Acquires the admin lock AppLock("service-admin-{service_id}") before mutating any files
- Runs doctor pre-flight checks; aborts on any Fail result
- Loads config, resolves token, resolves binary path
- Writes token to env file (if env-file strategy, scoped by service_id)
- On macOS with env-file: generates wrapper script at {data_dir}/service-run-{service_id}.sh (mode 0700)
- Calls platform::install(service_id, ...)
- Transactional: on enable success, writes the install manifest atomically. On enable failure, removes generated service files and the wrapper script, and returns ServiceCommandFailed.
- Compares with the existing manifest to detect the no-change case
- Prints result (robot JSON or human-readable)
src/cli/commands/service/uninstall.rs (NEW)
- `handle_uninstall(service_selector, all, robot_mode, start) -> Result<()>`
- Resolves target service via selector or current-project default
- With `--all`: iterates all discovered manifests
- Reads manifest to find service_id
- Calls `platform::uninstall(service_id)`
- Removes install manifest (`service-manifest-{service_id}.json`)
- Removes env file (`service-env-{service_id}`) if exists
- Removes wrapper script (`service-run-{service_id}.sh`) if exists (macOS)
- Does NOT remove the status file or log files (those are operational data, not config)
- Outputs confirmation
src/cli/commands/service/status.rs (NEW)
- `handle_status(config_override, robot_mode, start) -> Result<()>`
- Reads install manifest (primary source for config and service_id)
- Calls `platform::is_installed(service_id)`, `get_state(service_id)` to verify platform state
- Detects drift: platform drift (loaded/unloaded), spec drift (content hash vs `spec_hash`), command drift
- Reads `SyncStatusFile` for last sync and recent runs
- Detects stale runs via `current_run` metadata: checks if PID is alive and `started_at_ms` is within 30 minutes
- Computes scheduler state from status + manifest (including `degraded`, `running_stale`)
- Computes backoff info from persisted `next_retry_at_ms`
- Prints combined status
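The stale-run check can be sketched as a pure predicate. The 30-minute threshold comes from the plan; the `/proc`-based liveness probe is a Linux-only assumption for illustration (a portable implementation would send signal 0 to the PID instead):

```rust
use std::path::Path;

/// 30-minute staleness threshold from the plan.
const STALE_AFTER_MS: i64 = 30 * 60 * 1000;

/// Liveness probe via /proc — Linux-only assumption for this sketch.
fn pid_alive(pid: u32) -> bool {
    Path::new(&format!("/proc/{pid}")).exists()
}

/// A recorded current_run is stale if its process is gone, or if it has
/// been running longer than the threshold (likely hung).
fn run_is_stale(pid: u32, started_at_ms: i64, now_ms: i64) -> bool {
    !pid_alive(pid) || now_ms - started_at_ms > STALE_AFTER_MS
}
```

Keeping the predicate free of clock and process lookups at the call site (both are passed in or trivially stubbed) keeps it testable with the same FakeClock pattern used elsewhere in this plan.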
src/cli/commands/service/logs.rs (NEW)
- `handle_logs(tail, follow, robot_mode, start) -> Result<()>`
- `--tail`: read last N lines, output directly to stdout (or as JSON array in robot mode)
- `--follow`: stream new lines (human mode only; robot mode returns error)
- Default (no flags): print last 100 lines to stdout (human) or return path metadata (robot)
- Robot mode with `--tail`: includes `last_lines` field (capped at 100)
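A minimal sketch of the `--tail` read path. Reading the whole file is acceptable here because the service's log rotation keeps files bounded; a production version might seek backward from the end instead:

```rust
/// Return the last `n` lines of a log file (sketch; assumes rotation
/// keeps the file small enough to read whole).
fn tail_lines(path: &std::path::Path, n: usize) -> std::io::Result<Vec<String>> {
    let content = std::fs::read_to_string(path)?;
    let lines: Vec<&str> = content.lines().collect();
    // Keep only the final n lines (or all of them, if fewer).
    let start = lines.len().saturating_sub(n);
    Ok(lines[start..].iter().map(|s| s.to_string()).collect())
}
```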
src/cli/commands/service/resume.rs (NEW)
- `handle_resume(robot_mode, start) -> Result<()>`
- Reads status file, clears paused state (including circuit breaker), writes back atomically
- Prints confirmation with previous reason
src/cli/commands/service/doctor.rs (NEW)
- `handle_doctor(config_override, offline, fix, robot_mode, start) -> Result<()>`
- Runs diagnostic checks: config, token, binary, data dir, platform prerequisites, install state
- Skips network checks when `--offline`
- `--fix`: attempts safe, non-destructive remediations (create dirs, fix permissions, daemon-reload). Reports each applied fix.
- Reports pass/warn/fail per check
- Also used as pre-flight by `handle_install` (as an internal function call, without `--fix`)
src/cli/commands/service/run.rs (NEW)
- `handle_service_run(service_id: &str, start) -> Result<()>`
- The hidden scheduled execution entrypoint; `service_id` is injected by the scheduler command line
- Reads manifest for the given `service_id` to get profile/interval/max_transient_failures/circuit_breaker_cooldown_seconds
- Checks paused state with half-open transition (cooldown check), backoff (via persisted next_retry_at_ms), pipeline lock
- Writes `current_run` metadata (started_at_ms, pid) to status file before sync for stale-run detection; clears it on completion
- Executes sync stage-by-stage, records per-stage outcomes with `error_code` propagation
- Classifies: success / degraded / failed
- Respects server-provided `Retry-After` hints when computing backoff (via `extract_retry_after_hint`)
- Circuit breaker check on transient failure count; records `circuit_breaker_paused_at_ms` for cooldown
- Half-open probe: if probe succeeds, auto-closes circuit breaker; if fails, returns to paused with new timestamp
- Performs log rotation check before executing sync
- Updates status atomically
- Always robot mode, always exit 0
src/cli/commands/service/list.rs (NEW)
- `handle_list(robot_mode, start) -> Result<()>`
- Scans `{data_dir}` for `service-manifest-*.json` files
- Reads each manifest, verifies platform state, detects drift
- Outputs summary in robot JSON or human-readable table
src/cli/commands/service/pause.rs (NEW)
- `handle_pause(service_selector, reason, robot_mode, start) -> Result<()>`
- Resolves service, writes `paused_reason` to status file
- Does NOT modify OS scheduler (service stays installed and scheduled — it just no-ops)
- Reports `already_paused: true` if already paused (updates reason)
src/cli/commands/service/trigger.rs (NEW)
- `handle_trigger(service_selector, ignore_backoff, robot_mode, start) -> Result<()>`
- Resolves service, reads manifest for profile
- Delegates to `handle_service_run` logic with optional backoff bypass
- Still respects paused state (use `resume` first)
src/cli/commands/service/repair.rs (NEW)
- `handle_repair(service_selector, robot_mode, start) -> Result<()>`
- Validates manifest and status files for JSON parseability
- Corrupt files: renamed to `{name}.corrupt.{timestamp}` (backup, never delete)
- Status file: reinitialized to default
- Manifest: cleared, advises reinstall
- Reports what was repaired
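The quarantine step can be sketched directly; the `quarantine_corrupt` helper name is illustrative, not part of the plan's API:

```rust
use std::path::{Path, PathBuf};

/// Rename a corrupt state file aside (never delete) so operators can
/// inspect it later. Name format follows the plan: {name}.corrupt.{timestamp}.
fn quarantine_corrupt(path: &Path, now_ms: i64) -> std::io::Result<PathBuf> {
    let name = path.file_name().and_then(|n| n.to_str()).unwrap_or("state");
    let backup = path.with_file_name(format!("{name}.corrupt.{now_ms}"));
    std::fs::rename(path, &backup)?;
    Ok(backup)
}
```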
src/cli/commands/service/platform/mod.rs (NEW)
- `#[cfg]`-gated imports and dispatch functions (all take `service_id`)
- `fn xml_escape(s: &str) -> String` helper (used by launchd)
- `fn run_cmd(program, args, timeout_secs) -> Result<String>` — shared command runner with kill+reap on timeout
- `fn wait_with_timeout_kill_and_reap(child, timeout_secs) -> Result<Output>` — timeout handler that kills and reaps child process
- `fn write_token_env_file(data_dir, service_id, token_env_var, token_value) -> Result<PathBuf>` — token storage
- `fn write_wrapper_script(data_dir, service_id, binary_path, token_env_var, config_path) -> Result<PathBuf>` — macOS wrapper script for runtime env loading (mode 0700)
- `fn check_prerequisites() -> Vec<DiagnosticCheck>` — platform-specific pre-flight
- `fn write_atomic(path: &Path, content: &str) -> std::io::Result<()>` — shared atomic write helper (tmp + fsync(file) + rename + fsync(parent_dir) for power-loss durability)
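A sketch of the shared `write_atomic` helper described above. The `.tmp` sibling-file naming is an assumption; the plan only specifies tmp + fsync + rename + parent-directory fsync:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

/// Write to a sibling tmp file, fsync it, rename over the target, then
/// fsync the parent directory so the rename itself survives power loss.
fn write_atomic(path: &Path, content: &str) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp"); // sibling-file naming is an assumption
    let mut f = File::create(&tmp)?;
    f.write_all(content.as_bytes())?;
    f.sync_all()?; // fsync file contents before the rename
    fs::rename(&tmp, path)?; // atomic replace on POSIX filesystems
    if let Some(parent) = path.parent() {
        // fsync the directory entry; ignore failure where unsupported
        let _ = File::open(parent).and_then(|d| d.sync_all());
    }
    Ok(())
}
```

A reader of the target path sees either the old contents or the new contents, never a partial write, which is what the status-file tests below rely on.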
src/cli/commands/service/platform/launchd.rs (NEW, #[cfg(target_os = "macos")])
- `fn plist_path(service_id: &str) -> PathBuf` — `~/Library/LaunchAgents/com.gitlore.sync.{service_id}.plist`
- `fn generate_plist(service_id, binary_path, config_path, interval_seconds, token_env_var, token_value, token_source, log_dir, data_dir) -> String` — generates plist with wrapper script (env-file) or direct invocation (embedded)
- `fn generate_plist_with_wrapper(service_id, wrapper_path, interval_seconds, log_dir) -> String` — env-file variant: ProgramArguments points to wrapper script
- `fn generate_plist_with_embedded(service_id, binary_path, config_path, interval_seconds, token_env_var, token_value, log_dir) -> String` — embedded variant: token in EnvironmentVariables
- `fn install(service_id, ...) -> Result<InstallResult>`
- `fn uninstall(service_id) -> Result<UninstallResult>`
- `fn is_installed(service_id) -> bool`
- `fn get_state(service_id) -> Option<String>`
- `fn get_interval_seconds(service_id) -> u64`
- `fn check_prerequisites() -> Vec<DiagnosticCheck>` — GUI session check
- Unit tests: `test_generate_plist_with_wrapper()` — verify wrapper path in ProgramArguments, no token in plist
- Unit tests: `test_generate_plist_with_embedded()` — verify token in EnvironmentVariables
- Unit tests: XML escaping, service_id in label
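A sketch of `xml_escape` as the launchd tests exercise it. Escaping `&` first avoids double-escaping the entities produced by later replacements; exactly which characters beyond `&`, `<`, `>` the real helper escapes is an assumption, but escaping the quotes too keeps it safe in attribute context:

```rust
/// Escape the five XML special characters. The '&' replacement must run
/// first, otherwise the '&' in "&lt;" etc. would be escaped again.
fn xml_escape(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
        .replace('\'', "&apos;")
}
```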
src/cli/commands/service/platform/systemd.rs (NEW, #[cfg(target_os = "linux")])
- `fn unit_dir() -> PathBuf` — `~/.config/systemd/user/`
- `fn generate_service(service_id, binary_path, config_path, token_env_var, token_value, token_source, data_dir) -> String` — includes hardening directives
- `fn generate_timer(service_id, interval_seconds) -> String`
- `fn install(service_id, ...) -> Result<InstallResult>`
- `fn uninstall(service_id) -> Result<UninstallResult>`
- Same query functions as launchd (all scoped by service_id)
- `fn check_prerequisites() -> Vec<DiagnosticCheck>` — user manager + linger checks
- Unit tests: `test_generate_service()` (both env-file and embedded, verify hardening), `test_generate_timer()`
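A sketch of `generate_timer` consistent with the timer-unit test in this plan (service_id in the description, `OnUnitInactiveSec` interval). The `RandomizedDelaySec` line and the exact `Description` wording are assumptions, not specified by the plan:

```rust
/// Generate a systemd user timer unit. OnUnitInactiveSec (rather than
/// OnCalendar) measures the interval from the end of the previous run,
/// so a slow sync never overlaps the next trigger.
fn generate_timer(service_id: &str, interval_seconds: u64) -> String {
    format!(
        "[Unit]\nDescription=gitlore scheduled sync ({service_id})\n\n\
         [Timer]\nOnUnitInactiveSec={interval_seconds}s\nRandomizedDelaySec=60\n\n\
         [Install]\nWantedBy=timers.target\n"
    )
}
```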
src/cli/commands/service/platform/schtasks.rs (NEW, #[cfg(target_os = "windows")])
- `fn install(service_id, ...) -> Result<InstallResult>`
- `fn uninstall(service_id) -> Result<UninstallResult>`
- Same query functions (scoped by service_id)
- `fn check_prerequisites() -> Vec<DiagnosticCheck>` — schtasks availability
- Note: `token_source: "system_env"` — token must be in system environment
Testing Strategy
Test Infrastructure
Fake clock for deterministic time-dependent tests:
/// Test clock with controllable time
struct FakeClock {
now_ms: i64,
}
impl Clock for FakeClock {
fn now_ms(&self) -> i64 {
self.now_ms
}
}
Fake RNG for deterministic jitter tests:
/// Test RNG that returns a predetermined sequence of values
struct FakeJitterRng {
values: Vec<f64>,
index: usize,
}
impl FakeJitterRng {
fn new(values: Vec<f64>) -> Self {
Self { values, index: 0 }
}
}
impl JitterRng for FakeJitterRng {
fn next_f64(&mut self) -> f64 {
let val = self.values[self.index % self.values.len()];
self.index += 1;
val
}
}
This eliminates all time- and randomness-dependent flakiness. Every test sets an explicit "now" and jitter value, then asserts exact results.
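The backoff tests below pin down a specific shape: exponential doubling per failure, jitter as a multiplier in [0, 1], a floor at the base interval, a 4-hour cap, and a server `Retry-After` hint that wins only when it is larger than the computed value. One pure-function formulation consistent with those expectations (the exact formula is an assumption reverse-engineered from the test cases, not stated elsewhere in the plan):

```rust
/// Pure backoff computation in seconds (sketch; the real set_backoff
/// persists the result as next_retry_at_ms). failures >= 1, jitter in [0, 1].
fn compute_backoff_secs(base: u64, failures: u32, jitter: f64, hint: Option<u64>) -> u64 {
    const CAP_SECS: u64 = 4 * 60 * 60; // 4-hour ceiling
    // Double once per failure beyond the first; exponent bounded to avoid overflow.
    let doubled = base.saturating_mul(2u64.saturating_pow(failures.saturating_sub(1).min(16)));
    let jittered = (doubled.min(CAP_SECS) as f64 * jitter) as u64;
    // Never retry sooner than the configured interval, never later than the cap.
    let computed = jittered.clamp(base.min(CAP_SECS), CAP_SECS);
    match hint {
        Some(h) if h > computed => h, // server Retry-After wins only when larger
        _ => computed,
    }
}
```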
Unit Tests (in src/core/sync_status.rs)
#[cfg(test)]
mod tests {
use super::*;
use tempfile::TempDir;
struct FakeClock { now_ms: i64 }
impl Clock for FakeClock {
fn now_ms(&self) -> i64 { self.now_ms }
}
struct FakeJitterRng { value: f64 }
impl FakeJitterRng {
fn new(value: f64) -> Self {
Self { value }
}
}
impl JitterRng for FakeJitterRng {
fn next_f64(&mut self) -> f64 {
self.value
}
}
// --- Interval parsing ---
#[test]
fn parse_interval_valid_minutes() {
assert_eq!(parse_interval("5m").unwrap(), 300);
assert_eq!(parse_interval("15m").unwrap(), 900);
assert_eq!(parse_interval("30m").unwrap(), 1800);
}
#[test]
fn parse_interval_valid_hours() {
assert_eq!(parse_interval("1h").unwrap(), 3600);
assert_eq!(parse_interval("2h").unwrap(), 7200);
assert_eq!(parse_interval("24h").unwrap(), 86400);
}
#[test]
fn parse_interval_too_short() {
assert!(parse_interval("1m").is_err());
assert!(parse_interval("4m").is_err());
}
#[test]
fn parse_interval_too_long() {
assert!(parse_interval("25h").is_err());
}
#[test]
fn parse_interval_invalid() {
assert!(parse_interval("0m").is_err());
assert!(parse_interval("abc").is_err());
assert!(parse_interval("").is_err());
assert!(parse_interval("m").is_err());
assert!(parse_interval("10x").is_err());
assert!(parse_interval("30s").is_err()); // seconds not supported
}
#[test]
fn parse_interval_trims_whitespace() {
assert_eq!(parse_interval(" 30m ").unwrap(), 1800);
}
// --- Status file persistence ---
#[test]
fn status_file_round_trip() {
let dir = TempDir::new().unwrap();
let path = dir.path().join("sync-status-test1234.json");
let mut status = SyncStatusFile::default();
let run = SyncRunRecord {
timestamp_iso: "2026-02-09T10:30:00Z".to_string(),
timestamp_ms: 1_770_609_000_000,
duration_seconds: 12.5,
outcome: "success".to_string(),
stage_results: vec![
StageResult { stage: "issues".into(), success: true, items_updated: 5, error: None, error_code: None },
StageResult { stage: "mrs".into(), success: true, items_updated: 3, error: None, error_code: None },
],
error_message: None,
};
status.record_run(run);
status.write_atomic(&path).unwrap();
let loaded = SyncStatusFile::read(&path).unwrap().unwrap();
assert_eq!(loaded.last_run.as_ref().unwrap().outcome, "success");
assert_eq!(loaded.last_run.as_ref().unwrap().stage_results.len(), 2);
assert_eq!(loaded.consecutive_failures, 0);
assert_eq!(loaded.recent_runs.len(), 1);
assert_eq!(loaded.schema_version, 1);
}
#[test]
fn status_file_read_missing_returns_ok_none() {
let dir = TempDir::new().unwrap();
let path = dir.path().join("nonexistent.json");
assert!(SyncStatusFile::read(&path).unwrap().is_none());
}
#[test]
fn status_file_read_corrupt_returns_err() {
let dir = TempDir::new().unwrap();
let path = dir.path().join("corrupt.json");
std::fs::write(&path, "not valid json{{{").unwrap();
assert!(SyncStatusFile::read(&path).is_err());
}
#[test]
fn status_file_atomic_write_survives_crash() {
// Verify no partial writes by checking file is valid JSON after write
let dir = TempDir::new().unwrap();
let path = dir.path().join("sync-status-test1234.json");
let status = SyncStatusFile::default();
status.write_atomic(&path).unwrap();
// Read back and verify
let loaded = SyncStatusFile::read(&path).unwrap().unwrap();
assert_eq!(loaded.schema_version, 1);
}
#[test]
fn record_run_caps_at_10() {
let mut status = SyncStatusFile::default();
for i in 0..15 {
status.record_run(make_run(i * 1000, "success"));
}
assert_eq!(status.recent_runs.len(), 10);
}
#[test]
fn default_status_has_no_last_run() {
let status = SyncStatusFile::default();
assert!(status.last_run.is_none());
}
// --- Backoff (deterministic via FakeClock + persisted next_retry_at_ms) ---
#[test]
fn backoff_returns_none_when_zero_failures() {
let status = make_status("success", 0, 100_000);
let clock = FakeClock { now_ms: 200_000 };
assert!(status.backoff_remaining(&clock).is_none());
}
#[test]
fn backoff_returns_none_when_no_next_retry() {
let mut status = make_status("failed", 1, 100_000_000);
status.next_retry_at_ms = None;
let clock = FakeClock { now_ms: 200_000_000 };
assert!(status.backoff_remaining(&clock).is_none());
}
#[test]
fn backoff_active_within_window() {
let mut status = make_status("failed", 1, 100_000_000);
status.next_retry_at_ms = Some(100_000_000 + 1_800_000); // 30 min from now
let clock = FakeClock { now_ms: 100_000_000 + 1000 }; // 1s after failure
let remaining = status.backoff_remaining(&clock);
assert!(remaining.is_some());
assert_eq!(remaining.unwrap(), 1799);
}
#[test]
fn backoff_expired() {
let mut status = make_status("failed", 1, 100_000_000);
status.next_retry_at_ms = Some(100_000_000 + 1_800_000);
let clock = FakeClock { now_ms: 100_000_000 + 2_000_000 }; // past retry time
assert!(status.backoff_remaining(&clock).is_none());
}
#[test]
fn set_backoff_persists_next_retry() {
let mut status = make_status("failed", 1, 100_000_000);
let clock = FakeClock { now_ms: 100_000_000 };
let mut rng = FakeJitterRng::new(0.5); // 0.5 for deterministic
status.set_backoff(1800, &clock, &mut rng, None);
assert!(status.next_retry_at_ms.is_some());
// With jitter=0.5, backoff = max(1800*0.5, 1800) = 1800s
let expected_ms = 100_000_000 + 1_800_000;
assert_eq!(status.next_retry_at_ms.unwrap(), expected_ms);
}
#[test]
fn set_backoff_caps_at_4_hours() {
let mut status = make_status("failed", 20, 100_000_000);
let clock = FakeClock { now_ms: 100_000_000 };
let mut rng = FakeJitterRng::new(1.0); // max jitter
status.set_backoff(1800, &clock, &mut rng, None);
// Cap: 4h = 14400s, jitter=1.0: max(14400*1.0, 1800) = 14400
let max_ms = 100_000_000 + 14_400_000;
assert!(status.next_retry_at_ms.unwrap() <= max_ms);
}
#[test]
fn set_backoff_minimum_is_base_interval() {
let mut status = make_status("failed", 1, 100_000_000);
let clock = FakeClock { now_ms: 100_000_000 };
let mut rng = FakeJitterRng::new(0.0); // min jitter
status.set_backoff(1800, &clock, &mut rng, None);
// jitter=0.0: max(1800*0.0, 1800) = 1800 (minimum enforced)
let expected_ms = 100_000_000 + 1_800_000;
assert_eq!(status.next_retry_at_ms.unwrap(), expected_ms);
}
#[test]
fn set_backoff_respects_retry_after_hint() {
let mut status = make_status("failed", 1, 100_000_000);
let clock = FakeClock { now_ms: 100_000_000 };
let mut rng = FakeJitterRng::new(0.0); // min jitter => computed backoff = 1800s
let hint = 100_000_000 + 3_600_000; // server says retry after 1 hour
status.set_backoff(1800, &clock, &mut rng, Some(hint));
// Hint (1h) > computed backoff (30m), so hint wins
assert_eq!(status.next_retry_at_ms.unwrap(), hint);
}
#[test]
fn set_backoff_ignores_hint_when_computed_is_larger() {
let mut status = make_status("failed", 1, 100_000_000);
let clock = FakeClock { now_ms: 100_000_000 };
let mut rng = FakeJitterRng::new(0.0);
let hint = 100_000_000 + 60_000; // server says retry after 1 minute
status.set_backoff(1800, &clock, &mut rng, Some(hint));
// Computed (30m) > hint (1m), so computed wins
let expected_ms = 100_000_000 + 1_800_000;
assert_eq!(status.next_retry_at_ms.unwrap(), expected_ms);
}
#[test]
fn set_backoff_uses_configured_interval_not_hardcoded() {
let mut status1 = make_status("failed", 1, 100_000_000);
let mut status2 = make_status("failed", 1, 100_000_000);
let clock = FakeClock { now_ms: 100_000_000 };
let mut rng = FakeJitterRng::new(0.5);
status1.set_backoff(300, &clock, &mut rng, None); // 5m base
rng.value = 0.5; // reset
status2.set_backoff(3600, &clock, &mut rng, None); // 1h base
// 5m base should produce shorter backoff than 1h base
assert!(status1.next_retry_at_ms.unwrap() < status2.next_retry_at_ms.unwrap());
}
#[test]
fn backoff_skips_when_paused() {
let mut status = make_status("failed", 3, 100_000_000);
status.paused_reason = Some("AUTH_FAILED".to_string());
status.next_retry_at_ms = Some(100_000_000 + 999_999_999);
let clock = FakeClock { now_ms: 100_000_000 + 1000 };
// Paused state is checked separately, backoff_remaining returns None
assert!(status.backoff_remaining(&clock).is_none());
}
// --- Error classification ---
#[test]
fn permanent_errors_classified_correctly() {
assert!(is_permanent_error(&ErrorCode::TokenNotSet));
assert!(is_permanent_error(&ErrorCode::AuthFailed));
assert!(is_permanent_error(&ErrorCode::ConfigNotFound));
assert!(is_permanent_error(&ErrorCode::ConfigInvalid));
assert!(is_permanent_error(&ErrorCode::MigrationFailed));
}
#[test]
fn transient_errors_classified_correctly() {
assert!(!is_permanent_error(&ErrorCode::NetworkError));
assert!(!is_permanent_error(&ErrorCode::RateLimited));
assert!(!is_permanent_error(&ErrorCode::DbLocked));
assert!(!is_permanent_error(&ErrorCode::DbError));
assert!(!is_permanent_error(&ErrorCode::InternalError));
}
// --- Stage-aware outcomes ---
#[test]
fn degraded_outcome_does_not_count_as_failure() {
// When core stages succeed but optional stages fail, consecutive_failures should reset
let mut status = make_status("failed", 3, 100_000_000);
status.next_retry_at_ms = Some(200_000_000);
// Simulate degraded outcome clearing failure state
status.consecutive_failures = 0;
status.next_retry_at_ms = None;
assert_eq!(status.consecutive_failures, 0);
assert!(status.next_retry_at_ms.is_none());
}
// --- Degraded outcome (service run only, NOT manual sync) ---
// (Manually test degraded state by running with --profile full when Ollama is down:
// embeddings should fail, but issues/MRs should succeed)
#[test]
fn service_run_degraded_outcome_clears_failures() {
let mut status = make_status("failed", 3, 100_000_000);
status.consecutive_failures = 3;
status.next_retry_at_ms = Some(200_000_000);
// Simulate degraded outcome clearing failure state
status.consecutive_failures = 0;
status.next_retry_at_ms = None;
assert_eq!(status.consecutive_failures, 0);
assert!(status.next_retry_at_ms.is_none());
}
// --- Circuit breaker ---
#[test]
fn circuit_breaker_trips_at_threshold() {
let mut status = make_status("failed", 9, 100_000_000);
// Incrementing to 10 should trigger circuit breaker
status.consecutive_failures = status.consecutive_failures.saturating_add(1);
assert_eq!(status.consecutive_failures, 10);
// Caller would set paused_reason = "CIRCUIT_BREAKER"
}
// --- Paused state (permanent error) ---
#[test]
fn clear_paused_resets_all_fields() {
let mut status = make_status("failed", 5, 100_000_000);
status.paused_reason = Some("AUTH_FAILED: 401 Unauthorized".to_string());
status.last_error_code = Some("AUTH_FAILED".to_string());
status.last_error_message = Some("401 Unauthorized".to_string());
status.next_retry_at_ms = Some(200_000_000);
status.circuit_breaker_paused_at_ms = Some(100_000_000);
status.clear_paused();
assert!(status.paused_reason.is_none());
assert!(status.circuit_breaker_paused_at_ms.is_none());
assert!(status.last_error_code.is_none());
assert!(status.last_error_message.is_none());
assert!(status.next_retry_at_ms.is_none());
assert_eq!(status.consecutive_failures, 0);
}
#[test]
fn clear_paused_also_clears_circuit_breaker() {
let mut status = make_status("failed", 10, 100_000_000);
status.paused_reason = Some("CIRCUIT_BREAKER: 10 consecutive transient failures".to_string());
status.clear_paused();
assert!(status.paused_reason.is_none());
assert_eq!(status.consecutive_failures, 0);
}
fn make_run(ts_ms: i64, outcome: &str) -> SyncRunRecord {
SyncRunRecord {
timestamp_iso: String::new(),
timestamp_ms: ts_ms,
duration_seconds: 1.0,
outcome: outcome.to_string(),
stage_results: vec![],
error_message: if outcome == "failed" {
Some("test error".into())
} else {
None
},
}
}
fn make_stage_result(stage: &str, success: bool, error_code: Option<&str>) -> StageResult {
StageResult {
stage: stage.to_string(),
success,
items_updated: if success { 5 } else { 0 },
error: if success { None } else { Some("test error".into()) },
error_code: error_code.map(|s| s.to_string()),
}
}
fn make_status(outcome: &str, failures: u32, ts_ms: i64) -> SyncStatusFile {
let run = make_run(ts_ms, outcome);
SyncStatusFile {
schema_version: 1,
updated_at_iso: String::new(),
last_run: Some(run.clone()),
recent_runs: vec![run],
consecutive_failures: failures,
next_retry_at_ms: None,
paused_reason: None,
circuit_breaker_paused_at_ms: None,
last_error_code: None,
last_error_message: None,
current_run: None,
}
}
}
Service Manifest Tests (in src/core/service_manifest.rs)
#[cfg(test)]
mod tests {
use super::*;
use tempfile::TempDir;
#[test]
fn manifest_round_trip() {
let dir = TempDir::new().unwrap();
let path = dir.path().join("manifest.json");
let manifest = ServiceManifest {
schema_version: 1,
service_id: "a1b2c3d4e5f6".to_string(),
workspace_root: "/Users/x/projects/my-project".to_string(),
installed_at_iso: "2026-02-09T10:00:00Z".to_string(),
updated_at_iso: "2026-02-09T10:00:00Z".to_string(),
platform: "launchd".to_string(),
interval_seconds: 900,
profile: "fast".to_string(),
binary_path: "/usr/local/bin/lore".to_string(),
config_path: None,
token_source: "env_file".to_string(),
token_env_var: "GITLAB_TOKEN".to_string(),
service_files: vec!["/Users/x/Library/LaunchAgents/com.gitlore.sync.a1b2c3d4e5f6.plist".to_string()],
sync_command: "/usr/local/bin/lore --robot service run".to_string(),
max_transient_failures: 10,
circuit_breaker_cooldown_seconds: 1800,
spec_hash: "abc123def456".to_string(),
};
manifest.write_atomic(&path).unwrap();
let loaded = ServiceManifest::read(&path).unwrap().unwrap();
assert_eq!(loaded.profile, "fast");
assert_eq!(loaded.interval_seconds, 900);
assert_eq!(loaded.service_id, "a1b2c3d4e5f6");
assert_eq!(loaded.max_transient_failures, 10);
assert_eq!(loaded.circuit_breaker_cooldown_seconds, 1800);
}
#[test]
fn manifest_read_missing_returns_ok_none() {
let dir = TempDir::new().unwrap();
assert!(ServiceManifest::read(&dir.path().join("nope.json")).unwrap().is_none());
}
#[test]
fn manifest_read_corrupt_returns_err() {
let dir = TempDir::new().unwrap();
let path = dir.path().join("bad.json");
std::fs::write(&path, "{{{{").unwrap();
assert!(ServiceManifest::read(&path).is_err());
}
#[test]
fn profile_to_sync_args_fast() {
let m = make_manifest("fast");
assert_eq!(m.profile_to_sync_args(), vec!["--no-docs", "--no-embed"]);
}
#[test]
fn profile_to_sync_args_balanced() {
let m = make_manifest("balanced");
assert_eq!(m.profile_to_sync_args(), vec!["--no-embed"]);
}
#[test]
fn profile_to_sync_args_full() {
let m = make_manifest("full");
assert!(m.profile_to_sync_args().is_empty());
}
#[test]
fn compute_service_id_deterministic() {
let urls = ["https://gitlab.com/group/repo"];
let id1 = compute_service_id(Path::new("/home/user/project"), Path::new("/home/user/.config/lore/config.json"), &urls);
let id2 = compute_service_id(Path::new("/home/user/project"), Path::new("/home/user/.config/lore/config.json"), &urls);
assert_eq!(id1, id2);
assert_eq!(id1.len(), 12);
}
#[test]
fn compute_service_id_different_workspaces() {
let urls = ["https://gitlab.com/group/repo"];
let config = Path::new("/home/user/.config/lore/config.json");
let id1 = compute_service_id(Path::new("/home/user/project-a"), config, &urls);
let id2 = compute_service_id(Path::new("/home/user/project-b"), config, &urls);
assert_ne!(id1, id2); // Same config, different workspace => different IDs
}
#[test]
fn compute_service_id_different_configs() {
let urls = ["https://gitlab.com/group/repo"];
let workspace = Path::new("/home/user/project");
let id1 = compute_service_id(workspace, Path::new("/home/user1/config.json"), &urls);
let id2 = compute_service_id(workspace, Path::new("/home/user2/config.json"), &urls);
assert_ne!(id1, id2);
}
#[test]
fn compute_service_id_different_projects_same_config() {
let workspace = Path::new("/home/user/project");
let config = Path::new("/home/user/.config/lore/config.json");
let id1 = compute_service_id(workspace, config, &["https://gitlab.com/group/repo-a"]);
let id2 = compute_service_id(workspace, config, &["https://gitlab.com/group/repo-b"]);
assert_ne!(id1, id2); // Same config path, different projects => different IDs
}
#[test]
fn compute_service_id_url_order_independent() {
let workspace = Path::new("/home/user/project");
let config = Path::new("/config.json");
let id1 = compute_service_id(workspace, config, &["https://gitlab.com/a", "https://gitlab.com/b"]);
let id2 = compute_service_id(workspace, config, &["https://gitlab.com/b", "https://gitlab.com/a"]);
assert_eq!(id1, id2); // Order should not matter (sorted internally)
}
#[test]
fn sanitize_service_name_valid() {
assert_eq!(sanitize_service_name("my-project").unwrap(), "my-project");
assert_eq!(sanitize_service_name("MyProject").unwrap(), "myproject");
}
#[test]
fn sanitize_service_name_special_chars() {
assert_eq!(sanitize_service_name("my project!").unwrap(), "my-project-");
}
#[test]
fn sanitize_service_name_empty_rejects() {
assert!(sanitize_service_name("---").is_err());
assert!(sanitize_service_name("").is_err());
}
#[test]
fn sanitize_service_name_too_long() {
let long_name = "a".repeat(33);
assert!(sanitize_service_name(&long_name).is_err());
}
fn make_manifest(profile: &str) -> ServiceManifest { /* ... */ }
}
Platform-Specific Unit Tests
// In platform/launchd.rs
#[cfg(test)]
mod tests {
use super::*;
// --- Wrapper script variant (env-file, default) ---
#[test]
fn plist_wrapper_contains_scoped_label() {
let plist = generate_plist_with_wrapper("abc123", Path::new("/data/service-run-abc123.sh"), 1800, Path::new("/tmp/logs"));
assert!(plist.contains("<string>com.gitlore.sync.abc123</string>"));
}
#[test]
fn plist_wrapper_invokes_wrapper_not_lore_directly() {
let plist = generate_plist_with_wrapper("abc123", Path::new("/data/service-run-abc123.sh"), 1800, Path::new("/tmp/logs"));
assert!(plist.contains("<string>/data/service-run-abc123.sh</string>"));
// Should NOT contain direct lore invocation args
assert!(!plist.contains("<string>--robot</string>"));
assert!(!plist.contains("<string>service</string>"));
}
#[test]
fn plist_wrapper_does_not_contain_token() {
let plist = generate_plist_with_wrapper("abc123", Path::new("/data/service-run-abc123.sh"), 1800, Path::new("/tmp/logs"));
assert!(!plist.contains("GITLAB_TOKEN"));
assert!(!plist.contains("glpat"));
}
#[test]
fn plist_wrapper_contains_interval() {
let plist = generate_plist_with_wrapper("abc123", Path::new("/data/service-run-abc123.sh"), 900, Path::new("/tmp/logs"));
assert!(plist.contains("<integer>900</integer>"));
}
// --- Embedded variant ---
#[test]
fn plist_embedded_contains_token() {
let plist = generate_plist_with_embedded("abc123", "/usr/local/bin/lore", None, 1800, "GITLAB_TOKEN", "glpat-xxx", Path::new("/tmp/logs"));
assert!(plist.contains("GITLAB_TOKEN"));
assert!(plist.contains("glpat-xxx"));
}
#[test]
fn plist_embedded_invokes_lore_directly() {
let plist = generate_plist_with_embedded("abc123", "/usr/local/bin/lore", None, 1800, "GITLAB_TOKEN", "glpat-xxx", Path::new("/tmp/logs"));
assert!(plist.contains("<string>--robot</string>"));
assert!(plist.contains("<string>service</string>"));
assert!(plist.contains("<string>run</string>"));
}
#[test]
fn plist_embedded_xml_escapes_token() {
let plist = generate_plist_with_embedded(
"abc123", "/usr/local/bin/lore", None, 1800, "GITLAB_TOKEN", "tok&en<>", Path::new("/tmp/logs"),
);
assert!(plist.contains("tok&amp;en&lt;&gt;"));
assert!(!plist.contains("tok&en<>"));
}
#[test]
fn plist_xml_escapes_paths_with_special_chars() {
let plist = generate_plist_with_embedded(
"abc123", "/Users/O'Brien/bin/lore", None, 1800, "GITLAB_TOKEN", "glpat-xxx",
Path::new("/tmp/logs"),
);
assert!(plist.contains("O&apos;Brien"));
}
// --- Shared plist properties ---
#[test]
fn plist_has_background_process_type() {
let plist = generate_plist_with_wrapper("abc123", Path::new("/data/service-run-abc123.sh"), 1800, Path::new("/tmp/logs"));
assert!(plist.contains("<string>Background</string>"));
assert!(plist.contains("<integer>10</integer>")); // Nice
}
#[test]
fn plist_embedded_includes_config_path_when_provided() {
let plist = generate_plist_with_embedded("abc123", "/usr/local/bin/lore", Some("/custom/config.json"), 1800, "GITLAB_TOKEN", "glpat-xxx", Path::new("/tmp/logs"));
assert!(plist.contains("LORE_CONFIG_PATH"));
assert!(plist.contains("/custom/config.json"));
}
}
// In platform/systemd.rs
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn service_unit_contains_hardening() {
let unit = generate_service("abc123", "/usr/local/bin/lore", None, "GITLAB_TOKEN", "glpat-xxx", "env-file", Path::new("/data"));
assert!(unit.contains("NoNewPrivileges=true"));
assert!(unit.contains("PrivateTmp=true"));
assert!(unit.contains("ProtectSystem=strict"));
assert!(unit.contains("ProtectHome=read-only"));
assert!(unit.contains("TimeoutStartSec=900"));
assert!(unit.contains("WorkingDirectory=/data"));
assert!(unit.contains("SuccessExitStatus=0"));
}
#[test]
fn service_unit_env_file_mode() {
let unit = generate_service("abc123", "/usr/local/bin/lore", None, "GITLAB_TOKEN", "glpat-xxx", "env-file", Path::new("/data"));
assert!(unit.contains("EnvironmentFile=/data/service-env-abc123"));
assert!(!unit.contains("Environment=GITLAB_TOKEN="));
}
#[test]
fn service_unit_embedded_mode() {
let unit = generate_service("abc123", "/usr/local/bin/lore", None, "GITLAB_TOKEN", "glpat-xxx", "embedded", Path::new("/data"));
assert!(unit.contains("Environment=GITLAB_TOKEN=glpat-xxx"));
assert!(!unit.contains("EnvironmentFile="));
}
#[test]
fn timer_unit_contains_scoped_description() {
let timer = generate_timer("abc123", 900);
assert!(timer.contains("abc123"));
assert!(timer.contains("OnUnitInactiveSec=900s"));
}
}
Integration Tests (CLI parsing)
// In service/mod.rs
#[cfg(test)]
mod tests {
use clap::Parser;
use crate::cli::{Cli, Commands, ServiceCommand};
#[test]
fn parse_service_install_default() {
let cli = Cli::try_parse_from(["lore", "service", "install"]).unwrap();
match cli.command {
Some(Commands::Service { command: ServiceCommand::Install { interval, profile, token_source, name } }) => {
assert_eq!(interval, "30m");
assert_eq!(profile, "balanced");
assert_eq!(token_source, "env-file");
assert!(name.is_none());
}
_ => panic!("Expected Service Install"),
}
}
#[test]
fn parse_service_install_all_flags() {
let cli = Cli::try_parse_from([
"lore", "service", "install",
"--interval", "1h",
"--profile", "fast",
"--token-source", "embedded",
"--name", "my-project",
]).unwrap();
match cli.command {
Some(Commands::Service { command: ServiceCommand::Install { interval, profile, token_source, name } }) => {
assert_eq!(interval, "1h");
assert_eq!(profile, "fast");
assert_eq!(token_source, "embedded");
assert_eq!(name.as_deref(), Some("my-project"));
}
_ => panic!("Expected Service Install"),
}
}
#[test]
fn parse_service_uninstall() {
let cli = Cli::try_parse_from(["lore", "service", "uninstall"]).unwrap();
assert!(matches!(
cli.command,
Some(Commands::Service { command: ServiceCommand::Uninstall })
));
}
#[test]
fn parse_service_status() {
let cli = Cli::try_parse_from(["lore", "service", "status"]).unwrap();
assert!(matches!(
cli.command,
Some(Commands::Service { command: ServiceCommand::Status })
));
}
#[test]
fn parse_service_logs_default() {
let cli = Cli::try_parse_from(["lore", "service", "logs"]).unwrap();
assert!(matches!(
cli.command,
Some(Commands::Service { command: ServiceCommand::Logs { .. } })
));
}
#[test]
fn parse_service_logs_with_tail() {
let cli = Cli::try_parse_from(["lore", "service", "logs", "--tail", "50"]).unwrap();
match cli.command {
Some(Commands::Service { command: ServiceCommand::Logs { tail, .. } }) => assert_eq!(tail, 50),
_ => panic!("Expected Service Logs"),
}
}
#[test]
fn parse_service_resume() {
let cli = Cli::try_parse_from(["lore", "service", "resume"]).unwrap();
assert!(matches!(
cli.command,
Some(Commands::Service { command: ServiceCommand::Resume })
));
}
#[test]
fn parse_service_doctor() {
let cli = Cli::try_parse_from(["lore", "service", "doctor"]).unwrap();
assert!(matches!(
cli.command,
Some(Commands::Service { command: ServiceCommand::Doctor { .. } })
));
}
#[test]
fn parse_service_doctor_offline() {
let cli = Cli::try_parse_from(["lore", "service", "doctor", "--offline"]).unwrap();
match cli.command {
Some(Commands::Service { command: ServiceCommand::Doctor { offline, .. } }) => assert!(offline),
_ => panic!("Expected Service Doctor"),
}
}
#[test]
fn parse_service_run_hidden() {
let cli = Cli::try_parse_from(["lore", "service", "run", "--service-id", "abc123"]).unwrap();
assert!(matches!(
cli.command,
Some(Commands::Service { command: ServiceCommand::Run { .. } })
));
}
}
Behavioral Tests (service run isolation)
// Verify that manual sync path is NOT affected by service state
#[test]
fn manual_sync_ignores_backoff_state() {
// Create a status file with active backoff
let dir = TempDir::new().unwrap();
let status_path = dir.path().join("sync-status-test1234.json");
let mut status = make_status("failed", 5, chrono::Utc::now().timestamp_millis());
status.next_retry_at_ms = Some(chrono::Utc::now().timestamp_millis() + 999_999_999);
status.write_atomic(&status_path).unwrap();
// handle_sync_cmd should NOT read this file at all
// (verified by the absence of any backoff check in handle_sync_cmd)
}
// Verify service run respects paused state
#[test]
fn service_run_respects_paused_state() {
let mut status = SyncStatusFile::default();
status.paused_reason = Some("AUTH_FAILED".to_string());
// handle_service_run should check paused_reason BEFORE backoff
// and exit with action: "paused"
}
// Verify degraded outcome clears failure counter
#[test]
fn service_run_degraded_clears_failures() {
let mut status = make_status("failed", 3, 100_000_000);
status.next_retry_at_ms = Some(200_000_000);
// After a degraded run (core OK, optional failed):
status.consecutive_failures = 0;
status.next_retry_at_ms = None;
assert_eq!(status.consecutive_failures, 0);
}
// Verify circuit breaker trips at threshold
#[test]
fn service_run_circuit_breaker_trips() {
let mut status = make_status("failed", 9, 100_000_000);
status.consecutive_failures = status.consecutive_failures.saturating_add(1);
// At 10 failures, should set paused_reason
if status.consecutive_failures >= 10 {
status.paused_reason = Some("CIRCUIT_BREAKER".to_string());
}
assert!(status.paused_reason.is_some());
}
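The behavioral contracts above (paused state checked before backoff, backoff window skipping a scheduled run) imply a small decision function at the top of `service run`. A minimal sketch under those contracts — struct fields and names here are assumptions for illustration, not the plan's final API:

```rust
// Minimal subset of the status file relevant to the run decision (assumed fields).
struct SyncStatusFile {
    paused_reason: Option<String>,
    next_retry_at_ms: Option<i64>,
}

#[derive(Debug, PartialEq)]
enum RunAction {
    Paused,
    Skipped,
    Run,
}

fn decide_action(status: &SyncStatusFile, now_ms: i64) -> RunAction {
    // Paused state (manual pause, AUTH_FAILED, CIRCUIT_BREAKER) wins over backoff.
    if status.paused_reason.is_some() {
        return RunAction::Paused;
    }
    // Still inside the backoff window: skip this scheduled run.
    if let Some(retry_at) = status.next_retry_at_ms {
        if now_ms < retry_at {
            return RunAction::Skipped;
        }
    }
    RunAction::Run
}

fn main() {
    let paused = SyncStatusFile {
        paused_reason: Some("CIRCUIT_BREAKER".into()),
        next_retry_at_ms: Some(i64::MAX),
    };
    assert_eq!(decide_action(&paused, 0), RunAction::Paused);

    let backing_off = SyncStatusFile { paused_reason: None, next_retry_at_ms: Some(2_000) };
    assert_eq!(decide_action(&backing_off, 1_000), RunAction::Skipped);
    assert_eq!(decide_action(&backing_off, 3_000), RunAction::Run);
}
```

Keeping this as a pure function over the status snapshot makes the ordering (paused, then backoff, then run) directly unit-testable, which is exactly what the behavioral tests above exercise.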
New Dependencies
Two new crates:
| Crate | Version | Purpose | Justification |
|---|---|---|---|
| `sha2` | 0.10 | Compute `service_id` from config path | Small, well-audited, no-std compatible. Used for exactly one hash computation. |
| `hex` | 0.4 | Encode hash bytes to hex string | Tiny utility, widely used. |
Note on `rand`: the `JitterRng` trait uses `rand::thread_rng()` in production. Check whether `rand` is already a transitive dependency (via other crates). If so, add it as a direct dependency. If not, consider using a simpler PRNG or system randomness via `getrandom` to avoid pulling in the full `rand` crate for a single call site. The `JitterRng` trait abstracts this, so the implementation can change without affecting the API.
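If `rand` is dropped, a dependency-free fallback is enough for backoff jitter, which only needs "not synchronized across machines", not cryptographic quality. A sketch (type and method names are assumptions, not the plan's `JitterRng` API):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Assumed shape of the jitter abstraction: yields a factor in [0.0, 1.0).
trait JitterRng {
    fn next_fraction(&mut self) -> f64;
}

// PRNG-free fallback: splitmix64 steps seeded from the system clock.
struct ClockJitter {
    state: u64,
}

impl ClockJitter {
    fn new() -> Self {
        let seed = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_nanos() as u64)
            .unwrap_or(0x9e37_79b9_7f4a_7c15);
        Self { state: seed }
    }
}

impl JitterRng for ClockJitter {
    fn next_fraction(&mut self) -> f64 {
        // One splitmix64 step: cheap and statistically fine for jitter.
        self.state = self.state.wrapping_add(0x9e37_79b9_7f4a_7c15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
        z ^= z >> 31;
        // Top 53 bits -> uniform float in [0, 1).
        (z >> 11) as f64 / (1u64 << 53) as f64
    }
}

fn main() {
    let mut rng = ClockJitter::new();
    for _ in 0..1_000 {
        let f = rng.next_fraction();
        assert!((0.0..1.0).contains(&f));
    }
    println!("jitter ok");
}
```

Because callers only see the trait, swapping this for `rand::thread_rng()` (or `getrandom`) later is a one-file change.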
Existing dependencies used:
- `std::process::Command` — for `launchctl`, `systemctl`, `schtasks`
- `format!()` — for plist XML and systemd unit templates
- `std::env::current_exe()` — for binary path resolution
- `serde` + `serde_json` (existing) — for status/manifest files
- `chrono` (existing) — for timestamps
- `dirs` (existing) — for home directory
- `libc` (existing, unix only) — for `getuid()`
- `console` (existing) — for colored human output
- `tempfile` (existing, dev dep) — for test temp dirs
Implementation Order
Phase 1: Core types (standalone, fully testable)
- `Cargo.toml` — add `sha2`, `hex` dependencies (and `rand` if not already transitive)
- `src/core/sync_status.rs` — `SyncRunRecord`, `StageResult` (with `error_code`), `SyncStatusFile` (with `circuit_breaker_paused_at_ms`, `current_run`), `CurrentRunState`, `Clock` trait, `JitterRng` trait, `parse_interval`, `is_permanent_error`, `is_permanent_stage_error`, `is_circuit_breaker_half_open`, `extract_retry_after_hint`, atomic write helper, schema migration on read, all unit tests
- `src/core/service_manifest.rs` — `ServiceManifest` (with `circuit_breaker_cooldown_seconds`, `workspace_root`, `spec_hash`), `DiagnosticCheck`, `DiagnosticStatus`, `compute_service_id(workspace_root, config_path, project_urls)`, `sanitize_service_name`, `compute_spec_hash(service_files_content)`, profile mapping, atomic write helper, schema migration on read, unit tests
- `src/core/error.rs` — add `ServiceError`, `ServiceUnsupported`, `ServiceCommandFailed`, `ServiceCorruptState`
- `src/core/paths.rs` — add `get_service_status_path(service_id)`, `get_service_manifest_path(service_id)`, `get_service_env_path(service_id)`, `get_service_wrapper_path(service_id)`, `get_service_log_path(service_id, stream)`, `list_service_ids()`
- `src/core/mod.rs` — add `pub mod sync_status; pub mod service_manifest;`
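The `parse_interval` helper in `sync_status.rs` can be sketched with the standard library alone. This is a hedged sketch: the accepted suffixes (`s`, `m`, `h`) are inferred from the `--interval 30m` / `--interval 1h` examples elsewhere in this plan, not a final specification:

```rust
// Parse "900s", "30m", or "1h" into seconds (assumed accepted forms).
fn parse_interval(s: &str) -> Result<u64, String> {
    let s = s.trim();
    if s.len() < 2 || !s.is_ascii() {
        return Err(format!("invalid interval: {s:?}"));
    }
    // Split off the one-character unit suffix.
    let (num, unit) = s.split_at(s.len() - 1);
    let n: u64 = num
        .parse()
        .map_err(|_| format!("invalid number in interval: {s:?}"))?;
    match unit {
        "s" => Ok(n),
        "m" => Ok(n * 60),
        "h" => Ok(n * 3600),
        other => Err(format!("unknown interval unit: {other:?}")),
    }
}

fn main() {
    assert_eq!(parse_interval("30m"), Ok(1800));
    assert_eq!(parse_interval("1h"), Ok(3600));
    assert_eq!(parse_interval("900s"), Ok(900));
    assert!(parse_interval("5x").is_err());
    assert!(parse_interval("").is_err());
}
```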
Phase 2: Platform backends (parallelizable across platforms)
- `src/cli/commands/service/platform/mod.rs` — dispatch functions (with `service_id`), `run_cmd` (with kill+reap on timeout), `wait_with_timeout_kill_and_reap`, `xml_escape`, `write_token_env_file`, `write_wrapper_script`, `write_atomic`, `check_prerequisites`
- `src/cli/commands/service/platform/launchd.rs` — macOS backend with wrapper script (env-file) and embedded variants, project-scoped label + prerequisite checks + tests
- `src/cli/commands/service/platform/systemd.rs` — Linux backend with hardened unit (`WorkingDirectory`, `SuccessExitStatus`), project-scoped names, linger/user-manager checks + tests
- `src/cli/commands/service/platform/schtasks.rs` — Windows backend with project-scoped task name
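The `write_atomic` helper in `platform/mod.rs` might look like the following sketch: write to a temp file in the same directory, fsync, rename over the target, then fsync the parent directory (the rec-6 improvement accepted below). Exact signature and temp-file naming are assumptions:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

// Crash-safe replace: readers see either the old or the new contents, never a mix.
fn write_atomic(path: &Path, contents: &[u8]) -> std::io::Result<()> {
    let dir = path.parent().unwrap_or_else(|| Path::new("."));
    let tmp = dir.join(format!(
        ".{}.tmp",
        path.file_name().and_then(|n| n.to_str()).unwrap_or("atomic")
    ));
    {
        let mut f = File::create(&tmp)?;
        f.write_all(contents)?;
        f.sync_all()?; // flush data before the rename makes it visible
    }
    fs::rename(&tmp, path)?; // atomic on POSIX within one filesystem
    // Best-effort fsync of the parent dir so the rename survives a crash
    // (directory opens fail on some platforms, e.g. Windows, hence best-effort).
    if let Ok(d) = File::open(dir) {
        let _ = d.sync_all();
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("lore-write-atomic-demo.json");
    write_atomic(&path, b"{\"ok\":true}")?;
    assert_eq!(fs::read(&path)?, b"{\"ok\":true}");
    fs::remove_file(&path)?;
    Ok(())
}
```

The temp file living in the same directory as the target matters: `rename` is only atomic when source and destination are on the same filesystem.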
Phase 3: Command handlers
- `src/cli/commands/service/doctor.rs` — pre-flight diagnostic checks (used by install and standalone)
- `src/cli/commands/service/install.rs` — install handler with transactional ordering (enable then manifest), wrapper script generation, doctor pre-flight, `service_id`
- `src/cli/commands/service/uninstall.rs` — uninstall handler with `--service`/`--all` selectors (removes manifest + env file + wrapper script)
- `src/cli/commands/service/list.rs` — list handler (scans data_dir for manifests, verifies platform state)
- `src/cli/commands/service/status.rs` — status handler with scheduler state including `degraded` and `half_open`
- `src/cli/commands/service/logs.rs` — logs handler with default tail output, `--open` for editor, `--follow`, log rotation check
- `src/cli/commands/service/resume.rs` — resume handler (clears paused + circuit breaker)
- `src/cli/commands/service/pause.rs` — pause handler (sets manual pause reason)
- `src/cli/commands/service/trigger.rs` — trigger handler (immediate run with optional backoff bypass)
- `src/cli/commands/service/repair.rs` — repair handler (backup corrupt files, reinitialize)
- `src/cli/commands/service/run.rs` — hidden scheduled execution entrypoint with stage-aware execution, circuit breaker, half-open probe, log rotation
- `src/cli/commands/service/mod.rs` — re-exports + `resolve_service_id` helper
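The `resolve_service_id` helper in `mod.rs` could follow a simple rule: an explicit `--service-id` always wins; otherwise a single installed service is unambiguous, and anything else is an error asking the user to disambiguate. A sketch under that assumed rule (signature is illustrative):

```rust
// Resolve which service a command targets, given the optional flag and the
// list of installed service IDs (from scanning manifests in data_dir).
fn resolve_service_id(flag: Option<&str>, installed: &[String]) -> Result<String, String> {
    if let Some(id) = flag {
        return Ok(id.to_string()); // explicit flag always wins
    }
    match installed {
        [only] => Ok(only.clone()),
        [] => Err("no services installed".to_string()),
        _ => Err(format!(
            "{} services installed; pass --service-id to disambiguate",
            installed.len()
        )),
    }
}

fn main() {
    assert_eq!(
        resolve_service_id(Some("abc123"), &[]),
        Ok("abc123".to_string())
    );
    assert_eq!(
        resolve_service_id(None, &["only".to_string()]),
        Ok("only".to_string())
    );
    assert!(resolve_service_id(None, &[]).is_err());
    assert!(resolve_service_id(None, &["a".to_string(), "b".to_string()]).is_err());
}
```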
Phase 4: CLI wiring
- `src/cli/mod.rs` — `ServiceCommand` in `Commands` enum (with all new subcommands and flags)
- `src/cli/commands/mod.rs` — `pub mod service;`
- `src/main.rs` — dispatch + pipeline lock in `handle_sync_cmd` + robot-docs manifest
- `src/cli/autocorrect.rs` — add service entry with all flags
Phase 5: Verification
cargo check --all-targets && cargo clippy --all-targets -- -D warnings && cargo test && cargo fmt --check
Verification Checklist
# Build and lint
cargo check --all-targets
cargo clippy --all-targets -- -D warnings
cargo fmt --check
# Run all tests
cargo test
# --- Doctor (run first to verify prerequisites) ---
cargo run --release -- service doctor
cargo run --release -- -J service doctor | jq '.data.overall' # should show "pass" or "warn"
cargo run --release -- -J service doctor --offline | jq .
cargo run --release -- -J service doctor --fix | jq '.data.checks[] | select(.status == "fixed")'
# --- Dry-run install (should write nothing) ---
cargo run --release -- -J service install --interval 15m --profile fast --dry-run | jq '.data.dry_run' # true
launchctl list | grep gitlore # should NOT be present
# --- Install (macOS) ---
cargo run --release -- service install --interval 15m --profile fast
launchctl list | grep gitlore
cargo run --release -- -J service status | jq '.data.service_id' # should show hash
cargo run --release -- service logs --tail 5
cargo run --release -- service uninstall
launchctl list | grep gitlore # should be gone
# Verify install with custom name
cargo run --release -- service install --interval 30m --name my-project
launchctl list | grep gitlore # should show com.gitlore.sync.my-project
cargo run --release -- -J service status | jq '.data.service_id' # "my-project"
cargo run --release -- service uninstall
# Verify install idempotency
cargo run --release -- -J service install --interval 30m
cargo run --release -- -J service install --interval 30m # should report no_change: true
cargo run --release -- -J service install --interval 15m # should report changes
cargo run --release -- service uninstall
# --- Service run (use `service trigger` for manual testing, or provide --service-id) ---
cargo run --release -- -J service install --interval 30m
SVC_ID=$(cargo run --release -- -J service status | jq -r '.data.service_id')
cargo run --release -- -J service trigger # preferred way to manually invoke a service run
cargo run --release -- -J service status | jq '.data.recent_runs' # should show the run
cargo run --release -- -J service status | jq '.data.last_sync.stage_results' # per-stage outcomes
# --- Stage-aware outcomes ---
# (Test degraded state by running with --profile full when Ollama is down)
# Embeddings should fail, but issues/MRs should succeed
cargo run --release -- -J service install --profile full
# Stop Ollama, then:
cargo run --release -- -J service run --service-id $SVC_ID | jq '.data.outcome' # "degraded"
cargo run --release -- -J service status | jq '.data.scheduler_state' # "degraded"
# --- Backoff (service run only, NOT manual sync) ---
# 1. Create a status file simulating failures
cat > ~/.local/share/lore/sync-status-a1b2c3d4e5f6.json << 'EOF'
{
"schema_version": 1,
"updated_at_iso": "2026-02-09T10:00:00Z",
"last_run": {"timestamp_iso":"2026-02-09T10:00:00Z","timestamp_ms":TIMESTAMP,"duration_seconds":1.0,"outcome":"failed","stage_results":[],"error_message":"test"},
"recent_runs": [],
"consecutive_failures": 3,
"next_retry_at_ms": FUTURE_MS,
"paused_reason": null,
"last_error_code": null,
"last_error_message": null,
"circuit_breaker_paused_at_ms": null
}
EOF
# Replace timestamps: sed -i '' "s/TIMESTAMP/$(date +%s)000/;s/FUTURE_MS/$(($(date +%s)*1000 + 3600000))/" ~/.local/share/lore/sync-status-a1b2c3d4e5f6.json
# 2. Service run should skip (backoff)
cargo run --release -- -J service run --service-id $SVC_ID | jq '.data.action' # "skipped"
# 3. Manual sync should NOT be affected
cargo run --release -- sync # should proceed normally
# --- Paused state (permanent error) ---
cat > ~/.local/share/lore/sync-status-a1b2c3d4e5f6.json << 'EOF'
{
"schema_version": 1,
"updated_at_iso": "2026-02-09T10:00:00Z",
"last_run": {"timestamp_iso":"2026-02-09T10:00:00Z","timestamp_ms":0,"duration_seconds":1.0,"outcome":"failed","stage_results":[],"error_message":"401 Unauthorized"},
"recent_runs": [],
"consecutive_failures": 1,
"next_retry_at_ms": null,
"paused_reason": "AUTH_FAILED: 401 Unauthorized",
"last_error_code": "AUTH_FAILED",
"last_error_message": "401 Unauthorized",
"circuit_breaker_paused_at_ms": null
}
EOF
# Service run should report paused
cargo run --release -- -J service run --service-id $SVC_ID | jq '.data.action' # "paused"
cargo run --release -- -J service status | jq '.data.paused_reason' # "AUTH_FAILED"
# Resume clears the state
cargo run --release -- -J service resume | jq . # clears circuit breaker
# --- Circuit breaker ---
cat > ~/.local/share/lore/sync-status-a1b2c3d4e5f6.json << 'EOF'
{
"schema_version": 1,
"updated_at_iso": "2026-02-09T10:00:00Z",
"last_run": {"timestamp_iso":"2026-02-09T10:00:00Z","timestamp_ms":0,"duration_seconds":1.0,"outcome":"failed","stage_results":[],"error_message":"connection refused"},
"recent_runs": [],
"consecutive_failures": 10,
"next_retry_at_ms": null,
"paused_reason": "CIRCUIT_BREAKER: 10 consecutive transient failures",
"last_error_code": "TRANSIENT",
"last_error_message": "connection refused",
"circuit_breaker_paused_at_ms": 1770609000000
}
EOF
cargo run --release -- -J service run --service-id $SVC_ID | jq '.data.action' # "paused"
cargo run --release -- -J service status | jq '.data.paused_reason' # "CIRCUIT_BREAKER"
cargo run --release -- -J service resume | jq . # clears circuit breaker
# --- Robot mode for all commands ---
cargo run --release -- -J service install --interval 30m | jq .
cargo run --release -- -J service list | jq .
cargo run --release -- -J service status | jq .
cargo run --release -- -J service logs --tail 10 | jq .
cargo run --release -- -J service doctor | jq .
cargo run --release -- -J service pause --reason "test" | jq .
cargo run --release -- -J service resume | jq .
cargo run --release -- -J service trigger | jq .
cargo run --release -- -J service repair | jq .
cargo run --release -- -J service uninstall | jq .
# --- New operational commands ---
cargo run --release -- -J service install --interval 30m
cargo run --release -- -J service pause --reason "maintenance"
cargo run --release -- -J service status | jq '.data.scheduler_state' # "paused"
cargo run --release -- -J service run --service-id $SVC_ID | jq '.data.action' # "paused"
cargo run --release -- -J service resume | jq .
cargo run --release -- -J service trigger | jq . # immediate sync
cargo run --release -- -J service list | jq '.data.services'
cargo run --release -- service uninstall
# --- Token env file security (macOS/Linux) ---
cargo run --release -- service install --interval 30m
ls -la ~/.local/share/lore/service-env-* # should show -rw------- permissions
# On macOS, verify wrapper script exists and token NOT in plist:
ls -la ~/.local/share/lore/service-run-* # should show -rwx------ permissions
grep -c GITLAB_TOKEN ~/Library/LaunchAgents/com.gitlore.sync.*.plist # should be 0 (env-file mode)
cargo run --release -- service uninstall
ls ~/.local/share/lore/service-env-* # should be gone (uninstall removes it)
ls ~/.local/share/lore/service-run-* # should be gone (uninstall removes wrapper)
# --- Manifest persistence ---
cargo run --release -- service install --interval 15m --profile full
cat ~/.local/share/lore/service-manifest-*.json | jq . # should show manifest with service_id
cargo run --release -- service uninstall
ls ~/.local/share/lore/service-manifest-* # should be gone
# --- Logs with tail/follow ---
cargo run --release -- service install --interval 30m
cargo run --release -- -J service run --service-id $SVC_ID # generate some log output
cargo run --release -- service logs --tail 20 # show last 20 lines
# cargo run --release -- service logs --follow # (interactive — Ctrl-C to stop)
# --- Uninstall cleanup ---
cargo run --release -- service install --interval 30m
cargo run --release -- -J service uninstall | jq '.data.removed_files'
# Verify status file and logs are kept
ls ~/.local/share/lore/sync-status-*.json # should exist
ls ~/.local/share/lore/logs/ # should exist
# --- Repair command ---
# Corrupt a status file to test repair
echo "{{{" > ~/.local/share/lore/sync-status-test.json
cargo run --release -- -J service repair | jq . # should backup and reinitialize
# --- Final cleanup ---
cargo run --release -- service uninstall 2>/dev/null
rm -f ~/.local/share/lore/sync-status-*.json
Rejected Recommendations
Recommendations from external reviewers that were considered and explicitly rejected. Kept here to prevent re-proposal.
- Unified `SyncOrchestrator` for manual and scheduled sync (feedback-4, rec 4) — rejected because manual and scheduled sync have fundamentally different policies (backoff/circuit-breaker vs. none). A shared orchestrator adds abstraction without clear benefit. The current approach (separate paths with a shared pipeline lock) is simpler, correct, and avoids coupling the manual path to service-layer concerns. The two paths share the sync pipeline implementation itself; only the policy wrapper differs.
- `auto` token strategy with secure-store (Keychain / libsecret / Credential Manager) as default (feedback-2 rec 2, feedback-4 rec 7) — rejected because adding platform-specific secure-store dependencies (`security-framework`, `libsecret`, `winapi`) is heavy for v1. The wrapper-script approach (already in the plan) keeps the token out of the plist safely on macOS. The plan notes secure-store as a future enhancement. The token validation fix (rejecting NUL/newline) from feedback-4 rec 7 was accepted separately.
- Store service state in SQLite instead of a JSON status file (feedback-1, rec 2) — rejected because the status file is intentionally independent of the database. This avoids coupling the service lifecycle to DB migrations, enables service operation when the DB is locked or corrupt, and keeps the service layer self-contained. The JSON file approach with atomic writes is adequate for single-writer status tracking.
- `write_seq` and `content_sha256` integrity fields in manifest/status files (feedback-4, rec 6 partial) — rejected because this is over-engineering for a status file written by a single process with atomic writes. The `service repair` command already handles corrupt files by backup+reinit. The `fsync(parent_dir)` improvement from rec 6 was accepted separately.
- Use the `nix` crate for safe UID access (feedback-4, rec 8 partial) — rejected as a mandatory dependency because `getuid()` is trivially safe (no pointers, no mutation) and adding `nix` for a single call is disproportionate. A single-line safe wrapper with `#[allow(unsafe_code)]` is sufficient. If `nix` is already a dependency for other reasons, using it is fine.
- Mandatory dual-lock acquisition with strict ordering for uninstall/run races (feedback-5, rec 2) — rejected because the existing plan already has an admin lock for destructive ops and a pipeline lock for runs. The race window (scheduler fires during uninstall) is tiny, the consequence is benign (the service runs, finds no manifest, exits 0), and mandatory lock ordering with dual acquisition adds significant complexity. The plan's existing separation (admin lock for state mutations, pipeline lock for data writes) is sufficient.
- Decoupled optional stage cadence from the core sync interval (feedback-5, rec 4) — rejected because separate freshness windows per stage (e.g., "docs every 60m, embeddings every 6h") add significant complexity: new config fields per stage, per-stage last-success tracking, skip logic, and confusing profile semantics. The existing profile system already solves this more simply: use `fast` for frequent intervals (issues + MRs only), `balanced` or `full` for less frequent intervals that include heavier stages.
- Windows env-file parity via wrapper script (feedback-5, rec 5) — rejected because Windows Task Scheduler has fundamentally different environment handling than launchd/systemd. A wrapper `.cmd` or `.ps1` script introduces fragility (quoting, encoding, UAC edge cases, PowerShell execution policy) for marginal benefit. The current `system_env` approach is honest, works reliably, and Windows users are accustomed to system environment variables. Future Credential Manager integration (already noted as deferred) is the right long-term solution.
- `--regenerate` flag on `service repair` (feedback-5, rec 7 partial) — rejected because `lore service install` is already idempotent (it detects an existing manifest and overwrites it if the config differs). Regenerating scheduler artifacts is exactly what a re-install does. Adding `--regenerate` to repair creates a confusing second path to the same outcome. The `spec_hash` drift detection (accepted from this rec) gives users clear diagnostics; the remedy is simply `lore service install`.