Three implementation plans with iterative cross-model refinement:

- lore-service (5 iterations): HTTP service layer exposing lore's SQLite data via REST/SSE for integration with external tools (dashboards, IDE extensions, chat agents). Covers authentication, rate limiting, caching strategy, and webhook-driven sync triggers.
- work-item-status-graphql (7 iterations + TDD appendix): Detailed implementation plan for the GraphQL-based work item status enrichment feature (now implemented). Includes the TDD appendix with test-first development specifications covering GraphQL client, adaptive pagination, ingestion orchestration, CLI display, and robot mode output.
- time-decay-expert-scoring (iteration 5 feedback): Updates to the existing time-decay scoring plan incorporating feedback on decay curve parameterization, recency weighting for discussion contributions, and staleness detection thresholds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| plan | title | status | iteration | target_iterations | beads_revision | related_plans | created | updated |
|---|---|---|---|---|---|---|---|---|
| true | | iterating | 5 | 8 | 0 | | 2026-02-09 | 2026-02-11 |
Plan: lore service — OS-Native Scheduled Sync
Context
lore sync runs a 4-stage pipeline (issues, MRs, docs, embeddings) that takes 2-4 minutes. Today it must be invoked manually. We want lore service install to set up OS-native scheduled execution automatically, with exponential backoff on failures, a circuit breaker for persistent transient errors, stage-aware outcome tracking, and a status file for observability. This is the first nested subcommand in the project.
Key Design Principles
1. Separation of Manual and Scheduled Execution
lore sync remains the manual/operator command. It is never subject to backoff, pausing, or service-level policy. A separate hidden entrypoint — lore service run --service-id <id> — is what the OS scheduler actually invokes. This entrypoint applies service-specific policy (backoff, error classification, pipeline locking) before delegating to the sync pipeline. This separation ensures that a human running lore sync to debug or recover is never unexpectedly blocked by service state. The --service-id parameter ensures unambiguous manifest/status file selection when multiple services are installed.
2. Project-Scoped Service Identity
Each installed service gets a unique service_id derived from a canonical identity tuple: the workspace root, config file path, and sorted GitLab project URLs. This composite fingerprint prevents collisions even when multiple workspaces share a single global config file — the identity represents what is being synced and where, not just the config location. The hash uses 12 hex characters (48 bits) for collision safety. An optional --name flag allows explicit naming for human readability; if --name collides with an existing service that has a different identity hash (different workspace/config/projects), install fails with an actionable error listing the conflict.
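The identity derivation above can be sketched as follows. The function names and the canonical joining format are illustrative assumptions; only the inputs (workspace root, config path, sorted project URLs), the `[a-z0-9-]` charset for `--name`, and the 12-hex-char SHA-256 truncation come from the plan.

```rust
/// Build the canonical identity string that gets hashed (SHA-256, first 12
/// hex chars) into the service_id. Sorting the URLs makes the tuple stable
/// regardless of config ordering.
fn canonical_identity(workspace_root: &str, config_path: &str, mut project_urls: Vec<String>) -> String {
    project_urls.sort();
    format!("{workspace_root}\n{config_path}\n{}", project_urls.join("\n"))
}

/// Sanitize an explicit --name into the [a-z0-9-] charset.
fn sanitize_name(raw: &str) -> String {
    let mut out = String::new();
    for c in raw.chars() {
        let c = c.to_ascii_lowercase();
        if c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-' {
            out.push(c);
        } else if !out.is_empty() && !out.ends_with('-') {
            out.push('-'); // collapse runs of disallowed chars into one dash
        }
    }
    out.trim_end_matches('-').to_string()
}
```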
3. Stage-Aware Outcome Tracking
The sync pipeline has stages of differing criticality. Issues and MRs are core — their failure constitutes a hard failure. Docs and embeddings are optional — their failure produces a degraded outcome but does not trigger backoff or pause. This ensures data freshness for the most important entities even when peripheral stages have transient problems.
4. Resilient Failure Handling
Errors are classified as transient (retry with backoff) or permanent (pause until user intervention). A circuit breaker trips after a configurable number of consecutive transient failures (default: 10), transitioning to a half_open probe state after a cooldown period (default: 30 minutes). In half_open, one trial run is allowed — if it succeeds, the breaker closes automatically; if it fails, the breaker returns to paused state requiring manual lore service resume. This provides self-healing for systemic but recoverable failures (DNS outages, temporary GitLab maintenance) while still halting on truly persistent problems.
5. Transactional Install
The install process is two-phase: service files are generated and the platform-specific enable command is run first. Only on success is the install manifest written atomically. If the enable command fails, generated files are cleaned up and no manifest is persisted. This prevents a false "installed" state when the scheduler rejects the service configuration.
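The atomic manifest write sequence (tmp file + fsync(file) + rename + fsync(parent dir)) can be sketched with std only; the function name is illustrative:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

/// Sketch of the atomic manifest write described in the plan.
fn write_manifest_atomic(path: &Path, contents: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut f = File::create(&tmp)?;
    f.write_all(contents)?;
    f.sync_all()?; // fsync(file): contents durable before the rename
    fs::rename(&tmp, path)?; // atomic replace on the same filesystem
    if let Some(parent) = path.parent() {
        // fsync(parent dir) so the directory entry survives a crash.
        // Opening a directory read-only works on Unix; ignore failure elsewhere.
        if let Ok(dir) = File::open(parent) {
            let _ = dir.sync_all();
        }
    }
    Ok(())
}
```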
6. Serialized Admin Mutations
All commands that mutate service state (install, uninstall, pause, resume, repair) acquire an admin-level lock — AppLock("service-admin-{service_id}") — before reading or writing manifest/status files. This prevents races between concurrent admin commands (e.g., a user running service pause while an automated tool runs service resume). The admin lock is separate from the sync_pipeline lock, which guards the data pipeline. Legal state transitions:
- `idle -> running`
- `running -> success | degraded | backoff | paused`
- `backoff -> running | paused`
- `paused -> half_open | running` (via `resume`)
- `half_open -> running | paused`
Any transition not in this table is rejected with ServiceCorruptState. The service run entrypoint does NOT acquire the admin lock — it only acquires the sync_pipeline lock to avoid overlapping data writes.
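The transition table can be captured as a total check. The enum below is a sketch; the real implementation would return `ServiceCorruptState` on violation rather than a bool:

```rust
/// Scheduler states from the plan's legal-transition table.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum State { Idle, Running, Success, Degraded, Backoff, Paused, HalfOpen }

/// True iff (from, to) appears in the legal transition table.
fn transition_allowed(from: State, to: State) -> bool {
    use State::*;
    matches!(
        (from, to),
        (Idle, Running)
            | (Running, Success)
            | (Running, Degraded)
            | (Running, Backoff)
            | (Running, Paused)
            | (Backoff, Running)
            | (Backoff, Paused)
            | (Paused, HalfOpen)
            | (Paused, Running) // only via `lore service resume`
            | (HalfOpen, Running)
            | (HalfOpen, Paused)
    )
}
```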
Commands & User Journeys
lore service install [--interval 30m] [--profile balanced] [--token-source env-file] [--name <optional>] [--dry-run]
What it does: Generates and installs an OS-native scheduled task that runs lore --robot service run --service-id <service_id> at the specified interval, with the chosen sync profile, token storage strategy, and a project-scoped identity to avoid collisions across workspaces.
User journey:
- User runs `lore service install --interval 15m --profile fast`
- CLI loads config to read `gitlab.tokenEnvVar` (default: `GITLAB_TOKEN`)
- CLI resolves the token value from the current environment
- CLI computes or reads `service_id`:
  - If `--name` is provided, use it (sanitized to `[a-z0-9-]`)
  - Otherwise, derive from a composite fingerprint of (workspace root + config path + sorted project URLs) — first 12 hex chars of SHA-256
  - This becomes the suffix for all platform-specific identifiers (launchd label, systemd unit name, Windows task name)
- CLI resolves its own binary path via `std::env::current_exe()?.canonicalize()?`
- CLI writes the token to a user-private env file (`{data_dir}/service-env-{service_id}`, mode 0600) unless `--token-source embedded` is explicitly passed
- CLI generates the platform-specific service files (referencing `lore --robot service run --service-id <service_id>`, NOT `lore sync`)
- CLI writes service files to disk
- CLI runs the platform-specific enable command
- On success: CLI writes the install manifest atomically (tmp file + fsync(file) + rename + fsync(parent dir)) to `{data_dir}/service-manifest-{service_id}.json`
- On failure: CLI removes generated service files, env file, wrapper script, and temp manifest — returns `ServiceCommandFailed` with stderr context
- CLI outputs success with details of what was installed
Sync profiles:
| Profile | Sync flags | Use case |
|---|---|---|
| `fast` | `--no-docs --no-embed` | Minimal: issues + MRs only |
| `balanced` (default) | `--no-embed` | Issues + MRs + doc generation |
| `full` | (none) | Full pipeline including embeddings |
The profile determines what flags are passed to the underlying sync command. The scheduler invocation is always lore --robot service run --service-id <service_id>, which reads the profile from the install manifest and constructs the appropriate sync flags.
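The profile-to-flags mapping can be sketched as a small lookup; `profile_flags` is an illustrative name, not the real function:

```rust
/// Map a profile name from the install manifest to the sync flags it implies.
/// Returns None for unknown profiles (the real CLI would reject these at
/// install time).
fn profile_flags(profile: &str) -> Option<&'static [&'static str]> {
    match profile {
        "fast" => Some(&["--no-docs", "--no-embed"]),
        "balanced" => Some(&["--no-embed"]),
        "full" => Some(&[]), // full pipeline: no restricting flags
        _ => None,
    }
}
```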
Token storage strategies:
| Strategy | Behavior | Security | Platforms |
|---|---|---|---|
| `env-file` (default) | Token written to `{data_dir}/service-env-{service_id}` with 0600 permissions. On Linux/systemd, referenced via `EnvironmentFile=` (true file-based loading). On macOS/launchd, a wrapper shell script (mode 0700) sources the env file at runtime and execs `lore` — the token never appears in the plist. | Token file only readable by owner. Canonical source is the env file; `lore service install` re-reads it on regeneration. | macOS, Linux |
| `embedded` | Token embedded directly in service file. Requires explicit `--token-source embedded` flag. CLI prints a security warning. | Less secure: token visible in plist/unit file. | macOS, Linux |
On Windows, neither strategy applies — the token must be in the user's system environment (set via setx or system settings). token_source is reported as "system_env". This is documented as a requirement in lore service install output on Windows.
Note on macOS wrapper script approach: launchd cannot natively load environment files. Rather than embedding the token directly in the plist (which would persist it in a readable XML file), we generate a small wrapper shell script (`{data_dir}/service-run-{service_id}.sh`, mode 0700) that sources the env file and execs `lore`. The plist's `ProgramArguments` points to the wrapper script, keeping the token out of the plist entirely. On Linux/systemd, `EnvironmentFile=` provides native file-based loading without any wrapper needed.

Future enhancement: On macOS, Keychain integration could eliminate the env file entirely. On Windows, Credential Manager could replace the system environment requirement. These are deferred to a future iteration to avoid adding platform-specific secure store dependencies (`security-framework`, `winapi`) in v1.
Acceptance criteria:
- Parses interval strings: `5m`, `15m`, `30m`, `1h`, `2h`, `12h`, `24h`
- Rejects intervals < 5 minutes or > 24 hours
- Rejects non-numeric or malformed intervals with clear error messages
- Computes `service_id` from composite fingerprint (workspace root + config path + project URLs) or `--name` flag; sanitizes to `[a-z0-9-]`. If `--name` collides with an existing service with a different identity hash, returns an actionable error.
- If already installed (manifest exists for this `service_id`): reads existing manifest. If config matches, reports `no_change: true`. If config differs, overwrites and reports what changed.
- If `GITLAB_TOKEN` (or configured env var) is not set, fails with `TokenNotSet` error
- If `current_exe()` fails, returns `ServiceError`
- Creates parent directories for service files if they don't exist
- Writes install manifest atomically (tmp file + fsync(file) + rename + fsync(parent dir)) alongside service files
- Runs `service doctor` checks as a pre-flight: validates scheduler prerequisites (e.g., systemd user manager/linger on Linux, GUI session context on macOS) and surfaces warnings or errors before installing
- `--dry-run`: validates config/token/prereqs, renders service files and planned commands, but writes nothing and executes nothing. Robot output includes `"dry_run": true` and the rendered service file content for inspection.
- Robot mode outputs `{"ok":true,"data":{...},"meta":{"elapsed_ms":N}}`
- Human mode outputs a clear summary with file paths and next steps
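The interval-parsing rules above can be enforced with a small parser. This is a sketch (the function name and error strings are illustrative); only the `<n>m`/`<n>h` forms and the 5-minute/24-hour bounds come from the acceptance criteria:

```rust
/// Parse interval strings like "15m" or "2h" into seconds, enforcing the
/// 5-minute..24-hour bounds from the acceptance criteria.
fn parse_interval(s: &str) -> Result<u64, String> {
    let s = s.trim();
    // Split off the trailing unit character.
    let (num, unit) = s.split_at(s.len().saturating_sub(1));
    let mult = match unit {
        "m" => 60,
        "h" => 3600,
        _ => return Err(format!("malformed interval '{s}': expected <n>m or <n>h")),
    };
    let n: u64 = num.parse().map_err(|_| format!("malformed interval '{s}'"))?;
    let secs = n.checked_mul(mult).ok_or_else(|| format!("malformed interval '{s}'"))?;
    if secs < 300 {
        return Err("interval must be at least 5 minutes".to_string());
    }
    if secs > 86_400 {
        return Err("interval must be at most 24 hours".to_string());
    }
    Ok(secs)
}
```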
Robot output:
{
"ok": true,
"data": {
"platform": "launchd",
"service_id": "a1b2c3d4e5f6",
"interval_seconds": 900,
"profile": "fast",
"binary_path": "/usr/local/bin/lore",
"config_path": null,
"service_files": ["/Users/x/Library/LaunchAgents/com.gitlore.sync.a1b2c3d4e5f6.plist"],
"sync_command": "/usr/local/bin/lore --robot service run --service-id a1b2c3d4e5f6",
"token_env_var": "GITLAB_TOKEN",
"token_source": "env_file",
"no_change": false
},
"meta": { "elapsed_ms": 42 }
}
Human output:
Service installed:
Platform: launchd
Service ID: a1b2c3d4e5f6
Interval: 15m (900s)
Profile: fast (--no-docs --no-embed)
Binary: /usr/local/bin/lore
Service: ~/Library/LaunchAgents/com.gitlore.sync.a1b2c3d4e5f6.plist
Command: lore --robot service run --service-id a1b2c3d4e5f6
Token: stored in ~/.local/share/lore/service-env-a1b2c3d4e5f6 (0600)
To rotate your token: lore service install
lore service list
What it does: Lists all installed services discovered from {data_dir}/service-manifest-*.json files. Useful when managing multiple gitlore workspaces to see all active installations at a glance.
User journey:
- User runs `lore service list`
- CLI scans `{data_dir}` for files matching `service-manifest-*.json`
- Reads each manifest and verifies platform state
- Outputs summary of all installed services
Acceptance criteria:
- Returns empty list (not error) when no services installed
- Shows `service_id`, `platform`, `interval`, `profile`, `installed_at_iso` for each
- Verifies platform state matches manifest (flags drift)
- Robot and human output modes
Robot output:
{
"ok": true,
"data": {
"services": [
{
"service_id": "a1b2c3d4e5f6",
"platform": "launchd",
"interval_seconds": 900,
"profile": "fast",
"installed_at_iso": "2026-02-09T10:00:00Z",
"platform_state": "loaded",
"drift": false
}
]
},
"meta": { "elapsed_ms": 15 }
}
Human output:
Installed services:
a1b2c3d4e5f6 launchd 15m fast installed 2026-02-09 loaded
Or when none installed:
No services installed. Run: lore service install
lore service uninstall [--service <service_id|name>] [--all]
What it does: Disables and removes the scheduled task, its manifest, and its token env file.
User journey:
- User runs `lore service uninstall`
- CLI resolves target service: uses `--service` if provided, otherwise derives `service_id` from current project config. If multiple manifests exist and no selector is provided, returns an actionable error listing available services with `lore service list`.
- If manifest doesn't exist, checks platform directly; if not installed, exits cleanly with informational message (exit 0, not an error)
- Runs platform-specific disable command
- Removes service files from disk
- Removes install manifest (`service-manifest-{service_id}.json`)
- Removes token env file (`service-env-{service_id}`) if it exists
- Does NOT remove the status file or log files (those are operational data, not config)
- Outputs confirmation
Acceptance criteria:
- Idempotent: running when not installed is not an error
- Removes ALL service files (timer + service on systemd), the install manifest, and the token env file
- Does NOT remove the status file or log files (those are data, not config)
- If platform disable command fails (e.g., service was already unloaded), still removes files and succeeds
- Robot and human output modes
Robot output:
{
"ok": true,
"data": {
"was_installed": true,
"service_id": "a1b2c3d4e5f6",
"platform": "launchd",
"removed_files": [
"/Users/x/Library/LaunchAgents/com.gitlore.sync.a1b2c3d4e5f6.plist",
"/Users/x/.local/share/lore/service-manifest-a1b2c3d4e5f6.json",
"/Users/x/.local/share/lore/service-env-a1b2c3d4e5f6"
]
},
"meta": { "elapsed_ms": 15 }
}
Human output:
Service uninstalled (a1b2c3d4e5f6):
Removed: ~/Library/LaunchAgents/com.gitlore.sync.a1b2c3d4e5f6.plist
Removed: ~/.local/share/lore/service-manifest-a1b2c3d4e5f6.json
Removed: ~/.local/share/lore/service-env-a1b2c3d4e5f6
Kept: ~/.local/share/lore/sync-status-a1b2c3d4e5f6.json (run history)
Kept: ~/.local/share/lore/logs/ (service logs)
Or if not installed:
Service is not installed. Nothing to do.
lore service status [--service <service_id|name>]
What it does: Shows install state, scheduler state (running/backoff/paused/half_open/idle), last sync result, recent run history, and next run estimate. Resolves target service via --service flag or current-project-derived default.
User journey:
- User runs `lore service status`
- CLI resolves target service: uses `--service` if provided, otherwise derives `service_id` from current project config. If multiple manifests exist and no selector is provided, returns an actionable error listing available services with `lore service list`.
- CLI reads install manifest from `{data_dir}/service-manifest-{service_id}.json`
- If installed, verifies platform state matches manifest (detects drift)
- Reads `{data_dir}/sync-status-{service_id}.json` for last sync and recent run history
- Queries platform for service state and next run time
- Computes scheduler state from status file + backoff logic
- Outputs combined status
Scheduler states:
- `idle` — installed but no runs yet
- `running` — currently executing (`sync_pipeline` lock held, `current_run` metadata present with recent `started_at_ms`)
- `running_stale` — `current_run` metadata exists but the process (by PID) is no longer alive, or `started_at_ms` is older than 30 minutes. Indicates a crashed or killed previous run. `lore service status` reports this with the stale run's start time and PID for diagnostics.
- `degraded` — last run completed but one or more optional stages failed (docs/embeddings). Core data (issues/MRs) is fresh.
- `backoff` — transient failures, waiting to retry
- `half_open` — circuit breaker cooldown expired; one probe run is allowed. If it succeeds, the breaker closes automatically and state returns to normal. If it fails, state transitions to `paused`.
- `paused` — permanent error detected (bad token, config error) OR circuit breaker tripped and probe failed. Requires user intervention via `lore service resume`.
- `not_installed` — service not installed
Acceptance criteria:
- Works even if service is not installed (shows `installed: false`, `scheduler_state: "not_installed"`)
- Works even if status file doesn't exist (shows `last_sync: null`)
- Shows backoff state with remaining time if in backoff
- Shows paused reason if in paused state
- Includes recent runs summary (last 5 runs)
- Shows next scheduled run if determinable from platform
- Detects drift at multiple levels:
  - Platform drift: loaded/unloaded mismatch between manifest and OS scheduler
  - Spec drift: SHA-256 hash of service file content on disk doesn't match `spec_hash` in manifest (detects manual edits to plist/unit files)
  - Command drift: sync command in service file differs from manifest's `sync_command`
- Exit code 0 always (status is informational)
Robot output:
{
"ok": true,
"data": {
"installed": true,
"service_id": "a1b2c3d4e5f6",
"platform": "launchd",
"interval_seconds": 1800,
"profile": "balanced",
"service_state": "loaded",
"scheduler_state": "running",
"last_sync": {
"timestamp_iso": "2026-02-09T10:30:00.000Z",
"duration_seconds": 12.5,
"outcome": "success",
"stage_results": [
{ "stage": "issues", "success": true, "items_updated": 5 },
{ "stage": "mrs", "success": true, "items_updated": 3 },
{ "stage": "docs", "success": true, "items_updated": 12 }
],
"consecutive_failures": 0
},
"recent_runs": [
{ "timestamp_iso": "2026-02-09T10:30:00Z", "outcome": "success", "duration_seconds": 12.5 },
{ "timestamp_iso": "2026-02-09T10:00:00Z", "outcome": "success", "duration_seconds": 11.8 }
],
"backoff": null,
"paused_reason": null,
"drift": {
"platform_drift": false,
"spec_drift": false,
"command_drift": false
}
},
"meta": { "elapsed_ms": 15 }
}
When degraded (optional stages failed):
"scheduler_state": "degraded",
"last_sync": {
"outcome": "degraded",
"stage_results": [
{ "stage": "issues", "success": true, "items_updated": 5 },
{ "stage": "mrs", "success": true, "items_updated": 3 },
{ "stage": "docs", "success": false, "error": "I/O error writing documents" }
]
}
When in backoff:
"scheduler_state": "backoff",
"backoff": {
"consecutive_failures": 3,
"next_retry_iso": "2026-02-09T14:30:00.000Z",
"remaining_seconds": 7200
}
When paused (permanent error):
"scheduler_state": "paused",
"paused_reason": "AUTH_FAILED: GitLab returned 401 Unauthorized. Run: lore service resume"
When paused (circuit breaker):
"scheduler_state": "paused",
"paused_reason": "CIRCUIT_BREAKER: 10 consecutive transient failures (last: NetworkError). Run: lore service resume"
When in half-open (circuit breaker cooldown expired, probe pending):
"scheduler_state": "half_open",
"backoff": {
"consecutive_failures": 10,
"circuit_breaker_cooldown_expired": true,
"message": "Circuit breaker cooldown expired. Next run will be a probe attempt."
}
Human output:
Service status (a1b2c3d4e5f6):
Installed: yes
Platform: launchd
Interval: 30m (1800s)
Profile: balanced
State: loaded
Scheduler: running
Last sync:
Time: 2026-02-09 10:30:00 UTC
Duration: 12.5s
Outcome: success
Stages: issues (5), mrs (3), docs (12)
Failures: 0 consecutive
Recent runs (last 5):
10:30 UTC success 12.5s
10:00 UTC success 11.8s
When degraded:
Scheduler: DEGRADED
Core stages OK: issues (5), mrs (3)
Failed stages: docs (I/O error writing documents)
Core data is fresh. Optional stages will retry next run.
When paused (permanent error):
Scheduler: PAUSED - AUTH_FAILED
GitLab returned 401 Unauthorized
Fix: rotate token, then run: lore service resume
When paused (circuit breaker):
Scheduler: PAUSED - CIRCUIT_BREAKER
10 consecutive transient failures (last: NetworkError)
Fix: check network/GitLab availability, then run: lore service resume
When half-open (circuit breaker cooldown expired):
Scheduler: HALF_OPEN
Circuit breaker cooldown expired. Next run will probe.
If probe succeeds, scheduler returns to normal.
lore service logs [--tail <n>] [--follow] [--open] [--service <service_id|name>]
What it does: Displays or streams the service log file. By default, prints the last 100 lines to stdout. With --tail <n>, shows the last N lines. With --follow, streams new lines as they arrive (like tail -f). With --open, opens in the user's preferred editor.
User journey (default):
- User runs `lore service logs`
- CLI determines log path: `{data_dir}/logs/service-{service_id}-stderr.log`
- CLI checks if file exists; if not, outputs "No log file found yet" with the expected path
- Prints last 100 lines to stdout

User journey (--open):
- User runs `lore service logs --open`
- CLI determines editor: `$VISUAL` -> `$EDITOR` -> `less` (Unix) / `notepad` (Windows)
- Spawns editor as child process, waits for exit
- Exits with editor's exit code

User journey (--tail / --follow):
- User runs `lore service logs --tail 50` or `lore service logs --follow`
- CLI reads the last N lines or streams with follow
- Outputs directly to stdout
Log rotation: Rotate service-{service_id}-stdout.log and service-{service_id}-stderr.log at 10 MB, keeping 5 rotated files. Rotation is checked at 10 MB, not at every write. This avoids creating many small files and prevents log file explosion.
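The rotation scheme can be sketched as a size-triggered shift. This is an illustrative sketch: the function names are assumptions, and `max_bytes` is a parameter here (rather than the plan's fixed 10 MB) purely for testability:

```rust
use std::fs;
use std::path::{Path, PathBuf};

const KEEP: usize = 5; // number of rotated files to keep (from the plan)

/// "service-x-stderr.log" -> "service-x-stderr.log.3", etc.
fn rotated(log: &Path, i: usize) -> PathBuf {
    let mut s = log.as_os_str().to_os_string();
    s.push(format!(".{i}"));
    PathBuf::from(s)
}

/// When the active log reaches `max_bytes` (10 MB in the plan), shift
/// log -> log.1 -> ... -> log.5, dropping the oldest rotated file.
fn rotate_if_needed(log: &Path, max_bytes: u64) -> std::io::Result<()> {
    let size = fs::metadata(log).map(|m| m.len()).unwrap_or(0);
    if size < max_bytes {
        return Ok(()); // below threshold: nothing to do
    }
    let _ = fs::remove_file(rotated(log, KEEP)); // drop the oldest, if any
    for i in (1..KEEP).rev() {
        let _ = fs::rename(rotated(log, i), rotated(log, i + 1));
    }
    fs::rename(log, rotated(log, 1)) // a fresh log starts on the next write
}
```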
Acceptance criteria:
- Default (no flags): prints last 100 lines to stdout
- `--open`: falls back through `VISUAL` -> `EDITOR` -> `less` -> `notepad`. If no editor and no `less` available, returns `ServiceError` with suggestion.
- `--tail <n>` shows last N lines (default 100 if no value), exits immediately
- `--follow` streams new log lines until Ctrl-C (like `tail -f`); mutually exclusive with `--open`
- `--tail` and `--follow` can be combined: show last N lines then follow
- In robot mode, outputs the log file path and optionally last N lines as JSON (never opens editor)
Robot output (does not open editor):
{
"ok": true,
"data": {
"log_path": "/Users/x/.local/share/lore/logs/service-a1b2c3d4e5f6-stderr.log",
"exists": true,
"size_bytes": 4096,
"last_lines": ["2026-02-09T10:30:00Z sync completed in 12.5s", "..."]
},
"meta": { "elapsed_ms": 1 }
}
The last_lines field is included when --tail is specified in robot mode (capped at 100 lines to avoid bloated JSON). Without --tail, only path metadata is returned. --follow is not supported in robot mode (returns error: "follow mode requires interactive terminal").
lore service doctor
What it does: Validates that the service environment is healthy: scheduler prerequisites, token validity, file permissions, config accessibility, and platform-specific readiness.
User journey:
- User runs `lore service doctor` (or it runs automatically as a pre-flight during `service install`)
- CLI runs a series of diagnostic checks and reports pass/warn/fail for each
Diagnostic checks:
- Config accessible — can load and parse `config.json`
- Token present — configured env var is set and non-empty
- Token valid — quick auth test against GitLab API (optional, skipped with `--offline`)
- Binary path — `current_exe()` resolves and is executable
- Data directory — writable by current user
- Platform prerequisites:
  - macOS: running in a GUI login session (launchd bootstrap domain is `gui/{uid}`, not `system`)
  - Linux: `systemctl --user` is available; user manager is running; `loginctl enable-linger` is active (required for timers to fire when the user is not logged in)
  - Windows: `schtasks` is available
- Existing install — if manifest exists, verify platform state matches (drift detection)
Acceptance criteria:
- Each check reports: `pass`, `warn`, or `fail`
- Warnings are non-blocking (e.g., linger not enabled — timer works when logged in but not on reboot)
- Failures are blocking for `service install` (install aborts with actionable message)
- `--offline` skips network checks (token validation)
- `--fix` attempts safe, non-destructive remediations for fixable issues: create missing directories, correct file permissions on env/wrapper files (0600/0700), run `systemctl --user daemon-reload` when applicable. Reports each applied fix in the output. Does NOT attempt fixes that could cause data loss.
- Exit code: 0 if all pass/warn, non-zero if any fail
Robot output:
{
"ok": true,
"data": {
"checks": [
{ "name": "config", "status": "pass" },
{ "name": "token_present", "status": "pass" },
{ "name": "token_valid", "status": "pass" },
{ "name": "binary_path", "status": "pass" },
{ "name": "data_directory", "status": "pass" },
{ "name": "platform_prerequisites", "status": "warn", "message": "loginctl linger not enabled; timer will not fire on reboot without active session", "action": "loginctl enable-linger $(whoami)" },
{ "name": "install_state", "status": "pass" }
],
"overall": "warn"
},
"meta": { "elapsed_ms": 850 }
}
Human output:
Service doctor:
[PASS] Config loaded from ~/.config/lore/config.json
[PASS] GITLAB_TOKEN is set
[PASS] GitLab authentication successful
[PASS] Binary: /usr/local/bin/lore
[PASS] Data dir: ~/.local/share/lore/ (writable)
[WARN] loginctl linger not enabled
Timer will not fire on reboot without active session
Fix: loginctl enable-linger $(whoami)
[PASS] No existing install detected
Overall: WARN (1 warning)
lore service run (hidden/internal)
What it does: Executes one scheduled sync attempt with full service-level policy. This is the command the OS scheduler actually invokes — users should never need to call it directly.
Invocation by scheduler: lore --robot service run --service-id <service_id>
Execution flow:
1. Read install manifest for the given `service_id` to determine profile, interval, and circuit breaker config
2. Read status file (service-scoped)
3. If paused (not `half_open`): check if circuit breaker cooldown has expired. If cooldown expired, transition to `half_open` and allow probe (continue to step 5). If cooldown still active or paused for permanent error, log reason, write status, exit 0.
4. If in backoff window: log skip reason, write status, exit 0
5. Acquire `sync_pipeline` AppLock (prevents overlap with manual sync or another scheduled run)
6. If lock acquisition fails (another sync running): log, exit 0
7. Execute sync pipeline with flags derived from profile
8. On success: reset `consecutive_failures` to 0, write status, release lock
9. On transient failure: increment `consecutive_failures`, compute next backoff, write status, release lock
10. On permanent failure: set `paused_reason`, write status, release lock
Stage-aware execution:
The sync pipeline is executed stage-by-stage, with each stage's outcome recorded independently:
| Stage | Criticality | Failure behavior | In-run retry |
|---|---|---|---|
| `issues` | core | Hard failure — triggers backoff/pause | 1 retry on transient errors (1-5s jittered delay) |
| `mrs` | core | Hard failure — triggers backoff/pause | 1 retry on transient errors (1-5s jittered delay) |
| `docs` | optional | Degraded outcome — logged but does not trigger backoff | No retry (best-effort) |
| `embeddings` | optional | Degraded outcome — logged but does not trigger backoff | No retry (best-effort) |
In-run retries for core stages: Before counting a core stage failure toward backoff/circuit-breaker, the service runner retries the stage once with a jittered delay of 1-5 seconds. This absorbs transient network blips (DNS hiccups, momentary 5xx responses) without extending run duration significantly. Only transient errors are retried — permanent errors (bad token, config errors) are never retried. If the retry succeeds, the stage is recorded as successful. If both attempts fail, the final error is used for classification. This significantly reduces false backoff triggers from brief network interruptions.
If all core stages succeed (potentially after retry) but optional stages fail, the run outcome is "degraded" — consecutive failures are NOT incremented, and the scheduler state reflects degraded rather than backoff. This ensures data freshness for the most important entities even when peripheral stages have transient problems.
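The retry-once policy for core stages can be sketched as follows. The retry-once rule and the 1-5s jittered delay come from the plan; the clock-based jitter source is an illustrative stand-in for a proper RNG, and the function names are assumptions:

```rust
use std::thread;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// 1-5s jittered delay (clock-derived jitter is a stand-in for a real RNG).
fn jitter_delay() -> Duration {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .subsec_nanos() as u64;
    Duration::from_millis(1_000 + nanos % 4_001) // 1000..=5000 ms
}

/// One extra attempt on transient errors; none on permanent ones.
fn run_core_stage_with_retry<E>(
    stage: &mut dyn FnMut() -> Result<(), E>,
    is_transient: fn(&E) -> bool,
) -> Result<(), E> {
    match stage() {
        Ok(()) => Ok(()),
        Err(e) if is_transient(&e) => {
            thread::sleep(jitter_delay()); // absorb brief network blips
            stage() // second and final attempt; its error drives classification
        }
        Err(e) => Err(e), // permanent: never retried
    }
}
```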
Transient vs permanent error classification:
| Error type | Classification | Examples |
|---|---|---|
| Transient | Retry with backoff | Network timeout, DB locked, 5xx from GitLab |
| Transient (hinted) | Respect server retry hint | Rate limited with Retry-After or X-RateLimit-Reset header |
| Permanent | Pause until user action | 401 Unauthorized (bad token), config not found, config invalid, migration failed |
The classification is determined by the ErrorCode of the underlying LoreError:
- Permanent: `TokenNotSet`, `AuthFailed`, `ConfigNotFound`, `ConfigInvalid`, `MigrationFailed`
- Transient: everything else (`NetworkError`, `RateLimited`, `DbLocked`, `DbError`, `InternalError`, etc.)
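The classification rule reduces to a single match on the error code. Matching on strings here is a sketch; the real code would match on the `LoreError`/`ErrorCode` enum directly:

```rust
/// Transient/permanent split keyed on the ErrorCode names listed above.
#[derive(Debug, PartialEq, Eq)]
enum ErrorClass {
    Permanent, // pause until user intervention
    Transient, // retry with backoff
}

fn classify(error_code: &str) -> ErrorClass {
    match error_code {
        "TokenNotSet" | "AuthFailed" | "ConfigNotFound" | "ConfigInvalid" | "MigrationFailed" => {
            ErrorClass::Permanent
        }
        // NetworkError, RateLimited, DbLocked, DbError, InternalError, ...
        _ => ErrorClass::Transient,
    }
}
```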
Key design decisions:
- `next_retry_at_ms` is computed once on failure and persisted — `service status` simply reads it for stable, consistent display
- Retry-After awareness: if a transient error includes a server-provided retry hint (e.g., `Retry-After` header on 429 responses, `X-RateLimit-Reset` on GitLab rate limits), the backoff is set to `max(computed_backoff, hinted_retry_at)`. This prevents useless retries during rate-limit windows and respects GitLab's guidance. The `backoff_reason` field (if present) indicates whether the backoff was server-hinted.
- Backoff base is the configured interval, not a hardcoded 1800s — a user with `--interval 5m` gets shorter backoffs than one with `--interval 1h`
- Optional stage failures produce `degraded` outcome without triggering backoff
- Respects backoff window from previous failures (reads `next_retry_at_ms` from status file)
- Pauses on permanent errors instead of burning retries
- Trips circuit breaker after 10 consecutive transient failures
- Exit code is always 0 (the scheduler should not interpret exit codes as retry signals — lore manages its own retry logic)
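The backoff computation can be sketched as below. From the plan: the base is the configured interval (not a hardcoded 1800s) and a server hint wins via max(). The exponential doubling, the exponent cap, and the 24-hour ceiling are illustrative assumptions:

```rust
/// Compute the persisted next_retry_at_ms after a transient failure.
fn next_retry_at_ms(
    now_ms: u64,
    interval_seconds: u64,
    consecutive_failures: u32,
    hinted_retry_at_ms: Option<u64>, // from Retry-After / X-RateLimit-Reset
) -> u64 {
    let exponent = consecutive_failures.saturating_sub(1).min(6); // assumed cap
    let delay_s = (interval_seconds << exponent).min(24 * 3600); // assumed 24h ceiling
    let computed = now_ms + delay_s * 1_000;
    match hinted_retry_at_ms {
        Some(hint) => computed.max(hint), // max(computed_backoff, hinted_retry_at)
        None => computed,
    }
}
```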
Circuit breaker (with half-open recovery):
After max_transient_failures consecutive transient failures (default: 10), the service transitions to paused state with reason CIRCUIT_BREAKER. However, instead of requiring manual intervention forever, the circuit breaker enters a half_open state after a cooldown period (circuit_breaker_cooldown_seconds, default: 1800 = 30 minutes).
In half_open, the next service run invocation is allowed to proceed as a probe:
- If the probe succeeds or returns `degraded`, the circuit breaker closes automatically: `consecutive_failures` resets to 0, `paused_reason` is cleared, and normal operation resumes.
- If the probe fails, the circuit breaker returns to `paused` state with an updated `circuit_breaker_paused_at_ms` timestamp, starting another cooldown period.
This provides self-healing for recoverable systemic failures (DNS outages, GitLab maintenance windows) without requiring manual lore service resume for every transient hiccup. Truly persistent problems (bad token, config corruption) are caught by the permanent error classifier and go directly to paused without the half-open mechanism.
The circuit_breaker_cooldown_seconds is stored in the manifest alongside max_transient_failures. Both are hardcoded defaults for v1 (10 failures, 30-minute cooldown) but can be made configurable in a future iteration.
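The trip/cooldown/probe flow can be sketched as a small state machine using the v1 defaults (10 failures, 30-minute cooldown). Field and method names are illustrative:

```rust
const MAX_TRANSIENT_FAILURES: u32 = 10; // v1 default
const COOLDOWN_MS: u64 = 30 * 60 * 1000; // v1 default: 30 minutes

/// Sketch of the circuit breaker's paused -> half_open -> closed/paused flow.
struct Breaker {
    consecutive_failures: u32,
    paused_at_ms: Option<u64>, // set when the breaker trips
    half_open: bool,
}

impl Breaker {
    /// True if this run may proceed (normal run, or a half-open probe).
    fn allow_run(&mut self, now_ms: u64) -> bool {
        match self.paused_at_ms {
            None => true,
            Some(t) if now_ms >= t + COOLDOWN_MS => {
                self.half_open = true; // cooldown expired: allow one probe
                true
            }
            Some(_) => false, // still cooling down
        }
    }

    /// Probe (or normal run) succeeded or was degraded: breaker closes.
    fn on_success(&mut self) {
        self.consecutive_failures = 0;
        self.paused_at_ms = None;
        self.half_open = false;
    }

    /// Transient failure: count it; trip on threshold or on a failed probe.
    fn on_transient_failure(&mut self, now_ms: u64) {
        self.consecutive_failures += 1;
        if self.half_open || self.consecutive_failures >= MAX_TRANSIENT_FAILURES {
            self.paused_at_ms = Some(now_ms); // (re-)trip: new cooldown window
            self.half_open = false;
        }
    }
}
```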
Acceptance criteria:
- Hidden from `--help` (use `#[command(hide = true)]`)
- Always runs in robot mode regardless of `--robot` flag
- Acquires pipeline-level lock before executing sync
- Executes stages independently and records per-stage outcomes
- Retries transient core stage failures once (1-5s jittered delay) before counting as failed
- Permanent core stage errors are never retried — immediate pause
- Classifies core stage errors as transient or permanent
- Optional stage failures produce `degraded` outcome without triggering backoff
- Respects backoff window from previous failures (reads `next_retry_at_ms` from status file)
- Pauses on permanent errors instead of burning retries
- Trips circuit breaker after 10 consecutive transient failures
- Exit code is always 0 (the scheduler should not interpret exit codes as retry signals — lore manages its own retry logic)
Robot output (success):
{
"ok": true,
"data": {
"action": "sync_completed",
"outcome": "success",
"profile": "balanced",
"duration_seconds": 45.2,
"stage_results": [
{ "stage": "issues", "success": true, "items_updated": 12 },
{ "stage": "mrs", "success": true, "items_updated": 4 },
{ "stage": "docs", "success": true, "items_updated": 28 }
],
"consecutive_failures": 0
},
"meta": { "elapsed_ms": 45200 }
}
Robot output (degraded — optional stages failed):
{
"ok": true,
"data": {
"action": "sync_completed",
"outcome": "degraded",
"profile": "full",
"duration_seconds": 38.1,
"stage_results": [
{ "stage": "issues", "success": true, "items_updated": 12 },
{ "stage": "mrs", "success": true, "items_updated": 4 },
{ "stage": "docs", "success": true, "items_updated": 28 },
{ "stage": "embeddings", "success": false, "error": "Ollama unavailable" }
],
"consecutive_failures": 0
},
"meta": { "elapsed_ms": 38100 }
}
Robot output (skipped — backoff):
{
"ok": true,
"data": {
"action": "skipped",
"reason": "backoff",
"consecutive_failures": 3,
"next_retry_iso": "2026-02-09T14:30:00.000Z",
"remaining_seconds": 1842
},
"meta": { "elapsed_ms": 1 }
}
Robot output (paused — permanent error):
{
"ok": true,
"data": {
"action": "paused",
"reason": "AUTH_FAILED",
"message": "GitLab returned 401 Unauthorized",
"suggestion": "Rotate token, then run: lore service resume"
},
"meta": { "elapsed_ms": 1200 }
}
Robot output (paused — circuit breaker):
{
"ok": true,
"data": {
"action": "paused",
"reason": "CIRCUIT_BREAKER",
"message": "10 consecutive transient failures (last: NetworkError: connection refused)",
"consecutive_failures": 10,
"suggestion": "Check network/GitLab availability, then run: lore service resume"
},
"meta": { "elapsed_ms": 1200 }
}
lore service resume [--service <service_id|name>]
What it does: Clears the paused state (including half-open circuit breaker) and resets consecutive failures, allowing the scheduler to retry on the next interval.
User journey:
- User sees `lore service status` reports `scheduler: PAUSED`
- User fixes the underlying issue (rotates token, fixes config, etc.)
- User runs `lore service resume`
- CLI resets `consecutive_failures` to 0, clears `paused_reason` and `last_error_*` fields
- Next scheduled `service run` will attempt sync normally
Acceptance criteria:
- If not paused, exits cleanly with informational message ("Service is not paused")
- If not installed, exits cleanly with informational message ("Service is not installed")
- Does NOT trigger an immediate sync (just clears state — scheduler handles the next run)
- Robot and human output modes
Robot output:
{
"ok": true,
"data": {
"was_paused": true,
"previous_reason": "AUTH_FAILED",
"consecutive_failures_cleared": 5
},
"meta": { "elapsed_ms": 2 }
}
Human output:
Service resumed:
Previous state: PAUSED (AUTH_FAILED)
Failures cleared: 5
Next sync will run at the scheduled interval.
Or for circuit breaker:
Service resumed:
Previous state: PAUSED (CIRCUIT_BREAKER, 10 transient failures)
Failures cleared: 10
Next sync will run at the scheduled interval.
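The resume path is a pure state reset. A minimal sketch, using a trimmed stand-in for `SyncStatusFile` (field names follow the schema below; `resume` and its return tuple are illustrative, shaped to feed the robot output):

```rust
/// Trimmed stand-in for SyncStatusFile — just the fields resume touches.
#[derive(Default)]
struct Status {
    consecutive_failures: u32,
    next_retry_at_ms: Option<i64>,
    paused_reason: Option<String>,
    circuit_breaker_paused_at_ms: Option<i64>,
    last_error_code: Option<String>,
    last_error_message: Option<String>,
}

/// Clear pause/backoff state without triggering a sync.
/// Returns (was_paused, previous_reason, failures_cleared).
fn resume(status: &mut Status) -> (bool, Option<String>, u32) {
    let was_paused = status.paused_reason.is_some();
    let previous_reason = status.paused_reason.take();
    let cleared = status.consecutive_failures;
    status.consecutive_failures = 0;
    status.next_retry_at_ms = None;
    status.circuit_breaker_paused_at_ms = None;
    status.last_error_code = None;
    status.last_error_message = None;
    (was_paused, previous_reason, cleared)
}

fn main() {
    let mut s = Status {
        consecutive_failures: 5,
        paused_reason: Some("AUTH_FAILED".into()),
        ..Default::default()
    };
    let (was_paused, reason, cleared) = resume(&mut s);
    assert!(was_paused);
    assert_eq!(reason.as_deref(), Some("AUTH_FAILED"));
    assert_eq!(cleared, 5);
    assert_eq!(s.consecutive_failures, 0);
    assert!(s.paused_reason.is_none());
}
```

Because resume only mutates the status file, the next run happens on the scheduler's cadence — exactly the "does NOT trigger an immediate sync" criterion.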
lore service pause [--reason <text>] [--service <service_id|name>]
What it does: Pauses scheduled execution without uninstalling the service. Useful for maintenance windows, debugging, or temporarily stopping syncs while the underlying infrastructure is being modified.
User journey:
- User runs `lore service pause --reason "GitLab maintenance window"`
- CLI writes `paused_reason` to the status file with the provided reason (or "Manually paused" if no reason given)
- Next `service run` will see the paused state and exit immediately
Acceptance criteria:
- Sets `paused_reason` in the status file
- Does NOT modify the OS scheduler (service remains installed and scheduled — it just no-ops)
- If already paused, updates the reason and reports `already_paused: true`
- `lore service resume` clears the pause (same as for other paused states)
- Robot and human output modes
Robot output:
{
"ok": true,
"data": {
"service_id": "a1b2c3d4e5f6",
"paused": true,
"reason": "GitLab maintenance window",
"already_paused": false
},
"meta": { "elapsed_ms": 2 }
}
Human output:
Service paused (a1b2c3d4e5f6):
Reason: GitLab maintenance window
Resume with: lore service resume
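The pause handler is the mirror image of resume: a single status-file mutation. A sketch with an illustrative trimmed `Status` type (the `(effective_reason, already_paused)` return shape matches the robot output fields above):

```rust
/// Trimmed stand-in for SyncStatusFile — only the field pause touches.
struct Status {
    paused_reason: Option<String>,
}

/// Record the pause reason; report whether the service was already paused.
/// "Manually paused" is the documented default when no --reason is given.
fn pause(status: &mut Status, reason: Option<&str>) -> (String, bool) {
    let already_paused = status.paused_reason.is_some();
    let effective = reason.unwrap_or("Manually paused").to_string();
    status.paused_reason = Some(effective.clone());
    (effective, already_paused)
}

fn main() {
    let mut s = Status { paused_reason: None };
    let (r1, already1) = pause(&mut s, Some("GitLab maintenance window"));
    assert_eq!(r1, "GitLab maintenance window");
    assert!(!already1);
    // Pausing again updates the reason and reports already_paused: true.
    let (r2, already2) = pause(&mut s, None);
    assert_eq!(r2, "Manually paused");
    assert!(already2);
}
```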
lore service trigger [--ignore-backoff] [--service <service_id|name>]
What it does: Triggers an immediate one-off sync using the installed service profile and policy. Unlike running lore sync manually, this goes through the service policy layer (status file, stage-aware outcomes, error classification) — giving you the same behavior the scheduler would produce, but on-demand.
User journey:
- User runs `lore service trigger`
- CLI reads the manifest to determine profile
- By default, respects current backoff/paused state (reports skip reason if blocked)
- With `--ignore-backoff`, bypasses backoff window (but NOT paused state — use `resume` for that)
- Executes `handle_service_run` logic
- Updates status file with the run result
Acceptance criteria:
- Uses the installed profile from the manifest
- Default: respects backoff and paused states
- `--ignore-backoff`: bypasses backoff window, still respects paused
- If not installed, returns actionable error
- Robot and human output modes (same format as `service run` output)
lore service repair [--service <service_id|name>]
What it does: Repairs corrupt manifest or status files by backing them up and reinitializing. This is a safe alternative to manually deleting files and reinstalling.
User journey:
- User runs `lore service repair` (typically after seeing `ServiceCorruptState` errors)
- CLI checks manifest and status files for JSON parseability
- If corrupt: renames the corrupt file to `{name}.corrupt.{timestamp}` (backup, not delete)
- Reinitializes the status file to default state
- If manifest is corrupt, reports that reinstallation is needed
- Outputs what was repaired
Acceptance criteria:
- Never deletes files — backs up corrupt files with `.corrupt.{timestamp}` suffix
- If both files are valid, reports "No repair needed" (exit 0)
- If manifest is corrupt, clears it and advises `lore service install`
- If status file is corrupt, reinitializes to default
- Robot and human output modes
Robot output:
{
"ok": true,
"data": {
"repaired": true,
"actions": [
{ "file": "sync-status-a1b2c3d4e5f6.json", "action": "reinitialized", "backup": "sync-status-a1b2c3d4e5f6.json.corrupt.1707480000" }
],
"needs_reinstall": false
},
"meta": { "elapsed_ms": 5 }
}
Human output:
Service repaired (a1b2c3d4e5f6):
Reinitialized: sync-status-a1b2c3d4e5f6.json
Backed up: sync-status-a1b2c3d4e5f6.json.corrupt.1707480000
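The backup-not-delete step can be sketched with std alone. This is a simplified illustration of the rename (`backup_corrupt` is a hypothetical helper name; the real command's error handling and reporting are richer):

```rust
use std::fs;
use std::path::{Path, PathBuf};
use std::time::{SystemTime, UNIX_EPOCH};

/// Rename a corrupt file to `{name}.corrupt.{timestamp}` so the caller can
/// reinitialize the original path. Nothing is ever deleted.
fn backup_corrupt(path: &Path) -> std::io::Result<PathBuf> {
    let ts = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();
    let mut backup = path.as_os_str().to_owned();
    backup.push(format!(".corrupt.{ts}"));
    let backup = PathBuf::from(backup);
    fs::rename(path, &backup)?; // atomic within the same filesystem
    Ok(backup)
}

fn main() {
    let status = std::env::temp_dir().join("sync-status-demo.json");
    fs::write(&status, "{ truncated json").unwrap(); // simulate corruption
    let backup = backup_corrupt(&status).unwrap();
    assert!(!status.exists()); // original path freed for reinitialization
    assert!(backup.exists()); // corrupt content preserved for forensics
    fs::remove_file(backup).unwrap();
}
```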
Install Manifest
Location
{get_data_dir()}/service-manifest-{service_id}.json — e.g., ~/.local/share/lore/service-manifest-a1b2c3d4e5f6.json
Purpose
Avoids brittle parsing of platform-specific files (plist XML, systemd units) to recover install configuration. service status reads the manifest first, then verifies platform state matches. The service_id suffix enables multiple coexisting installations for different workspaces.
Schema
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ServiceManifest {
/// Schema version for forward compatibility (start at 1)
pub schema_version: u32,
/// Stable identity for this service installation
pub service_id: String,
/// Canonical workspace root used in identity derivation
pub workspace_root: String,
/// When the service was first installed
pub installed_at_iso: String,
/// When the manifest was last written
pub updated_at_iso: String,
/// Platform backend
pub platform: String,
/// Configured interval in seconds
pub interval_seconds: u64,
/// Sync profile (fast/balanced/full)
pub profile: String,
/// Absolute path to the lore binary
pub binary_path: String,
/// Optional config path override
#[serde(skip_serializing_if = "Option::is_none")]
pub config_path: Option<String>,
/// How the token is stored
pub token_source: String,
/// Token environment variable name
pub token_env_var: String,
/// Paths to generated service files
pub service_files: Vec<String>,
/// The exact command the scheduler runs
pub sync_command: String,
/// Circuit breaker threshold (consecutive transient failures before pause)
pub max_transient_failures: u32,
/// Cooldown period before circuit breaker enters half-open probe state (seconds)
pub circuit_breaker_cooldown_seconds: u64,
/// SHA-256 hash of generated scheduler artifacts (plist/unit/wrapper content).
/// Used for spec-level drift detection: if file content on disk doesn't match
/// this hash, something external modified the service files.
pub spec_hash: String,
}
service_id derivation
/// Compute a stable service ID from a canonical identity tuple:
/// (workspace_root + config_path + sorted project URLs).
///
/// This avoids collisions when multiple workspaces share one global config
/// by incorporating what is being synced (project URLs) and where the workspace
/// lives alongside the config location.
/// Returns first 12 hex chars of SHA-256 (48 bits — collision-safe for local use).
pub fn compute_service_id(workspace_root: &Path, config_path: &Path, project_urls: &[&str]) -> String {
use sha2::{Sha256, Digest};
let canonical_config = config_path.canonicalize()
.unwrap_or_else(|_| config_path.to_path_buf());
let canonical_workspace = workspace_root.canonicalize()
.unwrap_or_else(|_| workspace_root.to_path_buf());
let mut hasher = Sha256::new();
hasher.update(canonical_workspace.to_string_lossy().as_bytes());
hasher.update(b"\0");
hasher.update(canonical_config.to_string_lossy().as_bytes());
// Sort URLs for determinism regardless of config ordering
let mut urls: Vec<&str> = project_urls.to_vec();
urls.sort_unstable();
for url in &urls {
hasher.update(b"\0"); // separator to prevent concatenation collisions
hasher.update(url.as_bytes());
}
let hash = hasher.finalize();
hex::encode(&hash[..6]) // 12 hex chars
}
/// Sanitize a user-provided name to [a-z0-9-], max 32 chars.
pub fn sanitize_service_name(name: &str) -> Result<String, String> {
let sanitized: String = name.to_lowercase()
.chars()
.map(|c| if c.is_ascii_alphanumeric() || c == '-' { c } else { '-' })
.collect();
let trimmed = sanitized.trim_matches('-').to_string();
if trimmed.is_empty() {
return Err("Service name must contain at least one alphanumeric character".into());
}
if trimmed.len() > 32 {
return Err("Service name must be 32 characters or fewer".into());
}
Ok(trimmed)
}
Read/Write
- `ServiceManifest::read(path: &Path) -> Result<Option<Self>, LoreError>` — returns `Ok(None)` if the file doesn't exist, `Err` if the file exists but is corrupt/unparseable (distinguishes missing from corrupt). Schema migration: if the file has `schema_version < CURRENT_VERSION`, the read method migrates the in-memory model to the current version (adding default values for new fields) and atomically rewrites the file. If the file has an unknown future `schema_version` (higher than current), it returns `Err(ServiceCorruptState)` with an actionable message to update lore.
- `ServiceManifest::write_atomic(&self, path: &Path) -> std::io::Result<()>` — writes to a tmp file in the same directory, fsyncs, then renames over the target. Creates parent dirs if needed.
- Written by `service install`, read by `service status`, `service run`, `service uninstall`
- `service uninstall` removes the manifest file
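The three-way version gate in the read path can be sketched as a pure function. `ReadAction` and `read_action` are illustrative names (the real read method performs the migration inline), and `CURRENT_VERSION = 2` is a hypothetical future bump used so both branches are exercisable — the schema starts at 1:

```rust
/// What to do with a manifest/status file based on its schema_version.
#[derive(Debug, PartialEq)]
enum ReadAction {
    UseAsIs,
    MigrateAndRewrite, // older file: fill defaults, atomically rewrite
    RejectFuture,      // newer file: ServiceCorruptState, advise updating lore
}

const CURRENT_VERSION: u32 = 2; // hypothetical: plan starts schemas at 1

fn read_action(file_version: u32) -> ReadAction {
    if file_version < CURRENT_VERSION {
        ReadAction::MigrateAndRewrite
    } else if file_version == CURRENT_VERSION {
        ReadAction::UseAsIs
    } else {
        ReadAction::RejectFuture
    }
}

fn main() {
    assert_eq!(read_action(1), ReadAction::MigrateAndRewrite);
    assert_eq!(read_action(2), ReadAction::UseAsIs);
    assert_eq!(read_action(3), ReadAction::RejectFuture);
}
```

Rejecting future versions (rather than best-effort parsing) keeps an old binary from silently dropping fields a newer binary wrote.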
Status File
Location
{get_data_dir()}/sync-status-{service_id}.json — e.g., ~/.local/share/lore/sync-status-a1b2c3d4e5f6.json
Add get_service_status_path(service_id: &str) to src/core/paths.rs.
Service-scoped status: Each installed service gets its own status file, keyed by service_id. This prevents cross-service contamination — a fast profile service pausing due to transient errors should not affect a full profile service's state. The pipeline lock remains global (sync_pipeline) to prevent overlapping writes to the shared database.
Schema
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SyncStatusFile {
/// Schema version for forward compatibility (start at 1)
pub schema_version: u32,
/// When this status file was last written
pub updated_at_iso: String,
/// Most recent run result (None if no runs yet — matches idle state)
#[serde(skip_serializing_if = "Option::is_none")]
pub last_run: Option<SyncRunRecord>,
/// Rolling window of recent runs (last 10, newest first)
#[serde(default)]
pub recent_runs: Vec<SyncRunRecord>,
/// Count of consecutive failures (resets to 0 on success or degraded outcome)
pub consecutive_failures: u32,
/// Persisted next retry time (set on failure, cleared on success/resume).
/// Computed once at failure time with jitter, then read-only comparison afterward.
/// This avoids recomputing jitter on every status check.
#[serde(skip_serializing_if = "Option::is_none")]
pub next_retry_at_ms: Option<i64>,
/// If set, service is paused due to a permanent error or circuit breaker
#[serde(skip_serializing_if = "Option::is_none")]
pub paused_reason: Option<String>,
/// Timestamp when circuit breaker entered paused state (for cooldown calculation)
#[serde(skip_serializing_if = "Option::is_none")]
pub circuit_breaker_paused_at_ms: Option<i64>,
/// Error code that caused the pause (for machine consumption)
#[serde(skip_serializing_if = "Option::is_none")]
pub last_error_code: Option<String>,
/// Error message from last failure
#[serde(skip_serializing_if = "Option::is_none")]
pub last_error_message: Option<String>,
/// In-flight run metadata for crash/stale detection. Written to the status file at run start,
/// cleared on completion (success or failure). If present when a new run starts, the previous
/// run crashed or was killed.
#[serde(skip_serializing_if = "Option::is_none")]
pub current_run: Option<CurrentRunState>,
}
/// Metadata for an in-flight sync run. Used to detect stale/crashed runs.
/// Written to the status file at run start, cleared on completion (success or failure).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CurrentRunState {
/// Unix timestamp (ms) when this run started
pub started_at_ms: i64,
/// PID of the process executing this run
pub pid: u32,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SyncRunRecord {
/// ISO-8601 timestamp of this sync run
pub timestamp_iso: String,
/// Unix timestamp in milliseconds
pub timestamp_ms: i64,
/// How long the sync took
pub duration_seconds: f64,
/// Run outcome: "success", "degraded", or "failed"
pub outcome: String,
/// Per-stage results (only present in detailed records, not in recent_runs summary)
#[serde(default, skip_serializing_if = "Vec::is_empty")]
pub stage_results: Vec<StageResult>,
/// Error message if sync failed (None on success/degraded)
#[serde(skip_serializing_if = "Option::is_none")]
pub error_message: Option<String>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StageResult {
/// Stage name: "issues", "mrs", "docs", "embeddings"
pub stage: String,
/// Whether this stage completed successfully
pub success: bool,
/// Number of items created/updated (0 on failure)
#[serde(default)]
pub items_updated: usize,
/// Error message if stage failed
#[serde(skip_serializing_if = "Option::is_none")]
pub error: Option<String>,
/// Machine-readable error code from the underlying LoreError (e.g., "AUTH_FAILED", "NETWORK_ERROR").
/// Propagated through the stage execution layer for reliable error classification.
/// Falls back to string matching on `error` field when not available.
#[serde(skip_serializing_if = "Option::is_none")]
pub error_code: Option<String>,
}
Read/Write
- `SyncStatusFile::read(path: &Path) -> Result<Option<Self>, LoreError>` — returns `Ok(None)` if the file doesn't exist, `Err` if the file exists but is corrupt/unparseable (distinguishes missing from corrupt — a corrupt status file is a warning, not fatal). Schema migration: same behavior as `ServiceManifest::read` — migrates older versions to current, rejects unknown future versions.
- `SyncStatusFile::write_atomic(&self, path: &Path) -> std::io::Result<()>` — writes to a tmp file in the same directory, fsyncs, then renames over the target. Creates parent dirs if needed. Atomic writes prevent truncated JSON from crashes during write.
- `SyncStatusFile::record_run(&mut self, run: SyncRunRecord)` — pushes to `recent_runs` (capped at 10), updates `last_run`
- `SyncStatusFile::clear_paused(&mut self)` — clears `paused_reason`, `circuit_breaker_paused_at_ms`, `last_error_*`, `next_retry_at_ms`, resets `consecutive_failures`
- The file is NOT a fatal error source — if a write fails, log a warning and continue (the sync result matters more than recording it)
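The tmp-fsync-rename pattern behind `write_atomic` can be shown with std alone. A simplified sketch — the real method would use a unique tmp name rather than a fixed `.tmp` extension, and handle errors more carefully:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

/// Write contents atomically: tmp file in the SAME directory, fsync, rename.
/// Same-directory matters — rename is only atomic within one filesystem.
fn write_atomic(path: &Path, contents: &[u8]) -> std::io::Result<()> {
    if let Some(parent) = path.parent() {
        fs::create_dir_all(parent)?;
    }
    let tmp = path.with_extension("tmp");
    let mut f = File::create(&tmp)?;
    f.write_all(contents)?;
    f.sync_all()?; // fsync so the rename never exposes a partial file
    fs::rename(&tmp, path)?;
    Ok(())
}

fn main() {
    let target = std::env::temp_dir().join("lore-atomic-demo.json");
    write_atomic(&target, b"{\"schema_version\":1}").unwrap();
    assert_eq!(fs::read(&target).unwrap(), b"{\"schema_version\":1}");
    // Overwrite: readers see either old or new content, never a truncated mix.
    write_atomic(&target, b"{\"schema_version\":2}").unwrap();
    assert_eq!(fs::read(&target).unwrap(), b"{\"schema_version\":2}");
    fs::remove_file(&target).unwrap();
}
```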
Backoff Logic
Backoff applies only to transient errors and only within service run. Manual lore sync is never subject to backoff. Permanent errors bypass backoff entirely and enter the paused state.
Key design change (from feedback): Instead of recomputing jitter on every service status / service run check, we compute next_retry_at_ms once at failure time and persist it. This makes status output stable, avoids predictable jitter from timestamp-seeded determinism, and simplifies the read path to a single comparison.
/// Injectable time source for deterministic testing.
pub trait Clock: Send + Sync {
fn now_ms(&self) -> i64;
}
/// Production clock using chrono.
pub struct SystemClock;
impl Clock for SystemClock {
fn now_ms(&self) -> i64 {
chrono::Utc::now().timestamp_millis()
}
}
/// Injectable RNG for deterministic jitter tests.
pub trait JitterRng: Send + Sync {
/// Returns a value in [0.0, 1.0)
fn next_f64(&mut self) -> f64;
}
/// Production RNG using thread_rng.
pub struct ThreadJitterRng;
impl JitterRng for ThreadJitterRng {
fn next_f64(&mut self) -> f64 {
use rand::Rng;
rand::thread_rng().gen()
}
}
impl SyncStatusFile {
/// Check if we're still in a backoff window.
/// Returns None if sync should proceed.
/// Returns Some(remaining_seconds) if within backoff window.
/// Reads the persisted `next_retry_at_ms` — no jitter computation on the read path.
pub fn backoff_remaining(&self, clock: &dyn Clock) -> Option<u64> {
// Paused state is handled separately (not via backoff)
if self.paused_reason.is_some() {
return None; // caller checks paused_reason directly
}
if self.consecutive_failures == 0 {
return None;
}
let next_retry = self.next_retry_at_ms?;
let now_ms = clock.now_ms();
if now_ms < next_retry {
Some(((next_retry - now_ms) / 1000) as u64)
} else {
None // backoff expired, proceed
}
}
/// Compute and set next_retry_at_ms after a transient failure.
/// Called once at failure time — jitter is applied here, not on reads.
/// Uses the *configured* interval as the backoff base (not a hardcoded value).
/// If the server provided a retry hint (e.g., Retry-After header), it is
/// respected as a floor: next_retry_at_ms = max(computed_backoff, hint).
pub fn set_backoff(
&mut self,
base_interval_seconds: u64,
clock: &dyn Clock,
rng: &mut dyn JitterRng,
retry_after_ms: Option<i64>,
) {
let exponent = self.consecutive_failures.saturating_sub(1).min(20); // prevent shift overflow (and underflow if called with 0 failures)
let base_backoff = (base_interval_seconds as u128)
.saturating_mul(1u128 << exponent)
.min(4 * 3600) as u64; // cap at 4 hours
// Full jitter: uniform random in [base_interval..cap]
// This decorrelates retries across multiple installations while ensuring
// the minimum backoff is always at least the configured interval.
let jitter_factor = rng.next_f64(); // 0.0..1.0
let min_backoff = base_interval_seconds;
let span = base_backoff.saturating_sub(min_backoff);
let backoff_secs = min_backoff + ((span as f64) * jitter_factor) as u64;
let computed_retry_at = clock.now_ms() + (backoff_secs as i64 * 1000);
// Respect server-provided retry hint as a floor
self.next_retry_at_ms = Some(match retry_after_ms {
Some(hint) => computed_retry_at.max(hint),
None => computed_retry_at,
});
}
}
Key design decisions:
- `next_retry_at_ms` is computed once on failure and persisted — `service status` simply reads it for stable, consistent display
- Backoff base is the configured interval, not a hardcoded 1800s — a user with `--interval 5m` gets shorter backoffs than one with `--interval 1h`
- Full jitter (random in `[base_interval..cap]`) decorrelates retries across multiple installations, avoiding thundering herd
- Injectable `JitterRng` trait enables deterministic testing without seeding from timestamps
- Paused state is checked separately from backoff — they are orthogonal concerns
- `next_retry_at_ms` is cleared on success and on `service resume`
Backoff examples with 30m (1800s) base interval:
| consecutive_failures | max_backoff_seconds | human-readable range |
|---|---|---|
| 1 | 1800 | 30 min (jittered within [30m, 30m]) |
| 2 | 3600 | up to 1 hour (min 30m) |
| 3 | 7200 | up to 2 hours (min 30m) |
| 4 | 14400 | up to 4 hours (capped, min 30m) |
| 5-9 | 14400 | up to 4 hours (capped, min 30m) |
| 10 | — | circuit breaker trips → paused |
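The table rows above fall out of `set_backoff`'s arithmetic when it is factored as a pure function of (failures, interval, jitter draw). A deterministic sketch — `backoff_seconds` is an illustrative refactoring, with the jitter value injected directly instead of via `JitterRng`:

```rust
/// Exponential backoff with a 4-hour cap and full jitter in
/// [base_interval .. base_interval * 2^(failures-1)].
fn backoff_seconds(consecutive_failures: u32, base_interval: u64, jitter: f64) -> u64 {
    let exponent = consecutive_failures.saturating_sub(1).min(20); // prevent shift overflow
    let cap: u128 = 4 * 3600;
    let base_backoff = (base_interval as u128)
        .saturating_mul(1u128 << exponent)
        .min(cap) as u64;
    let span = base_backoff.saturating_sub(base_interval);
    base_interval + ((span as f64) * jitter) as u64
}

fn main() {
    // jitter = 0.0 → the floor is always the configured interval
    assert_eq!(backoff_seconds(1, 1800, 0.0), 1800);
    assert_eq!(backoff_seconds(3, 1800, 0.0), 1800);
    // jitter = 1.0 → the table's max_backoff_seconds column
    assert_eq!(backoff_seconds(2, 1800, 1.0), 3600);
    assert_eq!(backoff_seconds(3, 1800, 1.0), 7200);
    assert_eq!(backoff_seconds(4, 1800, 1.0), 14400); // capped at 4h
    assert_eq!(backoff_seconds(9, 1800, 1.0), 14400); // still capped
    // A shorter configured interval yields proportionally shorter backoffs.
    assert_eq!(backoff_seconds(2, 300, 1.0), 600);
}
```

(The real RNG draws from [0.0, 1.0), so 1.0 here marks the exclusive upper bound of the jitter range.)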
Service Run Implementation (handle_service_run)
Critical: Backoff, error classification, circuit breaker, stage-aware execution, and status file management live only in handle_service_run. The manual handle_sync_cmd is NOT modified — it does not read or write the service status file.
Location: src/cli/commands/service/run.rs
pub fn handle_service_run(service_id: &str, start: std::time::Instant) -> Result<(), Box<dyn std::error::Error>> {
let clock = SystemClock;
let mut rng = ThreadJitterRng;
// 1. Read manifest for the given service_id
let manifest_path = lore::core::paths::get_service_manifest_path(service_id);
let manifest = ServiceManifest::read(&manifest_path)?
.ok_or_else(|| LoreError::ServiceError {
message: format!("Service manifest not found for service_id '{service_id}'. Is the service installed?"),
})?;
// 2. Read status file (service-scoped)
let status_path = lore::core::paths::get_service_status_path(&manifest.service_id);
let mut status = match SyncStatusFile::read(&status_path) {
Ok(Some(s)) => s,
Ok(None) => SyncStatusFile::default(),
Err(e) => {
tracing::warn!(error = %e, "Corrupt status file, starting fresh");
SyncStatusFile::default()
}
};
// 3. Check paused state (permanent error or circuit breaker)
if let Some(reason) = &status.paused_reason {
// Check for circuit breaker half-open transition
let is_circuit_breaker = reason.starts_with("CIRCUIT_BREAKER");
let half_open = is_circuit_breaker
&& status.circuit_breaker_paused_at_ms.map_or(false, |paused_at| {
let cooldown_ms = (manifest.circuit_breaker_cooldown_seconds as i64) * 1000;
clock.now_ms() >= paused_at + cooldown_ms
});
if half_open {
// Cooldown expired — allow probe run (continue to step 5)
tracing::info!("Circuit breaker half-open: allowing probe run");
} else {
print_robot_json(json!({
"ok": true,
"data": {
"action": "paused",
"reason": reason,
"suggestion": if is_circuit_breaker {
format!("Waiting for cooldown ({}s). Or run: lore service resume",
manifest.circuit_breaker_cooldown_seconds)
} else {
"Fix the issue, then run: lore service resume".to_string()
}
},
"meta": { "elapsed_ms": start.elapsed().as_millis() }
}));
return Ok(());
}
}
// 4. Check backoff (reads persisted next_retry_at_ms — no jitter computation)
if let Some(remaining) = status.backoff_remaining(&clock) {
print_robot_json(json!({
"ok": true,
"data": {
"action": "skipped",
"reason": "backoff",
"consecutive_failures": status.consecutive_failures,
"next_retry_iso": status.next_retry_at_ms.map(|ms| {
chrono::DateTime::from_timestamp_millis(ms)
.map(|dt| dt.to_rfc3339())
}),
"remaining_seconds": remaining,
},
"meta": { "elapsed_ms": start.elapsed().as_millis() }
}));
return Ok(());
}
// 5. Acquire pipeline lock
let lock = match AppLock::try_acquire("sync_pipeline", stale_minutes) {
Ok(lock) => lock,
Err(_) => {
print_robot_json(json!({
"ok": true,
"data": { "action": "skipped", "reason": "locked" },
"meta": { "elapsed_ms": start.elapsed().as_millis() }
}));
return Ok(());
}
};
// 6. Write current_run metadata for stale-run detection
status.current_run = Some(CurrentRunState {
started_at_ms: clock.now_ms(),
pid: std::process::id(),
});
let _ = status.write_atomic(&status_path); // best-effort
// 7. Build sync args from profile
let sync_args = manifest.profile_to_sync_args();
// 8. Execute sync pipeline stage-by-stage
let stage_results = execute_sync_stages(&sync_args);
// 9. Classify outcome
let core_failed = stage_results.iter()
.any(|s| (s.stage == "issues" || s.stage == "mrs") && !s.success);
let optional_failed = stage_results.iter()
.any(|s| (s.stage == "docs" || s.stage == "embeddings") && !s.success);
let all_success = stage_results.iter().all(|s| s.success);
let outcome = if all_success {
"success"
} else if !core_failed && optional_failed {
"degraded"
} else {
"failed"
};
let run = SyncRunRecord {
timestamp_iso: chrono::Utc::now().to_rfc3339(),
timestamp_ms: clock.now_ms(),
duration_seconds: start.elapsed().as_secs_f64(),
outcome: outcome.to_string(),
stage_results: stage_results.clone(),
error_message: if outcome == "failed" {
stage_results.iter()
.find(|s| !s.success)
.and_then(|s| s.error.clone())
} else {
None
},
};
status.record_run(run);
match outcome {
"success" | "degraded" => {
// Degraded does NOT count as a failure — core data is fresh
status.consecutive_failures = 0;
status.next_retry_at_ms = None;
status.paused_reason = None;
status.last_error_code = None;
status.last_error_message = None;
}
"failed" => {
let core_error = stage_results.iter()
.find(|s| (s.stage == "issues" || s.stage == "mrs") && !s.success);
// Check if the underlying error is permanent
if let Some(stage) = core_error {
if is_permanent_stage_error(stage) {
status.paused_reason = Some(format!(
"{}: {}",
stage.stage,
stage.error.as_deref().unwrap_or("unknown error")
));
status.last_error_code = Some("PERMANENT".to_string());
status.last_error_message = stage.error.clone();
// Don't increment consecutive_failures — we're pausing
} else {
status.consecutive_failures = status.consecutive_failures.saturating_add(1);
status.last_error_code = Some("TRANSIENT".to_string());
status.last_error_message = stage.error.clone();
// Circuit breaker check
if status.consecutive_failures >= manifest.max_transient_failures {
status.paused_reason = Some(format!(
"CIRCUIT_BREAKER: {} consecutive transient failures (last: {})",
status.consecutive_failures,
stage.error.as_deref().unwrap_or("unknown")
));
status.circuit_breaker_paused_at_ms = Some(clock.now_ms());
status.next_retry_at_ms = None; // paused, not backing off
} else {
// Extract retry hint from stage error if available (e.g., Retry-After header)
let retry_hint = extract_retry_after_hint(stage);
status.set_backoff(manifest.interval_seconds, &clock, &mut rng, retry_hint);
}
}
}
}
_ => unreachable!(),
}
// 10. Clear current_run (run is complete)
status.current_run = None;
// 11. Write status atomically (best-effort)
if let Err(e) = status.write_atomic(&status_path) {
tracing::warn!(error = %e, "Failed to write sync status file");
}
// 12. Release lock (drop)
drop(lock);
// 13. Print result
print_robot_json(json!({
"ok": true,
"data": {
"action": if outcome == "failed" && status.paused_reason.is_some() { "paused" } else { "sync_completed" },
"outcome": outcome,
"profile": manifest.profile,
"duration_seconds": start.elapsed().as_secs_f64(),
"stage_results": stage_results,
"consecutive_failures": status.consecutive_failures,
},
"meta": { "elapsed_ms": start.elapsed().as_millis() }
}));
Ok(())
}
Error classification helpers
/// Classify by ErrorCode (used when we have the LoreError directly)
fn is_permanent_error(e: &LoreError) -> bool {
matches!(
e.code(),
ErrorCode::TokenNotSet
| ErrorCode::AuthFailed
| ErrorCode::ConfigNotFound
| ErrorCode::ConfigInvalid
| ErrorCode::MigrationFailed
)
}
/// Classify from error_code string (primary) or error message string (fallback).
/// The error_code field is propagated through stage execution and is the
/// preferred classification mechanism. String matching on the error message
/// is a fallback for stages that don't yet propagate error_code.
fn is_permanent_stage_error(stage: &StageResult) -> bool {
// Primary: classify by machine-readable error code
if let Some(code) = &stage.error_code {
return matches!(
code.as_str(),
"TOKEN_NOT_SET" | "AUTH_FAILED" | "CONFIG_NOT_FOUND"
| "CONFIG_INVALID" | "MIGRATION_FAILED"
);
}
// Fallback: string matching (for stages that don't yet propagate error_code)
stage.error.as_deref().map_or(false, |m| {
m.contains("401 Unauthorized")
|| m.contains("TokenNotSet")
|| m.contains("ConfigNotFound")
|| m.contains("ConfigInvalid")
|| m.contains("MigrationFailed")
})
}
Implementation note: The `error_code` field on `StageResult` is the primary classification mechanism. Each stage's execution wrapper should catch `LoreError`, extract its `ErrorCode` via `.code().to_string()`, and populate the `error_code` field. The string-matching fallback exists for robustness but should not be the primary path.
Pipeline lock
The sync_pipeline lock uses the existing AppLock mechanism (same as the ingest lock). It prevents:
- Two `service run` invocations overlapping (if the scheduler fires before the previous run completes)
- A `service run` overlapping with a manual `lore sync` (the manual sync should also acquire this lock)
Change to handle_sync_cmd: Add sync_pipeline lock acquisition at the top of handle_sync_cmd as well. This is the only change to the manual sync path — no backoff, no status file writes. If the lock is already held by a service run, manual sync waits briefly then fails with a clear message ("A scheduled sync is in progress. Wait for it to complete or use --force to override.").
// In handle_sync_cmd, after config load:
let _pipeline_lock = AppLock::try_acquire("sync_pipeline", stale_lock_minutes)
.map_err(|_| LoreError::ServiceError {
message: "Another sync is in progress. Wait for it to complete or use --force.".into(),
})?;
Platform Backends
Architecture
src/cli/commands/service/platform/mod.rs exports free functions that dispatch via #[cfg(target_os)]. All functions take service_id to construct platform-specific identifiers:
pub fn install(service_id: &str, ...) -> Result<InstallResult> {
#[cfg(target_os = "macos")]
return launchd::install(service_id, ...);
#[cfg(target_os = "linux")]
return systemd::install(service_id, ...);
#[cfg(target_os = "windows")]
return schtasks::install(service_id, ...);
#[cfg(not(any(target_os = "macos", target_os = "linux", target_os = "windows")))]
return Err(LoreError::ServiceUnsupported);
}
Same pattern for uninstall(), is_installed(), get_state(), service_file_paths(), platform_name().
Architecture note: A `SchedulerBackend` trait is the target architecture for deterministic integration testing with a `FakeBackend` that simulates install/uninstall/state without touching the OS. For v1, the `#[cfg]` dispatch + `run_cmd` helper provides adequate testability — unit tests validate template generation (string output, no OS calls) and `run_cmd` captures all OS interactions with kill+reap timeout handling. The function signatures already mirror the trait shape (`install`, `uninstall`, `is_installed`, `get_state`, `service_file_paths`, `check_prerequisites`), making the trait extraction a low-risk refactoring target for v2. When extracted, the trait should be parameterized by `service_id` and return `Result<T>` for all operations.
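A minimal sketch of that v2 trait extraction, with an in-memory `FakeBackend`. Signatures are simplified stand-ins (a reduced method set, `String` errors instead of `LoreError`) — the point is the shape: every operation is keyed by `service_id`, and the fake never touches the OS:

```rust
use std::collections::HashSet;

/// Sketch of the v2 backend trait; the real one would also expose
/// get_state, service_file_paths, and check_prerequisites.
trait SchedulerBackend {
    fn install(&mut self, service_id: &str) -> Result<(), String>;
    fn uninstall(&mut self, service_id: &str) -> Result<(), String>;
    fn is_installed(&self, service_id: &str) -> bool;
    fn platform_name(&self) -> &'static str;
}

/// Test double: records installs in memory for deterministic tests.
#[derive(Default)]
struct FakeBackend {
    installed: HashSet<String>,
}

impl SchedulerBackend for FakeBackend {
    fn install(&mut self, service_id: &str) -> Result<(), String> {
        self.installed.insert(service_id.to_string());
        Ok(())
    }
    fn uninstall(&mut self, service_id: &str) -> Result<(), String> {
        if self.installed.remove(service_id) {
            Ok(())
        } else {
            Err(format!("service '{service_id}' is not installed"))
        }
    }
    fn is_installed(&self, service_id: &str) -> bool {
        self.installed.contains(service_id)
    }
    fn platform_name(&self) -> &'static str {
        "fake"
    }
}

fn main() {
    let mut backend = FakeBackend::default();
    assert!(!backend.is_installed("a1b2c3d4e5f6"));
    backend.install("a1b2c3d4e5f6").unwrap();
    assert!(backend.is_installed("a1b2c3d4e5f6"));
    backend.uninstall("a1b2c3d4e5f6").unwrap();
    // Double-uninstall surfaces as an error, mirroring real backends.
    assert!(backend.uninstall("a1b2c3d4e5f6").is_err());
}
```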
Command Runner Helper
All platform backends use a shared run_cmd helper for consistent error handling:
/// Execute a system command with timeout and stderr capture.
/// Returns stdout on success, ServiceCommandFailed on failure.
/// On timeout, kills the child process and waits to reap it (prevents zombie processes).
fn run_cmd(program: &str, args: &[&str], timeout_secs: u64) -> Result<String> {
let mut child = std::process::Command::new(program)
.args(args)
.stdout(std::process::Stdio::piped())
.stderr(std::process::Stdio::piped())
.spawn()
.map_err(|e| LoreError::ServiceCommandFailed {
cmd: format!("{} {}", program, args.join(" ")),
exit_code: None,
stderr: e.to_string(),
})?;
// Wait with timeout; on timeout kill and reap
// This prevents process leaks that can wedge repeated runs.
let output = wait_with_timeout_kill_and_reap(&mut child, timeout_secs)?;
if output.status.success() {
Ok(String::from_utf8_lossy(&output.stdout).to_string())
} else {
Err(LoreError::ServiceCommandFailed {
cmd: format!("{} {}", program, args.join(" ")),
exit_code: output.status.code(),
stderr: String::from_utf8_lossy(&output.stderr).to_string(),
})
}
}
/// Wait for child process with timeout. On timeout, sends SIGKILL and waits
/// for the process to be reaped (prevents zombie processes on Unix).
///
/// NOTE: stdout/stderr are read after exit. This is safe for scheduler commands
/// (launchctl, systemctl, schtasks) which produce small output. For commands
/// that could produce large output (>64KB), concurrent draining via threads or
/// `child.wait_with_output()` would be needed to prevent pipe backpressure deadlock.
fn wait_with_timeout_kill_and_reap(
child: &mut std::process::Child,
timeout_secs: u64,
) -> Result<std::process::Output> {
use std::time::{Duration, Instant};
let deadline = Instant::now() + Duration::from_secs(timeout_secs);
loop {
match child.try_wait() {
Ok(Some(status)) => {
let stdout = child.stdout.take().map_or(Vec::new(), |mut s| {
let mut buf = Vec::new();
std::io::Read::read_to_end(&mut s, &mut buf).unwrap_or(0);
buf
});
let stderr = child.stderr.take().map_or(Vec::new(), |mut s| {
let mut buf = Vec::new();
std::io::Read::read_to_end(&mut s, &mut buf).unwrap_or(0);
buf
});
return Ok(std::process::Output { status, stdout, stderr });
}
Ok(None) => {
if Instant::now() >= deadline {
// Timeout: kill and reap
let _ = child.kill();
let _ = child.wait(); // reap to prevent zombie
return Err(LoreError::ServiceCommandFailed {
cmd: "(timeout)".into(),
exit_code: None,
stderr: format!("Process timed out after {timeout_secs}s"),
});
}
std::thread::sleep(Duration::from_millis(100));
}
Err(e) => return Err(LoreError::ServiceCommandFailed {
cmd: "(wait)".into(),
exit_code: None,
stderr: e.to_string(),
}),
}
}
}
This ensures all launchctl, systemctl, and schtasks failures produce consistent, machine-readable errors with the exact command, exit code, and stderr captured.
Token Storage Helper
/// Write token to a user-private env file, scoped by service_id.
/// Returns the path to the env file.
///
/// Rejects tokens containing NUL bytes or newlines to prevent env-file injection.
/// The token is written as a raw value (not shell-quoted) and read via `cat` in
/// the wrapper script, never `source`d or `eval`d.
fn write_token_env_file(
data_dir: &Path,
service_id: &str,
token_env_var: &str,
token_value: &str,
) -> Result<PathBuf> {
// Validate token content — reject values that could break env-file format
if token_value.contains('\0') || token_value.contains('\n') || token_value.contains('\r') {
return Err(LoreError::ServiceError {
message: "Token contains NUL or newline characters, which are not safe for env-file storage. \
Use --token-source embedded instead.".into(),
});
}
let env_path = data_dir.join(format!("service-env-{service_id}"));
let content = format!("{}={}\n", token_env_var, token_value);
    // Write atomically: tmp file + fsync + rename
    let tmp_path = env_path.with_extension("tmp");
    {
        use std::io::Write;
        let mut f = std::fs::File::create(&tmp_path)?;
        f.write_all(content.as_bytes())?;
        f.sync_all()?; // fsync before rename so a crash never leaves a partial env file
    }
// Set permissions to 0600 (owner read/write only) BEFORE rename
#[cfg(unix)]
{
use std::os::unix::fs::PermissionsExt;
std::fs::set_permissions(&tmp_path, std::fs::Permissions::from_mode(0o600))?;
}
std::fs::rename(&tmp_path, &env_path)?;
Ok(env_path)
}
Function signatures
pub struct InstallResult {
pub platform: String,
pub service_id: String,
pub interval_seconds: u64,
pub profile: String,
pub binary_path: String,
pub config_path: Option<String>,
pub service_files: Vec<String>,
pub sync_command: String,
pub token_env_var: String,
pub token_source: String, // "env_file", "embedded", or "system_env"
}
pub struct UninstallResult {
pub was_installed: bool,
pub service_id: String,
pub platform: String,
pub removed_files: Vec<String>,
}
pub fn install(
service_id: &str,
binary_path: &str,
config_path: Option<&str>,
interval_seconds: u64,
profile: &str,
token_env_var: &str,
token_value: &str,
token_source: &str,
log_dir: &Path,
data_dir: &Path,
) -> Result<InstallResult>;
pub fn uninstall(service_id: &str) -> Result<UninstallResult>;
pub fn is_installed(service_id: &str) -> bool;
pub fn get_state(service_id: &str) -> Option<String>; // "loaded", "running", etc.
pub fn service_file_paths(service_id: &str) -> Vec<PathBuf>;
pub fn platform_name() -> &'static str;
/// Pre-flight check for platform-specific prerequisites.
/// Returns a list of diagnostic results.
pub fn check_prerequisites() -> Vec<DiagnosticCheck>;
pub struct DiagnosticCheck {
pub name: String,
pub status: DiagnosticStatus, // Pass, Warn, Fail
pub message: Option<String>,
pub action: Option<String>, // Suggested fix command
}
macOS: launchd (platform/launchd.rs)
Service file: ~/Library/LaunchAgents/com.gitlore.sync.{service_id}.plist
Label: com.gitlore.sync.{service_id}
Wrapper script approach: launchd cannot natively load environment files. Instead of embedding the token directly in the plist (which would persist it in a readable XML file), we generate a small wrapper shell script that reads the env file at runtime and execs lore. This keeps the token out of the plist entirely for the env-file strategy.
Wrapper script ({data_dir}/service-run-{service_id}.sh, mode 0700):
#!/bin/sh
# Generated by lore service install — do not edit
set -e
# Read token from env file (KEY=VALUE format) — never source/eval untrusted content
{token_env_var}="$(sed -n 's/^{token_env_var}=//p' "{data_dir}/service-env-{service_id}")"
export {token_env_var}
{config_export_line}
exec "{binary_path}" --robot service run --service-id "{service_id}"
Where {config_export_line} is either empty or export LORE_CONFIG_PATH="{config_path}".
Plist template (generated via format!(), no crate needed):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.gitlore.sync.{service_id}</string>
<key>ProgramArguments</key>
<array>
{program_arguments}
</array>
{env_dict}
<key>StartInterval</key>
<integer>{interval_seconds}</integer>
<key>RunAtLoad</key>
<true/>
<key>ProcessType</key>
<string>Background</string>
<key>Nice</key>
<integer>10</integer>
<key>LowPriorityIO</key>
<true/>
<key>StandardOutPath</key>
<string>{log_dir}/service-{service_id}-stdout.log</string>
<key>StandardErrorPath</key>
<string>{log_dir}/service-{service_id}-stderr.log</string>
<key>TimeOut</key>
<integer>600</integer>
</dict>
</plist>
Where {program_arguments} and {env_dict} depend on token_source:
- env-file (default): The plist invokes the wrapper script instead of lore directly. No token appears in the plist. {program_arguments} is:

  <string>{data_dir}/service-run-{service_id}.sh</string>

  {env_dict} is empty (the wrapper script handles environment setup).

- embedded: The plist invokes lore directly with the token embedded in EnvironmentVariables. {program_arguments} is:

  <string>{binary_path}</string>
  <string>--robot</string>
  <string>service</string>
  <string>run</string>
  <string>--service-id</string>
  <string>{service_id}</string>

  {env_dict} is:

  <key>EnvironmentVariables</key>
  <dict>
    <key>{token_env_var}</key>
    <string>{token_value}</string>
    {config_env_entry}
  </dict>
Where {config_env_entry} is either empty or:
<key>LORE_CONFIG_PATH</key>
<string>{config_path}</string>
XML escaping: The token value and paths must be XML-escaped. Write a helper fn xml_escape(s: &str) -> String that replaces &, <, >, ", ' with their XML entity equivalents. This is critical — tokens can contain & or <.
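A minimal sketch of that helper. Iterating per character (rather than chained `str::replace` calls) sidesteps double-escaping: each input character is classified exactly once, so an `&` produced by an earlier substitution can never be re-escaped.

```rust
// Escape the five XML special characters for safe embedding in plist
// <string> elements. Each input char is examined exactly once.
fn xml_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&apos;"),
            _ => out.push(c),
        }
    }
    out
}
```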
Install steps:
- std::fs::create_dir_all(plist_path.parent())
- std::fs::write(&plist_path, plist_content)
- Try launchctl bootstrap gui/{uid} {plist_path} via std::process::Command
- If that fails (older macOS), fall back to launchctl load {plist_path}
- Get UID via safe wrapper: fn current_uid() -> u32 { unsafe { libc::getuid() } } — isolated in a single-line function with an #[allow(unsafe_code)] exemption, since getuid() is trivially safe (no pointers, no mutation, always succeeds). Alternatively, use the nix crate's nix::unistd::Uid::current() if it is already a dependency.
Uninstall steps:
- Try launchctl bootout gui/{uid}/com.gitlore.sync.{service_id}
- If that fails, try launchctl unload {plist_path}
- std::fs::remove_file(&plist_path) (ignore the error if the file doesn't exist)
State detection:
- is_installed(service_id): check if the plist file exists on disk
- get_state(service_id): run launchctl list com.gitlore.sync.{service_id}, parse the exit code (0 = loaded, non-0 = not loaded)
- get_interval_seconds(service_id): read the plist file, find <key>StartInterval</key> then the next <integer> value via simple string search (no XML parser needed)
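The string-search extraction for get_interval_seconds could look like the sketch below. parse_start_interval is a hypothetical helper name; it assumes the <integer> element immediately follows the StartInterval key, which holds for our generated template but is not valid for arbitrary plists.

```rust
// Extract StartInterval from a generated plist via plain string search.
// Returns None if the key or a following <integer> element is missing.
fn parse_start_interval(plist: &str) -> Option<u64> {
    let key_pos = plist.find("<key>StartInterval</key>")?;
    let rest = &plist[key_pos..];
    // Locate the integer element that follows the key.
    let open = rest.find("<integer>")? + "<integer>".len();
    let close = rest[open..].find("</integer>")? + open;
    rest[open..close].trim().parse().ok()
}
```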
Platform prerequisites (check_prerequisites):
- Verify running in a GUI login session: check that launchctl print gui/{uid} succeeds. In SSH-only or headless contexts, launchd user agents won't load — return Warn with action "Log in via GUI or use SSH with ForwardAgent".
- This is a warning, not a hard block — some macOS setups (like launchctl asuser) can work around it.
Linux: systemd (platform/systemd.rs)
Service files:
- ~/.config/systemd/user/lore-sync-{service_id}.service
- ~/.config/systemd/user/lore-sync-{service_id}.timer
Service unit (hardened):
[Unit]
Description=Gitlore GitLab data sync ({service_id})
[Service]
Type=oneshot
ExecStart={binary_path} --robot service run --service-id {service_id}
WorkingDirectory={data_dir}
SuccessExitStatus=0
TimeoutStartSec=900
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths={data_dir}
{token_env_line}
{config_env_line}
Where {token_env_line} depends on token_source:
- env-file: EnvironmentFile={data_dir}/service-env-{service_id} (systemd natively supports this — true file-based loading, no embedding)
- embedded: Environment={token_env_var}={token_value}
Where {config_env_line} is either empty or Environment=LORE_CONFIG_PATH={config_path}.
Hardening notes:
- TimeoutStartSec=900 — kills stuck syncs after 15 minutes (generous but bounded)
- NoNewPrivileges=true — prevents privilege escalation
- PrivateTmp=true — isolated /tmp
- ProtectSystem=strict — read-only filesystem except explicitly allowed paths
- ProtectHome=read-only — read-only home directory
- ReadWritePaths={data_dir} — allows writing to the lore data directory (status files, logs, DB)
Timer unit:
[Unit]
Description=Gitlore sync timer ({service_id})
[Timer]
OnBootSec=5min
OnUnitInactiveSec={interval_seconds}s
AccuracySec=1min
Persistent=true
RandomizedDelaySec=60
[Install]
WantedBy=timers.target
Install steps:
- std::fs::create_dir_all(unit_dir)
- Write both files
- Run systemctl --user daemon-reload
- Run systemctl --user enable --now lore-sync-{service_id}.timer
Uninstall steps:
- Run systemctl --user disable --now lore-sync-{service_id}.timer (ignore error)
- Remove both files
- Run systemctl --user daemon-reload
State detection:
- is_installed(service_id): check if the timer file exists
- get_state(service_id): run systemctl --user is-active lore-sync-{service_id}.timer, capture stdout ("active", "inactive", etc.)
- get_interval_seconds(service_id): read the timer file, parse the OnUnitInactiveSec value
Platform prerequisites (check_prerequisites):
- User manager running: check that systemctl --user status exits 0. If not, return Fail with message "systemd user manager not running. Start a user session or contact your system administrator."
- Linger enabled: check that loginctl show-user $(whoami) --property=Linger returns Linger=yes. If not, return Warn with message "loginctl linger not enabled. Timer will not fire on reboot without an active login session." and action loginctl enable-linger $(whoami). This is a warning, not a block — the timer works fine when the user is logged in.
Windows: schtasks (platform/schtasks.rs)
Task name: LoreSync-{service_id}
Install:
schtasks /create /tn "LoreSync-{service_id}" /tr "\"{binary_path}\" --robot service run --service-id {service_id}" /sc minute /mo {interval_minutes} /f
Note: /mo requires minutes, so convert seconds to minutes (round up). Minimum is 1 minute (but we enforce 5 minutes at the parse level).
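The round-up conversion could be sketched as follows. interval_to_minutes is a hypothetical helper name; the clamp to 1 reflects the schtasks minimum, while the 5-minute floor is enforced earlier by parse_interval.

```rust
// Convert an interval in seconds to whole schtasks minutes, rounding up.
// schtasks /mo accepts a minimum of 1 minute, so clamp the result.
fn interval_to_minutes(seconds: u64) -> u64 {
    std::cmp::max(1, (seconds + 59) / 60)
}
```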
Token handling on Windows: The env var must be set system-wide via setx or be present in the user's environment. Neither env-file nor embedded strategies apply — Windows scheduled tasks inherit the user's environment. Set token_source: "system_env" in the result and document this as a requirement.
Uninstall:
schtasks /delete /tn "LoreSync-{service_id}" /f
State detection:
- is_installed(service_id): run schtasks /query /tn "LoreSync-{service_id}", check the exit code (0 = exists)
- get_state(service_id): parse the output of schtasks /query /tn "LoreSync-{service_id}" /fo CSV /v, extract the "Status" column
- get_interval_seconds(service_id): parse "Repeat: Every" from the verbose output, or store the value ourselves
Platform prerequisites (check_prerequisites):
- Verify schtasks is available: run schtasks /? and check the exit code. Return Fail if not found.
Interval Parsing
/// Parse interval strings like "15m", "1h", "30m", "2h", "24h"
/// Only minutes (m) and hours (h) are accepted — seconds are not exposed
/// because the minimum interval is 5 minutes and sub-minute granularity
/// would be confusing for a scheduled sync.
pub fn parse_interval(input: &str) -> std::result::Result<u64, String> {
let input = input.trim();
let (num_str, multiplier) = if let Some(n) = input.strip_suffix('m') {
(n, 60u64)
} else if let Some(n) = input.strip_suffix('h') {
(n, 3600u64)
} else {
return Err(format!(
"Invalid interval '{input}'. Use format like 15m, 30m, 1h, 2h"
));
};
let num: u64 = num_str
.parse()
.map_err(|_| format!("Invalid number in interval: '{num_str}'"))?;
if num == 0 {
return Err("Interval must be greater than 0".to_string());
}
let seconds = num * multiplier;
if seconds < 300 {
return Err(format!(
"Minimum interval is 5m (got {input}, which is {seconds}s)"
));
}
if seconds > 86400 {
return Err(format!(
"Maximum interval is 24h (got {input}, which is {seconds}s)"
));
}
Ok(seconds)
}
Error Types
Additions to src/core/error.rs
ErrorCode enum:
ServiceError, // Add after Ambiguous
ServiceCommandFailed, // OS command (launchctl/systemctl/schtasks) failed
ServiceCorruptState, // Manifest or status file is corrupt/unparseable
ErrorCode::exit_code():
Self::ServiceError => 21,
Self::ServiceCommandFailed => 22,
Self::ServiceCorruptState => 23,
ErrorCode::Display:
Self::ServiceError => "SERVICE_ERROR",
Self::ServiceCommandFailed => "SERVICE_COMMAND_FAILED",
Self::ServiceCorruptState => "SERVICE_CORRUPT_STATE",
LoreError enum:
#[error("Service error: {message}")]
ServiceError { message: String },
#[error("Service management not supported on this platform. Requires macOS (launchd), Linux (systemd), or Windows (schtasks).")]
ServiceUnsupported,
#[error("Service command failed: {cmd} (exit {exit_code:?}): {stderr}")]
ServiceCommandFailed {
cmd: String,
exit_code: Option<i32>,
stderr: String,
},
#[error("Service state file corrupt: {path}: {reason}")]
ServiceCorruptState {
path: String,
reason: String,
},
LoreError::code():
Self::ServiceError { .. } => ErrorCode::ServiceError,
Self::ServiceUnsupported => ErrorCode::ServiceError,
Self::ServiceCommandFailed { .. } => ErrorCode::ServiceCommandFailed,
Self::ServiceCorruptState { .. } => ErrorCode::ServiceCorruptState,
LoreError::suggestion():
Self::ServiceError { .. } => Some("Check service status: lore service status\nRun diagnostics: lore service doctor\nView logs: lore service logs"),
Self::ServiceUnsupported => Some("Requires macOS (launchd), Linux (systemd), or Windows (schtasks)"),
Self::ServiceCommandFailed { .. } => Some("Check service logs: lore service logs\nRun diagnostics: lore service doctor\nTry reinstalling: lore service install"),
Self::ServiceCorruptState { .. } => Some("Run: lore service repair\nThen reinstall if needed: lore service install"),
LoreError::actions():
Self::ServiceError { .. } => vec!["lore service status", "lore service doctor", "lore service logs"],
Self::ServiceUnsupported => vec![],
Self::ServiceCommandFailed { .. } => vec!["lore service logs", "lore service doctor", "lore service install"],
Self::ServiceCorruptState { .. } => vec!["lore service repair", "lore service install"],
CLI Definition Changes
src/cli/mod.rs
Add to Commands enum (after Who(WhoArgs), before the hidden commands):
/// Manage the OS-native scheduled sync service
Service {
#[command(subcommand)]
command: ServiceCommand,
},
Add the ServiceCommand enum (can be in the same file or re-exported from service/mod.rs):
#[derive(Subcommand)]
pub enum ServiceCommand {
/// Install the scheduled sync service
Install {
/// Sync interval (e.g., 15m, 30m, 1h). Default: 30m. Min: 5m. Max: 24h.
#[arg(long, default_value = "30m")]
interval: String,
/// Sync profile: fast (issues+MRs), balanced (+ docs), full (+ embeddings)
#[arg(long, default_value = "balanced")]
profile: String,
/// Token storage: env-file (default, 0600 perms) or embedded (in service file)
#[arg(long, default_value = "env-file")]
token_source: String,
/// Custom service name (default: derived from config path hash).
/// Useful when managing multiple installations for readability.
#[arg(long)]
name: Option<String>,
/// Validate and render service files without writing or executing anything
#[arg(long)]
dry_run: bool,
},
/// Remove the scheduled sync service
Uninstall {
/// Target a specific service by ID or name (default: current project)
#[arg(long)]
service: Option<String>,
/// Uninstall all services
#[arg(long)]
all: bool,
},
/// List all installed services
List,
/// Show service status and last sync result
Status {
/// Target a specific service by ID or name (default: current project)
#[arg(long)]
service: Option<String>,
},
/// View service logs
Logs {
/// Show last N lines (default: 100)
#[arg(long)]
tail: Option<Option<usize>>,
/// Stream new log lines as they arrive (like tail -f)
#[arg(long)]
follow: bool,
/// Open log file in editor instead of printing to stdout
#[arg(long)]
open: bool,
/// Target a specific service by ID or name
#[arg(long)]
service: Option<String>,
},
/// Clear paused state and reset failure counter
Resume {
/// Target a specific service by ID or name (default: current project)
#[arg(long)]
service: Option<String>,
},
/// Pause scheduled execution without uninstalling
Pause {
/// Reason for pausing (shown in status output)
#[arg(long)]
reason: Option<String>,
/// Target a specific service by ID or name (default: current project)
#[arg(long)]
service: Option<String>,
},
/// Trigger an immediate one-off sync using installed profile
Trigger {
/// Bypass backoff window (still respects paused state)
#[arg(long)]
ignore_backoff: bool,
/// Target a specific service by ID or name (default: current project)
#[arg(long)]
service: Option<String>,
},
/// Repair corrupt manifest or status files
Repair {
/// Target a specific service by ID or name (default: current project)
#[arg(long)]
service: Option<String>,
},
/// Validate service environment and prerequisites
Doctor {
/// Skip network checks (token validation)
#[arg(long)]
offline: bool,
/// Attempt safe, non-destructive fixes for detected issues
#[arg(long)]
fix: bool,
},
/// Execute one scheduled sync attempt (called by OS scheduler, hidden from help)
#[command(hide = true)]
Run {
/// Internal selector injected by scheduler backend — identifies which
/// service manifest and status file to use for this run.
#[arg(long, hide = true)]
service_id: String,
},
}
src/cli/commands/mod.rs
Add:
pub mod service;
No re-exports needed — the dispatch goes through service::handle_install, etc. directly.
src/main.rs dispatch
Add import:
use lore::cli::ServiceCommand;
Add match arm (before the hidden commands):
Some(Commands::Service { command }) => {
handle_service(cli.config.as_deref(), command, robot_mode)
}
Add handler function:
fn handle_service(
config_override: Option<&str>,
command: ServiceCommand,
robot_mode: bool,
) -> Result<(), Box<dyn std::error::Error>> {
let start = std::time::Instant::now();
match command {
ServiceCommand::Install { interval, profile, token_source, name, dry_run } => {
lore::cli::commands::service::handle_install(
config_override, &interval, &profile, &token_source, name.as_deref(),
dry_run, robot_mode, start,
)
}
ServiceCommand::Uninstall { service, all } => {
lore::cli::commands::service::handle_uninstall(service.as_deref(), all, robot_mode, start)
}
ServiceCommand::List => {
lore::cli::commands::service::handle_list(robot_mode, start)
}
ServiceCommand::Status { service } => {
lore::cli::commands::service::handle_status(config_override, service.as_deref(), robot_mode, start)
}
ServiceCommand::Logs { tail, follow, open, service } => {
lore::cli::commands::service::handle_logs(tail, follow, open, service.as_deref(), robot_mode, start)
}
ServiceCommand::Resume { service } => {
lore::cli::commands::service::handle_resume(service.as_deref(), robot_mode, start)
}
ServiceCommand::Pause { reason, service } => {
lore::cli::commands::service::handle_pause(service.as_deref(), reason.as_deref(), robot_mode, start)
}
ServiceCommand::Trigger { ignore_backoff, service } => {
lore::cli::commands::service::handle_trigger(service.as_deref(), ignore_backoff, robot_mode, start)
}
ServiceCommand::Repair { service } => {
lore::cli::commands::service::handle_repair(service.as_deref(), robot_mode, start)
}
ServiceCommand::Doctor { offline, fix } => {
lore::cli::commands::service::handle_doctor(config_override, offline, fix, robot_mode, start)
}
ServiceCommand::Run { service_id } => {
// Always robot mode for scheduled execution
lore::cli::commands::service::handle_service_run(&service_id, start)
}
}
}
Autocorrect Registry
src/cli/autocorrect.rs
Add to COMMAND_FLAGS array (before the hidden commands):
("service", &["--interval", "--profile", "--token-source", "--name", "--dry-run", "--tail", "--follow", "--open", "--offline", "--fix", "--service", "--all", "--reason", "--ignore-backoff"]),
Important: The registry_covers_command_flags test in autocorrect.rs uses clap introspection to verify all flags are registered. Since service is a nested subcommand, verify whether this test recurses into subcommands. If it does, the test will fail without this entry. If it doesn't recurse (only checks top-level subcommands), the test passes but we should still add the entry for correctness.
Looking at the test (lines 868-908): it iterates cmd.get_subcommands() which gets the top-level subcommands. The Service variant uses #[command(subcommand)] which means clap will show service as a subcommand with its own sub-subcommands. The test won't recurse into install's flags, but service itself has no direct flags (only subcommands do), so an empty entry or omission would pass the test. Adding ("service", &["--interval"]) is conservative and correct — the --interval flag lives on the install sub-subcommand but won't cause issues.
However, detect_subcommand only finds the first positional arg. For lore service install --intervl 30m, it returns "service", not "install". So the --interval flag needs to be registered under "service" for fuzzy matching.
robot-docs Manifest
Addition to handle_robot_docs in src/main.rs
Add to the commands JSON object:
"service": {
"description": "Manage OS-native scheduled sync service",
"subcommands": {
"install": {
"description": "Install scheduled sync service",
"flags": ["--interval <duration>", "--profile <fast|balanced|full>", "--token-source <env-file|embedded>", "--name <optional>", "--dry-run"],
"defaults": { "interval": "30m", "profile": "balanced", "token_source": "env-file" },
"example": "lore --robot service install --interval 15m --profile fast",
"response_schema": {
"ok": "bool",
"data.platform": "string (launchd|systemd|schtasks)",
"data.service_id": "string",
"data.interval_seconds": "number",
"data.profile": "string",
"data.binary_path": "string",
"data.service_files": "[string]",
"data.token_source": "string (env_file|embedded|system_env)",
"data.no_change": "bool"
}
},
"uninstall": {
"description": "Remove scheduled sync service",
"flags": ["--service <service_id|name>", "--all"],
"example": "lore --robot service uninstall",
"response_schema": {
"ok": "bool",
"data.was_installed": "bool",
"data.service_id": "string",
"data.removed_files": "[string]"
}
},
"list": {
"description": "List all installed services",
"example": "lore --robot service list",
"response_schema": {
"ok": "bool",
"data.services": "[{service_id, platform, interval_seconds, profile, installed_at_iso, platform_state, drift}]"
}
},
"status": {
"description": "Show service status, scheduler state, and recent runs",
"flags": ["--service <service_id|name>"],
"example": "lore --robot service status",
"response_schema": {
"ok": "bool",
"data.installed": "bool",
"data.service_id": "string|null",
"data.platform": "string",
"data.interval_seconds": "number|null",
"data.profile": "string|null",
"data.scheduler_state": "string (idle|running|running_stale|degraded|backoff|half_open|paused|not_installed)",
"data.last_sync": "SyncRunRecord|null",
"data.recent_runs": "[SyncRunRecord]",
"data.backoff": "object|null",
"data.paused_reason": "string|null",
"data.drift": "object|null {platform_drift: bool, spec_drift: bool, command_drift: bool}"
}
},
"logs": {
"description": "View service logs (human: editor/tail, robot: path + optional lines)",
"flags": ["--tail <n>", "--follow"],
"example": "lore --robot service logs --tail 50",
"response_schema": {
"ok": "bool",
"data.log_path": "string",
"data.exists": "bool",
"data.size_bytes": "number",
"data.last_lines": "[string]|null"
}
},
"resume": {
"description": "Clear paused state and reset failure counter",
"example": "lore --robot service resume",
"response_schema": {
"ok": "bool",
"data.was_paused": "bool",
"data.previous_reason": "string|null",
"data.consecutive_failures_cleared": "number"
}
},
"pause": {
"description": "Pause scheduled execution without uninstalling",
"flags": ["--reason <text>", "--service <service_id|name>"],
"example": "lore --robot service pause --reason 'maintenance'",
"response_schema": {
"ok": "bool",
"data.service_id": "string",
"data.paused": "bool",
"data.reason": "string",
"data.already_paused": "bool"
}
},
"trigger": {
"description": "Trigger immediate one-off sync using installed profile",
"flags": ["--ignore-backoff", "--service <service_id|name>"],
"example": "lore --robot service trigger",
"response_schema": "Same as service run output"
},
"repair": {
"description": "Repair corrupt manifest or status files",
"flags": ["--service <service_id|name>"],
"example": "lore --robot service repair",
"response_schema": {
"ok": "bool",
"data.repaired": "bool",
"data.actions": "[{file, action, backup?}]",
"data.needs_reinstall": "bool"
}
},
"doctor": {
"description": "Validate service environment and prerequisites",
"flags": ["--offline", "--fix"],
"example": "lore --robot service doctor",
"response_schema": {
"ok": "bool",
"data.checks": "[{name, status, message?, action?}]",
"data.overall": "string (pass|warn|fail)"
}
}
}
}
Paths Module Additions
src/core/paths.rs
pub fn get_service_status_path(service_id: &str) -> PathBuf {
get_data_dir().join(format!("sync-status-{service_id}.json"))
}
pub fn get_service_manifest_path(service_id: &str) -> PathBuf {
get_data_dir().join(format!("service-manifest-{service_id}.json"))
}
pub fn get_service_env_path(service_id: &str) -> PathBuf {
get_data_dir().join(format!("service-env-{service_id}"))
}
pub fn get_service_wrapper_path(service_id: &str) -> PathBuf {
get_data_dir().join(format!("service-run-{service_id}.sh"))
}
pub fn get_service_log_path(service_id: &str, stream: &str) -> PathBuf {
get_data_dir().join("logs").join(format!("service-{service_id}-{stream}.log"))
}
// stream values: "stdout" or "stderr"
// Example: get_service_log_path("a1b2c3d4e5f6", "stderr")
// => ~/.local/share/lore/logs/service-a1b2c3d4e5f6-stderr.log
/// List all installed service IDs by scanning for manifest files.
pub fn list_service_ids() -> Vec<String> {
    let data_dir = get_data_dir();
    // No data directory yet means no services are installed.
    let Ok(entries) = std::fs::read_dir(&data_dir) else {
        return Vec::new();
    };
    entries
        .filter_map(|entry| {
            let name = entry.ok()?.file_name().to_string_lossy().to_string();
            name.strip_prefix("service-manifest-")
                .and_then(|s| s.strip_suffix(".json"))
                .map(String::from)
        })
        .collect()
}
Note: Status files are scoped by service_id — each installed service gets independent backoff/paused/circuit-breaker state. The pipeline lock remains global (sync_pipeline) to prevent overlapping writes to the shared database.
Core Module Registration
src/core/mod.rs
Add:
pub mod sync_status;
pub mod service_manifest;
File-by-File Implementation Details
src/core/sync_status.rs (NEW)
- SyncRunRecord struct with Serialize + Deserialize + Clone
- StageResult struct with Serialize + Deserialize + Clone
- SyncStatusFile struct with Serialize + Deserialize + Default (schema_version=1)
- Clock trait + SystemClock impl (for deterministic testing)
- JitterRng trait + ThreadJitterRng impl (for deterministic jitter testing)
- parse_interval(input: &str) -> Result<u64, String>
- SyncStatusFile::read(path: &Path) -> Result<Option<Self>, LoreError> — distinguishes missing from corrupt
- SyncStatusFile::write_atomic(&self, path: &Path) -> std::io::Result<()> — tmp+fsync+rename
- SyncStatusFile::record_run(&mut self, run: SyncRunRecord) — push to recent_runs (capped at 10)
- SyncStatusFile::clear_paused(&mut self) — reset paused_reason, errors, failures, next_retry_at_ms
- SyncStatusFile::backoff_remaining(&self, clock: &dyn Clock) -> Option<u64> — reads persisted next_retry_at_ms
- SyncStatusFile::set_backoff(&mut self, base_interval_seconds, clock, rng) — compute and persist next_retry_at_ms
- fn is_permanent_error(code: &ErrorCode) -> bool
- fn is_permanent_stage_error(stage: &StageResult) -> bool — primary: error_code, fallback: string matching
- SyncStatusFile::is_circuit_breaker_half_open(&self, manifest: &ServiceManifest, clock: &dyn Clock) -> bool — checks if the cooldown has expired
- Unit tests for all of the above
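The exponential-backoff arithmetic inside set_backoff can be sketched as a pure function, which keeps it trivially unit-testable with the Clock and JitterRng fakes. This is a minimal sketch: the function name, the exponent cap of 6, the 24h ceiling, and the jitter fraction are illustrative assumptions, not decided values.

```rust
// Sketch of the delay computation behind set_backoff:
// delay = base_interval * 2^failures, capped, plus optional jitter.
fn backoff_delay_secs(base_interval_secs: u64, consecutive_failures: u32, jitter_frac: f64) -> u64 {
    // Cap the exponent so repeated failures can't overflow or produce
    // absurd waits; 2^6 = 64x the base interval is already generous.
    let exp = consecutive_failures.min(6);
    let raw = base_interval_secs.saturating_mul(1u64 << exp);
    // Never schedule the next retry more than 24h out.
    let capped = raw.min(24 * 3600);
    // Jitter spreads retries from multiple installs; the caller supplies
    // a fraction in [0, jitter_frac) from the JitterRng trait.
    let jitter = (capped as f64 * jitter_frac) as u64;
    capped + jitter
}
```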
src/core/service_manifest.rs (NEW)
- ServiceManifest struct with Serialize + Deserialize (schema_version=1), includes workspace_root and spec_hash
- ServiceManifest::read(path: &Path) -> Result<Option<Self>, LoreError> — distinguishes missing from corrupt
- ServiceManifest::write_atomic(&self, path: &Path) -> std::io::Result<()> — tmp+fsync+rename
- ServiceManifest::profile_to_sync_args(&self) -> Vec<String> — maps profile to sync CLI flags
- compute_service_id(workspace_root: &Path, config_path: &Path, project_urls: &[&str]) -> String — composite fingerprint (workspace root + config path + sorted project URLs), first 12 hex chars of SHA-256
- sanitize_service_name(name: &str) -> Result<String, String> — [a-z0-9-], max 32 chars
- DiagnosticCheck struct, DiagnosticStatus enum (Pass/Warn/Fail)
- Unit tests for profile mapping, service_id computation, name sanitization
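A sketch of sanitize_service_name under the constraints above. Lowercasing input (rather than rejecting uppercase outright) is an assumption made here for illustration; the final implementation may prefer to reject.

```rust
// Validate a user-supplied --name: lowercase, [a-z0-9-] only, 1-32 chars.
// Returns the normalized name or a human-readable error string.
fn sanitize_service_name(name: &str) -> Result<String, String> {
    let name = name.trim().to_lowercase();
    if name.is_empty() || name.len() > 32 {
        return Err("Service name must be 1-32 characters".to_string());
    }
    if !name
        .chars()
        .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-')
    {
        return Err(format!("Invalid service name '{name}': only [a-z0-9-] allowed"));
    }
    Ok(name)
}
```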
src/cli/commands/service/mod.rs (NEW)
- Re-exports from submodules: handle_install, handle_uninstall, handle_list, handle_status, handle_logs, handle_resume, handle_pause, handle_trigger, handle_repair, handle_doctor, handle_service_run
- Shared resolve_service_id(selector: Option<&str>, config_override: Option<&str>) -> Result<String> helper: resolves the --service flag, or derives the ID from the current config path. If multiple services exist and no selector is provided, returns an actionable error listing the available services.
- Shared acquire_admin_lock(service_id: &str) -> Result<AppLock> helper: acquires AppLock("service-admin-{service_id}") for state mutation commands. Used by install, uninstall, pause, resume, and repair. NOT used by service run (which only acquires sync_pipeline).
- Imports from submodules
src/cli/commands/service/install.rs (NEW)
- handle_install(config_override, interval_str, profile, token_source, name, dry_run, robot_mode, start) -> Result<()>
- Validates profile is one of fast|balanced|full
- Validates token_source is one of env-file|embedded
- Computes or validates service_id from --name or the composite fingerprint (workspace root + config path + project URLs). If --name is provided and collides with an existing service with a different identity hash, returns an actionable error.
- Acquires the admin lock AppLock("service-admin-{service_id}") before mutating any files
- Runs doctor pre-flight checks; aborts on any Fail result
- Loads config, resolves token, resolves binary path
- Writes token to env file (if env-file strategy, scoped by service_id)
- On macOS with env-file: generates wrapper script at {data_dir}/service-run-{service_id}.sh (mode 0700)
- Calls platform::install(service_id, ...)
- Transactional: on enable success, writes the install manifest atomically. On enable failure, removes generated service files and the wrapper script, and returns ServiceCommandFailed.
- Compares with the existing manifest to detect the no-change case
- Prints result (robot JSON or human-readable)
src/cli/commands/service/uninstall.rs (NEW)
- `handle_uninstall(service_selector, all, robot_mode, start) -> Result<()>`
- Resolves target service via selector or current-project default
- With `--all`: iterates all discovered manifests
- Reads manifest to find service_id
- Calls `platform::uninstall(service_id)`
- Removes install manifest (`service-manifest-{service_id}.json`)
- Removes env file (`service-env-{service_id}`) if exists
- Removes wrapper script (`service-run-{service_id}.sh`) if exists (macOS)
- Does NOT remove the status file or log files (those are operational data, not config)
- Outputs confirmation
src/cli/commands/service/status.rs (NEW)
- `handle_status(config_override, robot_mode, start) -> Result<()>`
- Reads install manifest (primary source for config and service_id)
- Calls `platform::is_installed(service_id)`, `get_state(service_id)` to verify platform state
- Detects drift: platform drift (loaded/unloaded), spec drift (content hash vs `spec_hash`), command drift
- Reads `SyncStatusFile` for last sync and recent runs
- Detects stale runs via `current_run` metadata: checks if PID is alive and `started_at_ms` is within 30 minutes
- Computes scheduler state from status + manifest (including `degraded`, `running_stale`)
- Computes backoff info from persisted `next_retry_at_ms`
- Prints combined status
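The stale-run check can be sketched as a pure predicate. The 30-minute threshold comes from the plan; the `/proc`-based liveness probe is a Linux-only assumption for illustration (a portable implementation would send signal 0 to the PID instead):

```rust
use std::path::Path;

/// 30-minute staleness threshold from the plan.
const STALE_AFTER_MS: i64 = 30 * 60 * 1000;

/// Liveness probe via /proc — Linux-only assumption for this sketch.
fn pid_alive(pid: u32) -> bool {
    Path::new(&format!("/proc/{pid}")).exists()
}

/// A recorded current_run is stale if its process is gone, or if it has
/// been running longer than the threshold (likely hung).
fn run_is_stale(pid: u32, started_at_ms: i64, now_ms: i64) -> bool {
    !pid_alive(pid) || now_ms - started_at_ms > STALE_AFTER_MS
}
```

Keeping the predicate free of clock and process lookups at the call site (both are passed in or trivially stubbed) keeps it testable with the same FakeClock pattern used elsewhere in this plan.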
src/cli/commands/service/logs.rs (NEW)
- `handle_logs(tail, follow, robot_mode, start) -> Result<()>`
- `--tail`: read last N lines, output directly to stdout (or as JSON array in robot mode)
- `--follow`: stream new lines (human mode only; robot mode returns error)
- Default (no flags): print last 100 lines to stdout (human) or return path metadata (robot)
- Robot mode with `--tail`: includes `last_lines` field (capped at 100)
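A minimal sketch of the `--tail` read path. Reading the whole file is acceptable here because the service's log rotation keeps files bounded; a production version might seek backward from the end instead:

```rust
/// Return the last `n` lines of a log file (sketch; assumes rotation
/// keeps the file small enough to read whole).
fn tail_lines(path: &std::path::Path, n: usize) -> std::io::Result<Vec<String>> {
    let content = std::fs::read_to_string(path)?;
    let lines: Vec<&str> = content.lines().collect();
    // Keep only the final n lines (or all of them, if fewer).
    let start = lines.len().saturating_sub(n);
    Ok(lines[start..].iter().map(|s| s.to_string()).collect())
}
```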
src/cli/commands/service/resume.rs (NEW)
- `handle_resume(robot_mode, start) -> Result<()>`
- Reads status file, clears paused state (including circuit breaker), writes back atomically
- Prints confirmation with previous reason
src/cli/commands/service/doctor.rs (NEW)
- `handle_doctor(config_override, offline, fix, robot_mode, start) -> Result<()>`
- Runs diagnostic checks: config, token, binary, data dir, platform prerequisites, install state
- Skips network checks when `--offline`
- `--fix`: attempts safe, non-destructive remediations (create dirs, fix permissions, daemon-reload). Reports each applied fix.
- Reports pass/warn/fail per check
- Also used as pre-flight by `handle_install` (as an internal function call, without `--fix`)
src/cli/commands/service/run.rs (NEW)
- `handle_service_run(service_id: &str, start) -> Result<()>`
- The hidden scheduled execution entrypoint; `service_id` is injected by the scheduler command line
- Reads manifest for the given `service_id` to get profile/interval/max_transient_failures/circuit_breaker_cooldown_seconds
- Checks paused state with half-open transition (cooldown check), backoff (via persisted next_retry_at_ms), pipeline lock
- Writes `current_run` metadata (started_at_ms, pid) to status file before sync for stale-run detection; clears it on completion
- Executes sync stage-by-stage, records per-stage outcomes with `error_code` propagation
- Classifies: success / degraded / failed
- Respects server-provided `Retry-After` hints when computing backoff (via `extract_retry_after_hint`)
- Circuit breaker check on transient failure count; records `circuit_breaker_paused_at_ms` for cooldown
- Half-open probe: if probe succeeds, auto-closes circuit breaker; if fails, returns to paused with new timestamp
- Performs log rotation check before executing sync
- Updates status atomically
- Always robot mode, always exit 0
src/cli/commands/service/list.rs (NEW)
- `handle_list(robot_mode, start) -> Result<()>`
- Scans `{data_dir}` for `service-manifest-*.json` files
- Reads each manifest, verifies platform state, detects drift
- Outputs summary in robot JSON or human-readable table
src/cli/commands/service/pause.rs (NEW)
- `handle_pause(service_selector, reason, robot_mode, start) -> Result<()>`
- Resolves service, writes `paused_reason` to status file
- Does NOT modify OS scheduler (service stays installed and scheduled — it just no-ops)
- Reports `already_paused: true` if already paused (updates reason)
src/cli/commands/service/trigger.rs (NEW)
- `handle_trigger(service_selector, ignore_backoff, robot_mode, start) -> Result<()>`
- Resolves service, reads manifest for profile
- Delegates to `handle_service_run` logic with optional backoff bypass
- Still respects paused state (use `resume` first)
src/cli/commands/service/repair.rs (NEW)
- `handle_repair(service_selector, robot_mode, start) -> Result<()>`
- Validates manifest and status files for JSON parseability
- Corrupt files: renamed to `{name}.corrupt.{timestamp}` (backup, never delete)
- Status file: reinitialized to default
- Manifest: cleared, advises reinstall
- Reports what was repaired
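The quarantine step can be sketched directly; the `quarantine_corrupt` helper name is illustrative, not part of the plan's API:

```rust
use std::path::{Path, PathBuf};

/// Rename a corrupt state file aside (never delete) so operators can
/// inspect it later. Name format follows the plan: {name}.corrupt.{timestamp}.
fn quarantine_corrupt(path: &Path, now_ms: i64) -> std::io::Result<PathBuf> {
    let name = path.file_name().and_then(|n| n.to_str()).unwrap_or("state");
    let backup = path.with_file_name(format!("{name}.corrupt.{now_ms}"));
    std::fs::rename(path, &backup)?;
    Ok(backup)
}
```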
src/cli/commands/service/platform/mod.rs (NEW)
- `#[cfg]`-gated imports and dispatch functions (all take `service_id`)
- `fn xml_escape(s: &str) -> String` helper (used by launchd)
- `fn run_cmd(program, args, timeout_secs) -> Result<String>` — shared command runner with kill+reap on timeout
- `fn wait_with_timeout_kill_and_reap(child, timeout_secs) -> Result<Output>` — timeout handler that kills and reaps child process
- `fn write_token_env_file(data_dir, service_id, token_env_var, token_value) -> Result<PathBuf>` — token storage
- `fn write_wrapper_script(data_dir, service_id, binary_path, token_env_var, config_path) -> Result<PathBuf>` — macOS wrapper script for runtime env loading (mode 0700)
- `fn check_prerequisites() -> Vec<DiagnosticCheck>` — platform-specific pre-flight
- `fn write_atomic(path: &Path, content: &str) -> std::io::Result<()>` — shared atomic write helper (tmp + fsync(file) + rename + fsync(parent_dir) for power-loss durability)
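A sketch of the shared `write_atomic` helper described above. The `.tmp` sibling-file naming is an assumption; the plan only specifies tmp + fsync + rename + parent-directory fsync:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

/// Write to a sibling tmp file, fsync it, rename over the target, then
/// fsync the parent directory so the rename itself survives power loss.
fn write_atomic(path: &Path, content: &str) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp"); // sibling-file naming is an assumption
    let mut f = File::create(&tmp)?;
    f.write_all(content.as_bytes())?;
    f.sync_all()?; // fsync file contents before the rename
    fs::rename(&tmp, path)?; // atomic replace on POSIX filesystems
    if let Some(parent) = path.parent() {
        // fsync the directory entry; ignore failure where unsupported
        let _ = File::open(parent).and_then(|d| d.sync_all());
    }
    Ok(())
}
```

A reader of the target path sees either the old contents or the new contents, never a partial write, which is what the status-file tests below rely on.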
src/cli/commands/service/platform/launchd.rs (NEW, #[cfg(target_os = "macos")])
- `fn plist_path(service_id: &str) -> PathBuf` — `~/Library/LaunchAgents/com.gitlore.sync.{service_id}.plist`
- `fn generate_plist(service_id, binary_path, config_path, interval_seconds, token_env_var, token_value, token_source, log_dir, data_dir) -> String` — generates plist with wrapper script (env-file) or direct invocation (embedded)
- `fn generate_plist_with_wrapper(service_id, wrapper_path, interval_seconds, log_dir) -> String` — env-file variant: ProgramArguments points to wrapper script
- `fn generate_plist_with_embedded(service_id, binary_path, config_path, interval_seconds, token_env_var, token_value, log_dir) -> String` — embedded variant: token in EnvironmentVariables
- `fn install(service_id, ...) -> Result<InstallResult>`
- `fn uninstall(service_id) -> Result<UninstallResult>`
- `fn is_installed(service_id) -> bool`
- `fn get_state(service_id) -> Option<String>`
- `fn get_interval_seconds(service_id) -> u64`
- `fn check_prerequisites() -> Vec<DiagnosticCheck>` — GUI session check
- Unit tests: `test_generate_plist_with_wrapper()` — verify wrapper path in ProgramArguments, no token in plist
- Unit tests: `test_generate_plist_with_embedded()` — verify token in EnvironmentVariables
- Unit tests: XML escaping, service_id in label
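A sketch of `xml_escape` as the launchd tests exercise it. Escaping `&` first avoids double-escaping the entities produced by later replacements; exactly which characters beyond `&`, `<`, `>` the real helper escapes is an assumption, but escaping the quotes too keeps it safe in attribute context:

```rust
/// Escape the five XML special characters. The '&' replacement must run
/// first, otherwise the '&' in "&lt;" etc. would be escaped again.
fn xml_escape(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
        .replace('\'', "&apos;")
}
```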
src/cli/commands/service/platform/systemd.rs (NEW, #[cfg(target_os = "linux")])
- `fn unit_dir() -> PathBuf` — `~/.config/systemd/user/`
- `fn generate_service(service_id, binary_path, config_path, token_env_var, token_value, token_source, data_dir) -> String` — includes hardening directives
- `fn generate_timer(service_id, interval_seconds) -> String`
- `fn install(service_id, ...) -> Result<InstallResult>`
- `fn uninstall(service_id) -> Result<UninstallResult>`
- Same query functions as launchd (all scoped by service_id)
- `fn check_prerequisites() -> Vec<DiagnosticCheck>` — user manager + linger checks
- Unit tests: `test_generate_service()` (both env-file and embedded, verify hardening), `test_generate_timer()`
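A sketch of `generate_timer` consistent with the timer-unit test in this plan (service_id in the description, `OnUnitInactiveSec` interval). The `RandomizedDelaySec` line and the exact `Description` wording are assumptions, not specified by the plan:

```rust
/// Generate a systemd user timer unit. OnUnitInactiveSec (rather than
/// OnCalendar) measures the interval from the end of the previous run,
/// so a slow sync never overlaps the next trigger.
fn generate_timer(service_id: &str, interval_seconds: u64) -> String {
    format!(
        "[Unit]\nDescription=gitlore scheduled sync ({service_id})\n\n\
         [Timer]\nOnUnitInactiveSec={interval_seconds}s\nRandomizedDelaySec=60\n\n\
         [Install]\nWantedBy=timers.target\n"
    )
}
```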
src/cli/commands/service/platform/schtasks.rs (NEW, #[cfg(target_os = "windows")])
- `fn install(service_id, ...) -> Result<InstallResult>`
- `fn uninstall(service_id) -> Result<UninstallResult>`
- Same query functions (scoped by service_id)
- `fn check_prerequisites() -> Vec<DiagnosticCheck>` — schtasks availability
- Note: `token_source: "system_env"` — token must be in system environment
Testing Strategy
Test Infrastructure
Fake clock for deterministic time-dependent tests:
/// Test clock with controllable time
struct FakeClock {
now_ms: i64,
}
impl Clock for FakeClock {
fn now_ms(&self) -> i64 {
self.now_ms
}
}
Fake RNG for deterministic jitter tests:
/// Test RNG that returns a predetermined sequence of values
struct FakeJitterRng {
values: Vec<f64>,
index: usize,
}
impl FakeJitterRng {
fn new(values: Vec<f64>) -> Self {
Self { values, index: 0 }
}
}
impl JitterRng for FakeJitterRng {
fn next_f64(&mut self) -> f64 {
let val = self.values[self.index % self.values.len()];
self.index += 1;
val
}
}
This eliminates all time- and randomness-dependent flakiness. Every test sets an explicit "now" and jitter value, then asserts exact results.
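The backoff tests below pin down a specific shape: exponential doubling per failure, jitter as a multiplier in [0, 1], a floor at the base interval, a 4-hour cap, and a server `Retry-After` hint that wins only when it is larger than the computed value. One pure-function formulation consistent with those expectations (the exact formula is an assumption reverse-engineered from the test cases, not stated elsewhere in the plan):

```rust
/// Pure backoff computation in seconds (sketch; the real set_backoff
/// persists the result as next_retry_at_ms). failures >= 1, jitter in [0, 1].
fn compute_backoff_secs(base: u64, failures: u32, jitter: f64, hint: Option<u64>) -> u64 {
    const CAP_SECS: u64 = 4 * 60 * 60; // 4-hour ceiling
    // Double once per failure beyond the first; exponent bounded to avoid overflow.
    let doubled = base.saturating_mul(2u64.saturating_pow(failures.saturating_sub(1).min(16)));
    let jittered = (doubled.min(CAP_SECS) as f64 * jitter) as u64;
    // Never retry sooner than the configured interval, never later than the cap.
    let computed = jittered.clamp(base.min(CAP_SECS), CAP_SECS);
    match hint {
        Some(h) if h > computed => h, // server Retry-After wins only when larger
        _ => computed,
    }
}
```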
Unit Tests (in src/core/sync_status.rs)
#[cfg(test)]
mod tests {
use super::*;
use tempfile::TempDir;
struct FakeClock { now_ms: i64 }
impl Clock for FakeClock {
fn now_ms(&self) -> i64 { self.now_ms }
}
struct FakeJitterRng { value: f64 }
impl FakeJitterRng {
fn new(value: f64) -> Self {
Self { value }
}
}
impl JitterRng for FakeJitterRng {
fn next_f64(&mut self) -> f64 {
self.value
}
}
// --- Interval parsing ---
#[test]
fn parse_interval_valid_minutes() {
assert_eq!(parse_interval("5m").unwrap(), 300);
assert_eq!(parse_interval("15m").unwrap(), 900);
assert_eq!(parse_interval("30m").unwrap(), 1800);
}
#[test]
fn parse_interval_valid_hours() {
assert_eq!(parse_interval("1h").unwrap(), 3600);
assert_eq!(parse_interval("2h").unwrap(), 7200);
assert_eq!(parse_interval("24h").unwrap(), 86400);
}
#[test]
fn parse_interval_too_short() {
assert!(parse_interval("1m").is_err());
assert!(parse_interval("4m").is_err());
}
#[test]
fn parse_interval_too_long() {
assert!(parse_interval("25h").is_err());
}
#[test]
fn parse_interval_invalid() {
assert!(parse_interval("0m").is_err());
assert!(parse_interval("abc").is_err());
assert!(parse_interval("").is_err());
assert!(parse_interval("m").is_err());
assert!(parse_interval("10x").is_err());
assert!(parse_interval("30s").is_err()); // seconds not supported
}
#[test]
fn parse_interval_trims_whitespace() {
assert_eq!(parse_interval(" 30m ").unwrap(), 1800);
}
// --- Status file persistence ---
#[test]
fn status_file_round_trip() {
let dir = TempDir::new().unwrap();
let path = dir.path().join("sync-status-test1234.json");
let mut status = SyncStatusFile::default();
let run = SyncRunRecord {
timestamp_iso: "2026-02-09T10:30:00Z".to_string(),
timestamp_ms: 1_770_609_000_000,
duration_seconds: 12.5,
outcome: "success".to_string(),
stage_results: vec![
StageResult { stage: "issues".into(), success: true, items_updated: 5, error: None, error_code: None },
StageResult { stage: "mrs".into(), success: true, items_updated: 3, error: None, error_code: None },
],
error_message: None,
};
status.record_run(run);
status.write_atomic(&path).unwrap();
let loaded = SyncStatusFile::read(&path).unwrap().unwrap();
assert_eq!(loaded.last_run.as_ref().unwrap().outcome, "success");
assert_eq!(loaded.last_run.as_ref().unwrap().stage_results.len(), 2);
assert_eq!(loaded.consecutive_failures, 0);
assert_eq!(loaded.recent_runs.len(), 1);
assert_eq!(loaded.schema_version, 1);
}
#[test]
fn status_file_read_missing_returns_ok_none() {
let dir = TempDir::new().unwrap();
let path = dir.path().join("nonexistent.json");
assert!(SyncStatusFile::read(&path).unwrap().is_none());
}
#[test]
fn status_file_read_corrupt_returns_err() {
let dir = TempDir::new().unwrap();
let path = dir.path().join("corrupt.json");
std::fs::write(&path, "not valid json{{{").unwrap();
assert!(SyncStatusFile::read(&path).is_err());
}
#[test]
fn status_file_atomic_write_survives_crash() {
// Verify no partial writes by checking file is valid JSON after write
let dir = TempDir::new().unwrap();
let path = dir.path().join("sync-status-test1234.json");
let status = SyncStatusFile::default();
status.write_atomic(&path).unwrap();
// Read back and verify
let loaded = SyncStatusFile::read(&path).unwrap().unwrap();
assert_eq!(loaded.schema_version, 1);
}
#[test]
fn record_run_caps_at_10() {
let mut status = SyncStatusFile::default();
for i in 0..15 {
status.record_run(make_run(i * 1000, "success"));
}
assert_eq!(status.recent_runs.len(), 10);
}
#[test]
fn default_status_has_no_last_run() {
let status = SyncStatusFile::default();
assert!(status.last_run.is_none());
}
// --- Backoff (deterministic via FakeClock + persisted next_retry_at_ms) ---
#[test]
fn backoff_returns_none_when_zero_failures() {
let status = make_status("success", 0, 100_000);
let clock = FakeClock { now_ms: 200_000 };
assert!(status.backoff_remaining(&clock).is_none());
}
#[test]
fn backoff_returns_none_when_no_next_retry() {
let mut status = make_status("failed", 1, 100_000_000);
status.next_retry_at_ms = None;
let clock = FakeClock { now_ms: 200_000_000 };
assert!(status.backoff_remaining(&clock).is_none());
}
#[test]
fn backoff_active_within_window() {
let mut status = make_status("failed", 1, 100_000_000);
status.next_retry_at_ms = Some(100_000_000 + 1_800_000); // 30 min from now
let clock = FakeClock { now_ms: 100_000_000 + 1000 }; // 1s after failure
let remaining = status.backoff_remaining(&clock);
assert!(remaining.is_some());
assert_eq!(remaining.unwrap(), 1799);
}
#[test]
fn backoff_expired() {
let mut status = make_status("failed", 1, 100_000_000);
status.next_retry_at_ms = Some(100_000_000 + 1_800_000);
let clock = FakeClock { now_ms: 100_000_000 + 2_000_000 }; // past retry time
assert!(status.backoff_remaining(&clock).is_none());
}
#[test]
fn set_backoff_persists_next_retry() {
let mut status = make_status("failed", 1, 100_000_000);
let clock = FakeClock { now_ms: 100_000_000 };
let mut rng = FakeJitterRng::new(0.5); // 0.5 for deterministic
status.set_backoff(1800, &clock, &mut rng, None);
assert!(status.next_retry_at_ms.is_some());
// With jitter=0.5, backoff = max(1800*0.5, 1800) = 1800s
let expected_ms = 100_000_000 + 1_800_000;
assert_eq!(status.next_retry_at_ms.unwrap(), expected_ms);
}
#[test]
fn set_backoff_caps_at_4_hours() {
let mut status = make_status("failed", 20, 100_000_000);
let clock = FakeClock { now_ms: 100_000_000 };
let mut rng = FakeJitterRng::new(1.0); // max jitter
status.set_backoff(1800, &clock, &mut rng, None);
// Cap: 4h = 14400s, jitter=1.0: max(14400*1.0, 1800) = 14400
let max_ms = 100_000_000 + 14_400_000;
assert!(status.next_retry_at_ms.unwrap() <= max_ms);
}
#[test]
fn set_backoff_minimum_is_base_interval() {
let mut status = make_status("failed", 1, 100_000_000);
let clock = FakeClock { now_ms: 100_000_000 };
let mut rng = FakeJitterRng::new(0.0); // min jitter
status.set_backoff(1800, &clock, &mut rng, None);
// jitter=0.0: max(1800*0.0, 1800) = 1800 (minimum enforced)
let expected_ms = 100_000_000 + 1_800_000;
assert_eq!(status.next_retry_at_ms.unwrap(), expected_ms);
}
#[test]
fn set_backoff_respects_retry_after_hint() {
let mut status = make_status("failed", 1, 100_000_000);
let clock = FakeClock { now_ms: 100_000_000 };
let mut rng = FakeJitterRng::new(0.0); // min jitter => computed backoff = 1800s
let hint = 100_000_000 + 3_600_000; // server says retry after 1 hour
status.set_backoff(1800, &clock, &mut rng, Some(hint));
// Hint (1h) > computed backoff (30m), so hint wins
assert_eq!(status.next_retry_at_ms.unwrap(), hint);
}
#[test]
fn set_backoff_ignores_hint_when_computed_is_larger() {
let mut status = make_status("failed", 1, 100_000_000);
let clock = FakeClock { now_ms: 100_000_000 };
let mut rng = FakeJitterRng::new(0.0);
let hint = 100_000_000 + 60_000; // server says retry after 1 minute
status.set_backoff(1800, &clock, &mut rng, Some(hint));
// Computed (30m) > hint (1m), so computed wins
let expected_ms = 100_000_000 + 1_800_000;
assert_eq!(status.next_retry_at_ms.unwrap(), expected_ms);
}
#[test]
fn set_backoff_uses_configured_interval_not_hardcoded() {
let mut status1 = make_status("failed", 1, 100_000_000);
let mut status2 = make_status("failed", 1, 100_000_000);
let clock = FakeClock { now_ms: 100_000_000 };
let mut rng = FakeJitterRng::new(0.5);
status1.set_backoff(300, &clock, &mut rng, None); // 5m base
rng.value = 0.5; // reset
status2.set_backoff(3600, &clock, &mut rng, None); // 1h base
// 5m base should produce shorter backoff than 1h base
assert!(status1.next_retry_at_ms.unwrap() < status2.next_retry_at_ms.unwrap());
}
#[test]
fn backoff_skips_when_paused() {
let mut status = make_status("failed", 3, 100_000_000);
status.paused_reason = Some("AUTH_FAILED".to_string());
status.next_retry_at_ms = Some(100_000_000 + 999_999_999);
let clock = FakeClock { now_ms: 100_000_000 + 1000 };
// Paused state is checked separately, backoff_remaining returns None
assert!(status.backoff_remaining(&clock).is_none());
}
// --- Error classification ---
#[test]
fn permanent_errors_classified_correctly() {
assert!(is_permanent_error(&ErrorCode::TokenNotSet));
assert!(is_permanent_error(&ErrorCode::AuthFailed));
assert!(is_permanent_error(&ErrorCode::ConfigNotFound));
assert!(is_permanent_error(&ErrorCode::ConfigInvalid));
assert!(is_permanent_error(&ErrorCode::MigrationFailed));
}
#[test]
fn transient_errors_classified_correctly() {
assert!(!is_permanent_error(&ErrorCode::NetworkError));
assert!(!is_permanent_error(&ErrorCode::RateLimited));
assert!(!is_permanent_error(&ErrorCode::DbLocked));
assert!(!is_permanent_error(&ErrorCode::DbError));
assert!(!is_permanent_error(&ErrorCode::InternalError));
}
// --- Stage-aware outcomes ---
#[test]
fn degraded_outcome_does_not_count_as_failure() {
// When core stages succeed but optional stages fail, consecutive_failures should reset
let mut status = make_status("failed", 3, 100_000_000);
status.next_retry_at_ms = Some(200_000_000);
// Simulate degraded outcome clearing failure state
status.consecutive_failures = 0;
status.next_retry_at_ms = None;
assert_eq!(status.consecutive_failures, 0);
assert!(status.next_retry_at_ms.is_none());
}
// --- Degraded outcome (service run only, NOT manual sync) ---
// (Manually test degraded state by running with --profile full when Ollama is down:
// embeddings should fail, but issues/MRs should succeed)
#[test]
fn service_run_degraded_outcome_clears_failures() {
let mut status = make_status("failed", 3, 100_000_000);
status.consecutive_failures = 3;
status.next_retry_at_ms = Some(200_000_000);
// Simulate degraded outcome clearing failure state
status.consecutive_failures = 0;
status.next_retry_at_ms = None;
assert_eq!(status.consecutive_failures, 0);
assert!(status.next_retry_at_ms.is_none());
}
// --- Circuit breaker ---
#[test]
fn circuit_breaker_trips_at_threshold() {
let mut status = make_status("failed", 9, 100_000_000);
// Incrementing to 10 should trigger circuit breaker
status.consecutive_failures = status.consecutive_failures.saturating_add(1);
assert_eq!(status.consecutive_failures, 10);
// Caller would set paused_reason = "CIRCUIT_BREAKER"
}
// --- Paused state (permanent error) ---
#[test]
fn clear_paused_resets_all_fields() {
let mut status = make_status("failed", 5, 100_000_000);
status.paused_reason = Some("AUTH_FAILED: 401 Unauthorized".to_string());
status.last_error_code = Some("AUTH_FAILED".to_string());
status.last_error_message = Some("401 Unauthorized".to_string());
status.next_retry_at_ms = Some(200_000_000);
status.circuit_breaker_paused_at_ms = Some(100_000_000);
status.clear_paused();
assert!(status.paused_reason.is_none());
assert!(status.circuit_breaker_paused_at_ms.is_none());
assert!(status.last_error_code.is_none());
assert!(status.last_error_message.is_none());
assert!(status.next_retry_at_ms.is_none());
assert_eq!(status.consecutive_failures, 0);
}
#[test]
fn clear_paused_also_clears_circuit_breaker() {
let mut status = make_status("failed", 10, 100_000_000);
status.paused_reason = Some("CIRCUIT_BREAKER: 10 consecutive transient failures".to_string());
status.clear_paused();
assert!(status.paused_reason.is_none());
assert_eq!(status.consecutive_failures, 0);
}
fn make_run(ts_ms: i64, outcome: &str) -> SyncRunRecord {
SyncRunRecord {
timestamp_iso: String::new(),
timestamp_ms: ts_ms,
duration_seconds: 1.0,
outcome: outcome.to_string(),
stage_results: vec![],
error_message: if outcome == "failed" {
Some("test error".into())
} else {
None
},
}
}
fn make_stage_result(stage: &str, success: bool, error_code: Option<&str>) -> StageResult {
StageResult {
stage: stage.to_string(),
success,
items_updated: if success { 5 } else { 0 },
error: if success { None } else { Some("test error".into()) },
error_code: error_code.map(|s| s.to_string()),
}
}
fn make_status(outcome: &str, failures: u32, ts_ms: i64) -> SyncStatusFile {
let run = make_run(ts_ms, outcome);
SyncStatusFile {
schema_version: 1,
updated_at_iso: String::new(),
last_run: Some(run.clone()),
recent_runs: vec![run],
consecutive_failures: failures,
next_retry_at_ms: None,
paused_reason: None,
circuit_breaker_paused_at_ms: None,
last_error_code: None,
last_error_message: None,
current_run: None,
}
}
}
Service Manifest Tests (in src/core/service_manifest.rs)
#[cfg(test)]
mod tests {
use super::*;
use tempfile::TempDir;
#[test]
fn manifest_round_trip() {
let dir = TempDir::new().unwrap();
let path = dir.path().join("manifest.json");
let manifest = ServiceManifest {
schema_version: 1,
service_id: "a1b2c3d4e5f6".to_string(),
workspace_root: "/Users/x/projects/my-project".to_string(),
installed_at_iso: "2026-02-09T10:00:00Z".to_string(),
updated_at_iso: "2026-02-09T10:00:00Z".to_string(),
platform: "launchd".to_string(),
interval_seconds: 900,
profile: "fast".to_string(),
binary_path: "/usr/local/bin/lore".to_string(),
config_path: None,
token_source: "env_file".to_string(),
token_env_var: "GITLAB_TOKEN".to_string(),
service_files: vec!["/Users/x/Library/LaunchAgents/com.gitlore.sync.a1b2c3d4e5f6.plist".to_string()],
sync_command: "/usr/local/bin/lore --robot service run".to_string(),
max_transient_failures: 10,
circuit_breaker_cooldown_seconds: 1800,
spec_hash: "abc123def456".to_string(),
};
manifest.write_atomic(&path).unwrap();
let loaded = ServiceManifest::read(&path).unwrap().unwrap();
assert_eq!(loaded.profile, "fast");
assert_eq!(loaded.interval_seconds, 900);
assert_eq!(loaded.service_id, "a1b2c3d4e5f6");
assert_eq!(loaded.max_transient_failures, 10);
assert_eq!(loaded.circuit_breaker_cooldown_seconds, 1800);
}
#[test]
fn manifest_read_missing_returns_ok_none() {
let dir = TempDir::new().unwrap();
assert!(ServiceManifest::read(&dir.path().join("nope.json")).unwrap().is_none());
}
#[test]
fn manifest_read_corrupt_returns_err() {
let dir = TempDir::new().unwrap();
let path = dir.path().join("bad.json");
std::fs::write(&path, "{{{{").unwrap();
assert!(ServiceManifest::read(&path).is_err());
}
#[test]
fn profile_to_sync_args_fast() {
let m = make_manifest("fast");
assert_eq!(m.profile_to_sync_args(), vec!["--no-docs", "--no-embed"]);
}
#[test]
fn profile_to_sync_args_balanced() {
let m = make_manifest("balanced");
assert_eq!(m.profile_to_sync_args(), vec!["--no-embed"]);
}
#[test]
fn profile_to_sync_args_full() {
let m = make_manifest("full");
assert!(m.profile_to_sync_args().is_empty());
}
#[test]
fn compute_service_id_deterministic() {
let urls = ["https://gitlab.com/group/repo"];
let id1 = compute_service_id(Path::new("/home/user/project"), Path::new("/home/user/.config/lore/config.json"), &urls);
let id2 = compute_service_id(Path::new("/home/user/project"), Path::new("/home/user/.config/lore/config.json"), &urls);
assert_eq!(id1, id2);
assert_eq!(id1.len(), 12);
}
#[test]
fn compute_service_id_different_workspaces() {
let urls = ["https://gitlab.com/group/repo"];
let config = Path::new("/home/user/.config/lore/config.json");
let id1 = compute_service_id(Path::new("/home/user/project-a"), config, &urls);
let id2 = compute_service_id(Path::new("/home/user/project-b"), config, &urls);
assert_ne!(id1, id2); // Same config, different workspace => different IDs
}
#[test]
fn compute_service_id_different_configs() {
let urls = ["https://gitlab.com/group/repo"];
let workspace = Path::new("/home/user/project");
let id1 = compute_service_id(workspace, Path::new("/home/user1/config.json"), &urls);
let id2 = compute_service_id(workspace, Path::new("/home/user2/config.json"), &urls);
assert_ne!(id1, id2);
}
#[test]
fn compute_service_id_different_projects_same_config() {
let workspace = Path::new("/home/user/project");
let config = Path::new("/home/user/.config/lore/config.json");
let id1 = compute_service_id(workspace, config, &["https://gitlab.com/group/repo-a"]);
let id2 = compute_service_id(workspace, config, &["https://gitlab.com/group/repo-b"]);
assert_ne!(id1, id2); // Same config path, different projects => different IDs
}
#[test]
fn compute_service_id_url_order_independent() {
let workspace = Path::new("/home/user/project");
let config = Path::new("/config.json");
let id1 = compute_service_id(workspace, config, &["https://gitlab.com/a", "https://gitlab.com/b"]);
let id2 = compute_service_id(workspace, config, &["https://gitlab.com/b", "https://gitlab.com/a"]);
assert_eq!(id1, id2); // Order should not matter (sorted internally)
}
#[test]
fn sanitize_service_name_valid() {
assert_eq!(sanitize_service_name("my-project").unwrap(), "my-project");
assert_eq!(sanitize_service_name("MyProject").unwrap(), "myproject");
}
#[test]
fn sanitize_service_name_special_chars() {
assert_eq!(sanitize_service_name("my project!").unwrap(), "my-project-");
}
#[test]
fn sanitize_service_name_empty_rejects() {
assert!(sanitize_service_name("---").is_err());
assert!(sanitize_service_name("").is_err());
}
#[test]
fn sanitize_service_name_too_long() {
let long_name = "a".repeat(33);
assert!(sanitize_service_name(&long_name).is_err());
}
fn make_manifest(profile: &str) -> ServiceManifest { /* ... */ }
}
Platform-Specific Unit Tests
// In platform/launchd.rs
#[cfg(test)]
mod tests {
use super::*;
// --- Wrapper script variant (env-file, default) ---
#[test]
fn plist_wrapper_contains_scoped_label() {
let plist = generate_plist_with_wrapper("abc123", Path::new("/data/service-run-abc123.sh"), 1800, Path::new("/tmp/logs"));
assert!(plist.contains("<string>com.gitlore.sync.abc123</string>"));
}
#[test]
fn plist_wrapper_invokes_wrapper_not_lore_directly() {
let plist = generate_plist_with_wrapper("abc123", Path::new("/data/service-run-abc123.sh"), 1800, Path::new("/tmp/logs"));
assert!(plist.contains("<string>/data/service-run-abc123.sh</string>"));
// Should NOT contain direct lore invocation args
assert!(!plist.contains("<string>--robot</string>"));
assert!(!plist.contains("<string>service</string>"));
}
#[test]
fn plist_wrapper_does_not_contain_token() {
let plist = generate_plist_with_wrapper("abc123", Path::new("/data/service-run-abc123.sh"), 1800, Path::new("/tmp/logs"));
assert!(!plist.contains("GITLAB_TOKEN"));
assert!(!plist.contains("glpat"));
}
#[test]
fn plist_wrapper_contains_interval() {
let plist = generate_plist_with_wrapper("abc123", Path::new("/data/service-run-abc123.sh"), 900, Path::new("/tmp/logs"));
assert!(plist.contains("<integer>900</integer>"));
}
// --- Embedded variant ---
#[test]
fn plist_embedded_contains_token() {
let plist = generate_plist_with_embedded("abc123", "/usr/local/bin/lore", None, 1800, "GITLAB_TOKEN", "glpat-xxx", Path::new("/tmp/logs"));
assert!(plist.contains("GITLAB_TOKEN"));
assert!(plist.contains("glpat-xxx"));
}
#[test]
fn plist_embedded_invokes_lore_directly() {
let plist = generate_plist_with_embedded("abc123", "/usr/local/bin/lore", None, 1800, "GITLAB_TOKEN", "glpat-xxx", Path::new("/tmp/logs"));
assert!(plist.contains("<string>--robot</string>"));
assert!(plist.contains("<string>service</string>"));
assert!(plist.contains("<string>run</string>"));
}
#[test]
fn plist_embedded_xml_escapes_token() {
let plist = generate_plist_with_embedded(
"abc123", "/usr/local/bin/lore", None, 1800, "GITLAB_TOKEN", "tok&en<>", Path::new("/tmp/logs"),
);
assert!(plist.contains("tok&amp;en&lt;&gt;"));
assert!(!plist.contains("tok&en<>"));
}
#[test]
fn plist_xml_escapes_paths_with_special_chars() {
let plist = generate_plist_with_embedded(
"abc123", "/Users/O'Brien/bin/lore", None, 1800, "GITLAB_TOKEN", "glpat-xxx",
Path::new("/tmp/logs"),
);
assert!(plist.contains("O&apos;Brien"));
}
// --- Shared plist properties ---
#[test]
fn plist_has_background_process_type() {
let plist = generate_plist_with_wrapper("abc123", Path::new("/data/service-run-abc123.sh"), 1800, Path::new("/tmp/logs"));
assert!(plist.contains("<string>Background</string>"));
assert!(plist.contains("<integer>10</integer>")); // Nice
}
#[test]
fn plist_embedded_includes_config_path_when_provided() {
let plist = generate_plist_with_embedded("abc123", "/usr/local/bin/lore", Some("/custom/config.json"), 1800, "GITLAB_TOKEN", "glpat-xxx", Path::new("/tmp/logs"));
assert!(plist.contains("LORE_CONFIG_PATH"));
assert!(plist.contains("/custom/config.json"));
}
}
// In platform/systemd.rs
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn service_unit_contains_hardening() {
let unit = generate_service("abc123", "/usr/local/bin/lore", None, "GITLAB_TOKEN", "glpat-xxx", "env-file", Path::new("/data"));
assert!(unit.contains("NoNewPrivileges=true"));
assert!(unit.contains("PrivateTmp=true"));
assert!(unit.contains("ProtectSystem=strict"));
assert!(unit.contains("ProtectHome=read-only"));
assert!(unit.contains("TimeoutStartSec=900"));
assert!(unit.contains("WorkingDirectory=/data"));
assert!(unit.contains("SuccessExitStatus=0"));
}
#[test]
fn service_unit_env_file_mode() {
let unit = generate_service("abc123", "/usr/local/bin/lore", None, "GITLAB_TOKEN", "glpat-xxx", "env-file", Path::new("/data"));
assert!(unit.contains("EnvironmentFile=/data/service-env-abc123"));
assert!(!unit.contains("Environment=GITLAB_TOKEN="));
}
#[test]
fn service_unit_embedded_mode() {
let unit = generate_service("abc123", "/usr/local/bin/lore", None, "GITLAB_TOKEN", "glpat-xxx", "embedded", Path::new("/data"));
assert!(unit.contains("Environment=GITLAB_TOKEN=glpat-xxx"));
assert!(!unit.contains("EnvironmentFile="));
}
#[test]
fn timer_unit_contains_scoped_description() {
let timer = generate_timer("abc123", 900);
assert!(timer.contains("abc123"));
assert!(timer.contains("OnUnitInactiveSec=900s"));
}
}
Integration Tests (CLI parsing)
// In service/mod.rs
#[cfg(test)]
mod tests {
use clap::Parser;
use crate::cli::{Cli, Commands, ServiceCommand};
#[test]
fn parse_service_install_default() {
let cli = Cli::try_parse_from(["lore", "service", "install"]).unwrap();
match cli.command {
Some(Commands::Service { command: ServiceCommand::Install { interval, profile, token_source, name } }) => {
assert_eq!(interval, "30m");
assert_eq!(profile, "balanced");
assert_eq!(token_source, "env-file");
assert!(name.is_none());
}
_ => panic!("Expected Service Install"),
}
}
#[test]
fn parse_service_install_all_flags() {
let cli = Cli::try_parse_from([
"lore", "service", "install",
"--interval", "1h",
"--profile", "fast",
"--token-source", "embedded",
"--name", "my-project",
]).unwrap();
match cli.command {
Some(Commands::Service { command: ServiceCommand::Install { interval, profile, token_source, name } }) => {
assert_eq!(interval, "1h");
assert_eq!(profile, "fast");
assert_eq!(token_source, "embedded");
assert_eq!(name.as_deref(), Some("my-project"));
}
_ => panic!("Expected Service Install"),
}
}
#[test]
fn parse_service_uninstall() {
let cli = Cli::try_parse_from(["lore", "service", "uninstall"]).unwrap();
assert!(matches!(
cli.command,
Some(Commands::Service { command: ServiceCommand::Uninstall })
));
}
#[test]
fn parse_service_status() {
let cli = Cli::try_parse_from(["lore", "service", "status"]).unwrap();
assert!(matches!(
cli.command,
Some(Commands::Service { command: ServiceCommand::Status })
));
}
#[test]
fn parse_service_logs_default() {
let cli = Cli::try_parse_from(["lore", "service", "logs"]).unwrap();
assert!(matches!(
cli.command,
Some(Commands::Service { command: ServiceCommand::Logs { .. } })
));
}
#[test]
fn parse_service_logs_with_tail() {
let cli = Cli::try_parse_from(["lore", "service", "logs", "--tail", "50"]).unwrap();
match cli.command {
Some(Commands::Service { command: ServiceCommand::Logs { tail, .. } }) => assert_eq!(tail, 50),
_ => panic!("Expected Service Logs"),
}
}
#[test]
fn parse_service_resume() {
let cli = Cli::try_parse_from(["lore", "service", "resume"]).unwrap();
assert!(matches!(
cli.command,
Some(Commands::Service { command: ServiceCommand::Resume })
));
}
#[test]
fn parse_service_doctor() {
let cli = Cli::try_parse_from(["lore", "service", "doctor"]).unwrap();
assert!(matches!(
cli.command,
Some(Commands::Service { command: ServiceCommand::Doctor { .. } })
));
}
#[test]
fn parse_service_doctor_offline() {
let cli = Cli::try_parse_from(["lore", "service", "doctor", "--offline"]).unwrap();
match cli.command {
Some(Commands::Service { command: ServiceCommand::Doctor { offline, .. } }) => assert!(offline),
_ => panic!("Expected Service Doctor"),
}
}
#[test]
fn parse_service_run_hidden() {
let cli = Cli::try_parse_from(["lore", "service", "run", "--service-id", "abc123"]).unwrap();
assert!(matches!(
cli.command,
Some(Commands::Service { command: ServiceCommand::Run { .. } })
));
}
}
Behavioral Tests (service run isolation)
// Verify that manual sync path is NOT affected by service state
#[test]
fn manual_sync_ignores_backoff_state() {
// Create a status file with active backoff
let dir = TempDir::new().unwrap();
let status_path = dir.path().join("sync-status-test1234.json");
let mut status = make_status("failed", 5, chrono::Utc::now().timestamp_millis());
status.next_retry_at_ms = Some(chrono::Utc::now().timestamp_millis() + 999_999_999);
status.write_atomic(&status_path).unwrap();
// handle_sync_cmd should NOT read this file at all
// (verified by the absence of any backoff check in handle_sync_cmd)
}
// Verify service run respects paused state
#[test]
fn service_run_respects_paused_state() {
let mut status = SyncStatusFile::default();
status.paused_reason = Some("AUTH_FAILED".to_string());
// handle_service_run should check paused_reason BEFORE backoff
// and exit with action: "paused"
}
// Verify degraded outcome clears failure counter
#[test]
fn service_run_degraded_clears_failures() {
let mut status = make_status("failed", 3, 100_000_000);
status.next_retry_at_ms = Some(200_000_000);
// After a degraded run (core OK, optional failed):
status.consecutive_failures = 0;
status.next_retry_at_ms = None;
assert_eq!(status.consecutive_failures, 0);
}
// Verify circuit breaker trips at threshold
#[test]
fn service_run_circuit_breaker_trips() {
let mut status = make_status("failed", 9, 100_000_000);
status.consecutive_failures = status.consecutive_failures.saturating_add(1);
// At 10 failures, should set paused_reason
if status.consecutive_failures >= 10 {
status.paused_reason = Some("CIRCUIT_BREAKER".to_string());
}
assert!(status.paused_reason.is_some());
}
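The behavioral contracts above (paused state checked before backoff, backoff window skipping a scheduled run) imply a small decision function at the top of `service run`. A minimal sketch under those contracts — struct fields and names here are assumptions for illustration, not the plan's final API:

```rust
// Minimal subset of the status file relevant to the run decision (assumed fields).
struct SyncStatusFile {
    paused_reason: Option<String>,
    next_retry_at_ms: Option<i64>,
}

#[derive(Debug, PartialEq)]
enum RunAction {
    Paused,
    Skipped,
    Run,
}

fn decide_action(status: &SyncStatusFile, now_ms: i64) -> RunAction {
    // Paused state (manual pause, AUTH_FAILED, CIRCUIT_BREAKER) wins over backoff.
    if status.paused_reason.is_some() {
        return RunAction::Paused;
    }
    // Still inside the backoff window: skip this scheduled run.
    if let Some(retry_at) = status.next_retry_at_ms {
        if now_ms < retry_at {
            return RunAction::Skipped;
        }
    }
    RunAction::Run
}

fn main() {
    let paused = SyncStatusFile {
        paused_reason: Some("CIRCUIT_BREAKER".into()),
        next_retry_at_ms: Some(i64::MAX),
    };
    assert_eq!(decide_action(&paused, 0), RunAction::Paused);

    let backing_off = SyncStatusFile { paused_reason: None, next_retry_at_ms: Some(2_000) };
    assert_eq!(decide_action(&backing_off, 1_000), RunAction::Skipped);
    assert_eq!(decide_action(&backing_off, 3_000), RunAction::Run);
}
```

Keeping this as a pure function over the status snapshot makes the ordering (paused, then backoff, then run) directly unit-testable, which is exactly what the behavioral tests above exercise.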
New Dependencies
Two new crates:
| Crate | Version | Purpose | Justification |
|---|---|---|---|
| `sha2` | 0.10 | Compute `service_id` from config path | Small, well-audited, no-std compatible. Used for exactly one hash computation. |
| `hex` | 0.4 | Encode hash bytes to hex string | Tiny utility, widely used. |
Note on `rand`: the `JitterRng` trait uses `rand::thread_rng()` in production. Check whether `rand` is already a transitive dependency (via other crates). If so, add it as a direct dependency. If not, consider using a simpler PRNG or system randomness via `getrandom` to avoid pulling in the full `rand` crate for a single call site. The `JitterRng` trait abstracts this, so the implementation can change without affecting the API.
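If `rand` is dropped, a dependency-free fallback is enough for backoff jitter, which only needs "not synchronized across machines", not cryptographic quality. A sketch (type and method names are assumptions, not the plan's `JitterRng` API):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Assumed shape of the jitter abstraction: yields a factor in [0.0, 1.0).
trait JitterRng {
    fn next_fraction(&mut self) -> f64;
}

// PRNG-free fallback: splitmix64 steps seeded from the system clock.
struct ClockJitter {
    state: u64,
}

impl ClockJitter {
    fn new() -> Self {
        let seed = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_nanos() as u64)
            .unwrap_or(0x9e37_79b9_7f4a_7c15);
        Self { state: seed }
    }
}

impl JitterRng for ClockJitter {
    fn next_fraction(&mut self) -> f64 {
        // One splitmix64 step: cheap and statistically fine for jitter.
        self.state = self.state.wrapping_add(0x9e37_79b9_7f4a_7c15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
        z ^= z >> 31;
        // Top 53 bits -> uniform float in [0, 1).
        (z >> 11) as f64 / (1u64 << 53) as f64
    }
}

fn main() {
    let mut rng = ClockJitter::new();
    for _ in 0..1_000 {
        let f = rng.next_fraction();
        assert!((0.0..1.0).contains(&f));
    }
    println!("jitter ok");
}
```

Because callers only see the trait, swapping this for `rand::thread_rng()` (or `getrandom`) later is a one-file change.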
Existing dependencies used:
- `std::process::Command` — for `launchctl`, `systemctl`, `schtasks`
- `format!()` — for plist XML and systemd unit templates
- `std::env::current_exe()` — for binary path resolution
- `serde` + `serde_json` (existing) — for status/manifest files
- `chrono` (existing) — for timestamps
- `dirs` (existing) — for home directory
- `libc` (existing, unix only) — for `getuid()`
- `console` (existing) — for colored human output
- `tempfile` (existing, dev dep) — for test temp dirs
Implementation Order
Phase 1: Core types (standalone, fully testable)
- `Cargo.toml` — add `sha2`, `hex` dependencies (and `rand` if not already transitive)
- `src/core/sync_status.rs` — `SyncRunRecord`, `StageResult` (with `error_code`), `SyncStatusFile` (with `circuit_breaker_paused_at_ms`, `current_run`), `CurrentRunState`, `Clock` trait, `JitterRng` trait, `parse_interval`, `is_permanent_error`, `is_permanent_stage_error`, `is_circuit_breaker_half_open`, `extract_retry_after_hint`, atomic write helper, schema migration on read, all unit tests
- `src/core/service_manifest.rs` — `ServiceManifest` (with `circuit_breaker_cooldown_seconds`, `workspace_root`, `spec_hash`), `DiagnosticCheck`, `DiagnosticStatus`, `compute_service_id(workspace_root, config_path, project_urls)`, `sanitize_service_name`, `compute_spec_hash(service_files_content)`, profile mapping, atomic write helper, schema migration on read, unit tests
- `src/core/error.rs` — add `ServiceError`, `ServiceUnsupported`, `ServiceCommandFailed`, `ServiceCorruptState`
- `src/core/paths.rs` — add `get_service_status_path(service_id)`, `get_service_manifest_path(service_id)`, `get_service_env_path(service_id)`, `get_service_wrapper_path(service_id)`, `get_service_log_path(service_id, stream)`, `list_service_ids()`
- `src/core/mod.rs` — add `pub mod sync_status; pub mod service_manifest;`
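The `parse_interval` helper in `sync_status.rs` can be sketched with the standard library alone. This is a hedged sketch: the accepted suffixes (`s`, `m`, `h`) are inferred from the `--interval 30m` / `--interval 1h` examples elsewhere in this plan, not a final specification:

```rust
// Parse "900s", "30m", or "1h" into seconds (assumed accepted forms).
fn parse_interval(s: &str) -> Result<u64, String> {
    let s = s.trim();
    if s.len() < 2 || !s.is_ascii() {
        return Err(format!("invalid interval: {s:?}"));
    }
    // Split off the one-character unit suffix.
    let (num, unit) = s.split_at(s.len() - 1);
    let n: u64 = num
        .parse()
        .map_err(|_| format!("invalid number in interval: {s:?}"))?;
    match unit {
        "s" => Ok(n),
        "m" => Ok(n * 60),
        "h" => Ok(n * 3600),
        other => Err(format!("unknown interval unit: {other:?}")),
    }
}

fn main() {
    assert_eq!(parse_interval("30m"), Ok(1800));
    assert_eq!(parse_interval("1h"), Ok(3600));
    assert_eq!(parse_interval("900s"), Ok(900));
    assert!(parse_interval("5x").is_err());
    assert!(parse_interval("").is_err());
}
```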
Phase 2: Platform backends (parallelizable across platforms)
- `src/cli/commands/service/platform/mod.rs` — dispatch functions (with `service_id`), `run_cmd` (with kill+reap on timeout), `wait_with_timeout_kill_and_reap`, `xml_escape`, `write_token_env_file`, `write_wrapper_script`, `write_atomic`, `check_prerequisites`
- `src/cli/commands/service/platform/launchd.rs` — macOS backend with wrapper script (env-file) and embedded variants, project-scoped label + prerequisite checks + tests
- `src/cli/commands/service/platform/systemd.rs` — Linux backend with hardened unit (`WorkingDirectory`, `SuccessExitStatus`), project-scoped names, linger/user-manager checks + tests
- `src/cli/commands/service/platform/schtasks.rs` — Windows backend with project-scoped task name
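The `write_atomic` helper in `platform/mod.rs` might look like the following sketch: write to a temp file in the same directory, fsync, rename over the target, then fsync the parent directory (the rec-6 improvement accepted below). Exact signature and temp-file naming are assumptions:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

// Crash-safe replace: readers see either the old or the new contents, never a mix.
fn write_atomic(path: &Path, contents: &[u8]) -> std::io::Result<()> {
    let dir = path.parent().unwrap_or_else(|| Path::new("."));
    let tmp = dir.join(format!(
        ".{}.tmp",
        path.file_name().and_then(|n| n.to_str()).unwrap_or("atomic")
    ));
    {
        let mut f = File::create(&tmp)?;
        f.write_all(contents)?;
        f.sync_all()?; // flush data before the rename makes it visible
    }
    fs::rename(&tmp, path)?; // atomic on POSIX within one filesystem
    // Best-effort fsync of the parent dir so the rename survives a crash
    // (directory opens fail on some platforms, e.g. Windows, hence best-effort).
    if let Ok(d) = File::open(dir) {
        let _ = d.sync_all();
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("lore-write-atomic-demo.json");
    write_atomic(&path, b"{\"ok\":true}")?;
    assert_eq!(fs::read(&path)?, b"{\"ok\":true}");
    fs::remove_file(&path)?;
    Ok(())
}
```

The temp file living in the same directory as the target matters: `rename` is only atomic when source and destination are on the same filesystem.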
Phase 3: Command handlers
- `src/cli/commands/service/doctor.rs` — pre-flight diagnostic checks (used by install and standalone)
- `src/cli/commands/service/install.rs` — install handler with transactional ordering (enable then manifest), wrapper script generation, doctor pre-flight, `service_id`
- `src/cli/commands/service/uninstall.rs` — uninstall handler with `--service`/`--all` selectors (removes manifest + env file + wrapper script)
- `src/cli/commands/service/list.rs` — list handler (scans data_dir for manifests, verifies platform state)
- `src/cli/commands/service/status.rs` — status handler with scheduler state including `degraded` and `half_open`
- `src/cli/commands/service/logs.rs` — logs handler with default tail output, `--open` for editor, `--follow`, log rotation check
- `src/cli/commands/service/resume.rs` — resume handler (clears paused + circuit breaker)
- `src/cli/commands/service/pause.rs` — pause handler (sets manual pause reason)
- `src/cli/commands/service/trigger.rs` — trigger handler (immediate run with optional backoff bypass)
- `src/cli/commands/service/repair.rs` — repair handler (backup corrupt files, reinitialize)
- `src/cli/commands/service/run.rs` — hidden scheduled execution entrypoint with stage-aware execution, circuit breaker, half-open probe, log rotation
- `src/cli/commands/service/mod.rs` — re-exports + `resolve_service_id` helper
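The `resolve_service_id` helper in `mod.rs` could follow a simple rule: an explicit `--service-id` always wins; otherwise a single installed service is unambiguous, and anything else is an error asking the user to disambiguate. A sketch under that assumed rule (signature is illustrative):

```rust
// Resolve which service a command targets, given the optional flag and the
// list of installed service IDs (from scanning manifests in data_dir).
fn resolve_service_id(flag: Option<&str>, installed: &[String]) -> Result<String, String> {
    if let Some(id) = flag {
        return Ok(id.to_string()); // explicit flag always wins
    }
    match installed {
        [only] => Ok(only.clone()),
        [] => Err("no services installed".to_string()),
        _ => Err(format!(
            "{} services installed; pass --service-id to disambiguate",
            installed.len()
        )),
    }
}

fn main() {
    assert_eq!(
        resolve_service_id(Some("abc123"), &[]),
        Ok("abc123".to_string())
    );
    assert_eq!(
        resolve_service_id(None, &["only".to_string()]),
        Ok("only".to_string())
    );
    assert!(resolve_service_id(None, &[]).is_err());
    assert!(resolve_service_id(None, &["a".to_string(), "b".to_string()]).is_err());
}
```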
Phase 4: CLI wiring
- `src/cli/mod.rs` — `ServiceCommand` in `Commands` enum (with all new subcommands and flags)
- `src/cli/commands/mod.rs` — `pub mod service;`
- `src/main.rs` — dispatch + pipeline lock in `handle_sync_cmd` + robot-docs manifest
- `src/cli/autocorrect.rs` — add service entry with all flags
Phase 5: Verification
cargo check --all-targets && cargo clippy --all-targets -- -D warnings && cargo test && cargo fmt --check
Verification Checklist
# Build and lint
cargo check --all-targets
cargo clippy --all-targets -- -D warnings
cargo fmt --check
# Run all tests
cargo test
# --- Doctor (run first to verify prerequisites) ---
cargo run --release -- service doctor
cargo run --release -- -J service doctor | jq '.data.overall' # should show "pass" or "warn"
cargo run --release -- -J service doctor --offline | jq .
cargo run --release -- -J service doctor --fix | jq '.data.checks[] | select(.status == "fixed")'
# --- Dry-run install (should write nothing) ---
cargo run --release -- -J service install --interval 15m --profile fast --dry-run | jq '.data.dry_run' # true
launchctl list | grep gitlore # should NOT be present
# --- Install (macOS) ---
cargo run --release -- service install --interval 15m --profile fast
launchctl list | grep gitlore
cargo run --release -- -J service status | jq '.data.service_id' # should show hash
cargo run --release -- service logs --tail 5
cargo run --release -- service uninstall
launchctl list | grep gitlore # should be gone
# Verify install with custom name
cargo run --release -- service install --interval 30m --name my-project
launchctl list | grep gitlore # should show com.gitlore.sync.my-project
cargo run --release -- -J service status | jq '.data.service_id' # "my-project"
cargo run --release -- service uninstall
# Verify install idempotency
cargo run --release -- -J service install --interval 30m
cargo run --release -- -J service install --interval 30m # should report no_change: true
cargo run --release -- -J service install --interval 15m # should report changes
cargo run --release -- service uninstall
# --- Service run (use `service trigger` for manual testing, or provide --service-id) ---
cargo run --release -- -J service install --interval 30m
SVC_ID=$(cargo run --release -- -J service status | jq -r '.data.service_id')
cargo run --release -- -J service trigger # preferred way to manually invoke a service run
cargo run --release -- -J service status | jq '.data.recent_runs' # should show the run
cargo run --release -- -J service status | jq '.data.last_sync.stage_results' # per-stage outcomes
# --- Stage-aware outcomes ---
# (Test degraded state by running with --profile full when Ollama is down)
# Embeddings should fail, but issues/MRs should succeed
cargo run --release -- -J service install --profile full
# Stop Ollama, then:
cargo run --release -- -J service run --service-id $SVC_ID | jq '.data.outcome' # "degraded"
cargo run --release -- -J service status | jq '.data.scheduler_state' # "degraded"
# --- Backoff (service run only, NOT manual sync) ---
# 1. Create a status file simulating failures
cat > ~/.local/share/lore/sync-status-a1b2c3d4e5f6.json << 'EOF'
{
"schema_version": 1,
"updated_at_iso": "2026-02-09T10:00:00Z",
"last_run": {"timestamp_iso":"2026-02-09T10:00:00Z","timestamp_ms":TIMESTAMP,"duration_seconds":1.0,"outcome":"failed","stage_results":[],"error_message":"test"},
"recent_runs": [],
"consecutive_failures": 3,
"next_retry_at_ms": FUTURE_MS,
"paused_reason": null,
"last_error_code": null,
"last_error_message": null,
"circuit_breaker_paused_at_ms": null
}
EOF
# Replace timestamps: sed -i '' "s/TIMESTAMP/$(date +%s)000/;s/FUTURE_MS/$(($(date +%s)*1000 + 3600000))/" ~/.local/share/lore/sync-status-a1b2c3d4e5f6.json
# 2. Service run should skip (backoff)
cargo run --release -- -J service run --service-id $SVC_ID | jq '.data.action' # "skipped"
# 3. Manual sync should NOT be affected
cargo run --release -- sync # should proceed normally
# --- Paused state (permanent error) ---
cat > ~/.local/share/lore/sync-status-a1b2c3d4e5f6.json << 'EOF'
{
"schema_version": 1,
"updated_at_iso": "2026-02-09T10:00:00Z",
"last_run": {"timestamp_iso":"2026-02-09T10:00:00Z","timestamp_ms":0,"duration_seconds":1.0,"outcome":"failed","stage_results":[],"error_message":"401 Unauthorized"},
"recent_runs": [],
"consecutive_failures": 1,
"next_retry_at_ms": null,
"paused_reason": "AUTH_FAILED: 401 Unauthorized",
"last_error_code": "AUTH_FAILED",
"last_error_message": "401 Unauthorized",
"circuit_breaker_paused_at_ms": null
}
EOF
# Service run should report paused
cargo run --release -- -J service run --service-id $SVC_ID | jq '.data.action' # "paused"
cargo run --release -- -J service status | jq '.data.paused_reason' # "AUTH_FAILED"
# Resume clears the state
cargo run --release -- -J service resume | jq . # clears circuit breaker
# --- Circuit breaker ---
cat > ~/.local/share/lore/sync-status-a1b2c3d4e5f6.json << 'EOF'
{
"schema_version": 1,
"updated_at_iso": "2026-02-09T10:00:00Z",
"last_run": {"timestamp_iso":"2026-02-09T10:00:00Z","timestamp_ms":0,"duration_seconds":1.0,"outcome":"failed","stage_results":[],"error_message":"connection refused"},
"recent_runs": [],
"consecutive_failures": 10,
"next_retry_at_ms": null,
"paused_reason": "CIRCUIT_BREAKER: 10 consecutive transient failures",
"last_error_code": "TRANSIENT",
"last_error_message": "connection refused",
"circuit_breaker_paused_at_ms": 1770609000000
}
EOF
cargo run --release -- -J service run --service-id $SVC_ID | jq '.data.action' # "paused"
cargo run --release -- -J service status | jq '.data.paused_reason' # "CIRCUIT_BREAKER"
cargo run --release -- -J service resume | jq . # clears circuit breaker
# --- Robot mode for all commands ---
cargo run --release -- -J service install --interval 30m | jq .
cargo run --release -- -J service list | jq .
cargo run --release -- -J service status | jq .
cargo run --release -- -J service logs --tail 10 | jq .
cargo run --release -- -J service doctor | jq .
cargo run --release -- -J service pause --reason "test" | jq .
cargo run --release -- -J service resume | jq .
cargo run --release -- -J service trigger | jq .
cargo run --release -- -J service repair | jq .
cargo run --release -- -J service uninstall | jq .
# --- New operational commands ---
cargo run --release -- -J service install --interval 30m
cargo run --release -- -J service pause --reason "maintenance"
cargo run --release -- -J service status | jq '.data.scheduler_state' # "paused"
cargo run --release -- -J service run --service-id $SVC_ID | jq '.data.action' # "paused"
cargo run --release -- -J service resume | jq .
cargo run --release -- -J service trigger | jq . # immediate sync
cargo run --release -- -J service list | jq '.data.services'
cargo run --release -- service uninstall
# --- Token env file security (macOS/Linux) ---
cargo run --release -- service install --interval 30m
ls -la ~/.local/share/lore/service-env-* # should show -rw------- permissions
# On macOS, verify wrapper script exists and token NOT in plist:
ls -la ~/.local/share/lore/service-run-* # should show -rwx------ permissions
grep -c GITLAB_TOKEN ~/Library/LaunchAgents/com.gitlore.sync.*.plist # should be 0 (env-file mode)
cargo run --release -- service uninstall
ls ~/.local/share/lore/service-env-* # should be gone (uninstall removes it)
ls ~/.local/share/lore/service-run-* # should be gone (uninstall removes wrapper)
# --- Manifest persistence ---
cargo run --release -- service install --interval 15m --profile full
cat ~/.local/share/lore/service-manifest-*.json | jq . # should show manifest with service_id
cargo run --release -- service uninstall
ls ~/.local/share/lore/service-manifest-* # should be gone
# --- Logs with tail/follow ---
cargo run --release -- service install --interval 30m
cargo run --release -- -J service run --service-id $SVC_ID # generate some log output
cargo run --release -- service logs --tail 20 # show last 20 lines
# cargo run --release -- service logs --follow # (interactive — Ctrl-C to stop)
# --- Uninstall cleanup ---
cargo run --release -- service install --interval 30m
cargo run --release -- -J service uninstall | jq '.data.removed_files'
# Verify status file and logs are kept
ls ~/.local/share/lore/sync-status-*.json # should exist
ls ~/.local/share/lore/logs/ # should exist
# --- Repair command ---
# Corrupt a status file to test repair
echo "{{{" > ~/.local/share/lore/sync-status-test.json
cargo run --release -- -J service repair | jq . # should backup and reinitialize
# --- Final cleanup ---
cargo run --release -- service uninstall 2>/dev/null
rm -f ~/.local/share/lore/sync-status-*.json
Rejected Recommendations
Recommendations from external reviewers that were considered and explicitly rejected. Kept here to prevent re-proposal.
- Unified `SyncOrchestrator` for manual and scheduled sync (feedback-4, rec 4) — rejected because manual and scheduled sync have fundamentally different policies (backoff/circuit-breaker vs. none). A shared orchestrator adds abstraction without clear benefit. The current approach (separate paths with a shared pipeline lock) is simpler, correct, and avoids coupling the manual path to service-layer concerns. The two paths share the sync pipeline implementation itself; only the policy wrapper differs.
- `auto` token strategy with secure-store (Keychain / libsecret / Credential Manager) as default (feedback-2 rec 2, feedback-4 rec 7) — rejected because adding platform-specific secure-store dependencies (`security-framework`, `libsecret`, `winapi`) is heavy for v1. The wrapper-script approach (already in the plan) keeps the token out of the plist safely on macOS. The plan notes secure-store as a future enhancement. The token validation fix (rejecting NUL/newline) from feedback-4 rec 7 was accepted separately.
- Store service state in SQLite instead of a JSON status file (feedback-1, rec 2) — rejected because the status file is intentionally independent of the database. This avoids coupling the service lifecycle to DB migrations, enables service operation when the DB is locked or corrupt, and keeps the service layer self-contained. The JSON file approach with atomic writes is adequate for single-writer status tracking.
- `write_seq` and `content_sha256` integrity fields in manifest/status files (feedback-4, rec 6 partial) — rejected because this is over-engineering for a status file written by a single process with atomic writes. The `service repair` command already handles corrupt files by backup+reinit. The `fsync(parent_dir)` improvement from rec 6 was accepted separately.
- Use the `nix` crate for safe UID access (feedback-4, rec 8 partial) — rejected as a mandatory dependency because `getuid()` is trivially safe (no pointers, no mutation) and adding `nix` for a single call is disproportionate. A single-line safe wrapper with `#[allow(unsafe_code)]` is sufficient. If `nix` is already a dependency for other reasons, using it is fine.
- Mandatory dual-lock acquisition with strict ordering for uninstall/run races (feedback-5, rec 2) — rejected because the existing plan already has an admin lock for destructive ops and a pipeline lock for runs. The race window (scheduler fires during uninstall) is tiny, the consequence is benign (the service runs, finds no manifest, exits 0), and mandatory lock ordering with dual acquisition adds significant complexity. The plan's existing separation (admin lock for state mutations, pipeline lock for data writes) is sufficient.
- Decoupled optional stage cadence from the core sync interval (feedback-5, rec 4) — rejected because separate freshness windows per stage (e.g., "docs every 60m, embeddings every 6h") add significant complexity: new config fields per stage, per-stage last-success tracking, skip logic, and confusing profile semantics. The existing profile system already solves this more simply: use `fast` for frequent intervals (issues + MRs only), `balanced` or `full` for less frequent intervals that include heavier stages.
- Windows env-file parity via wrapper script (feedback-5, rec 5) — rejected because Windows Task Scheduler has fundamentally different environment handling than launchd/systemd. A wrapper `.cmd` or `.ps1` script introduces fragility (quoting, encoding, UAC edge cases, PowerShell execution policy) for marginal benefit. The current `system_env` approach is honest, works reliably, and Windows users are accustomed to system environment variables. Future Credential Manager integration (already noted as deferred) is the right long-term solution.
- `--regenerate` flag on `service repair` (feedback-5, rec 7 partial) — rejected because `lore service install` is already idempotent (it detects an existing manifest and overwrites it if the config differs). Regenerating scheduler artifacts is exactly what a re-install does. Adding `--regenerate` to repair creates a confusing second path to the same outcome. The `spec_hash` drift detection (accepted from this rec) gives users clear diagnostics; the remedy is simply `lore service install`.