diff --git a/BACKBURNER.md b/BACKBURNER.md new file mode 100644 index 0000000..dd73ba9 --- /dev/null +++ b/BACKBURNER.md @@ -0,0 +1,10 @@ +# Backburner Ideas + +Last updated: February 23, 2026 + +## Deferred + +- Separate pricing semantics for subscription vs API usage: + - `cburn` should model Claude subscription economics independently from Admin/API token pricing. + - Admin Cost API integration is still valuable for API-key workflows, but should not be treated as the canonical source for subscription usage. + - Revisit after daemon/event infrastructure work stabilizes. diff --git a/CEO_PITCH_DECKS.md b/CEO_PITCH_DECKS.md new file mode 100644 index 0000000..7794176 --- /dev/null +++ b/CEO_PITCH_DECKS.md @@ -0,0 +1,236 @@ +# CEO Meeting Script: Four Daemon-Enabled Product Bets + +## Opening +Today I want to walk through four product bets that become possible because we now have a continuously running daemon that can observe usage as it happens, not just after the fact. + +I am not presenting speculative AI magic. I am presenting four concrete products, each with a hard-nosed view of utility, feasibility, risk, and build path. + +The four bets are: +1. Cognitive Flight Recorder +2. Runaway Loop Quencher +3. Session Archeology Engine +4. Latent Tool ROI Scanner + +My recommendation is not “build all at once.” My recommendation is staged execution with clear kill criteria. + +--- + +## 1) Cognitive Flight Recorder + +### The pitch +If our AI spend spikes tomorrow, leadership will ask two questions immediately: what happened, and why did it happen. Right now, we can answer “how much,” but we cannot reliably answer “why.” + +The Cognitive Flight Recorder solves that. It turns a costly session into a replayable incident timeline. Not a dashboard snapshot. A sequence: where cost accelerated, where model behavior changed, where cache efficiency collapsed, and where the session crossed from productive to expensive. + +This product creates operational trust. When AI systems are expensive, trust depends on explainability under stress. + +### Why we need it +Postmortems are currently slow and anecdotal. Engineers reconstruct stories manually. Finance gets numbers without causality. Leadership gets noise. + +A Flight Recorder makes AI spend investigable the same way we investigate reliability incidents. That is a major enterprise unlock. + +### How it works +The daemon continuously captures session telemetry and emits timeline events. The recorder layers on top: +- It builds per-session event streams with timestamps, token deltas, model transitions, and cache transitions. +- It detects inflection points where cost trajectory changes materially. +- It generates a concise incident report: “what changed,” “likely causes,” and “which preventive policy would have helped.” + +The output is not just visual. It is operational: replay + evidence + recommended guardrail. + +### Downstream effects +If we execute this well, we get: +- Faster incident resolution for spend spikes. +- Better policy tuning because we can pinpoint the moment of failure. +- Stronger executive confidence in scaling agent usage. +- A compelling enterprise story: “we can explain every anomaly.” + +### Skeptical view +Is it actually useful? Only if it leads to action. A pretty timeline that no one uses is dead weight. + +Is it feasible with current harnesses? Partially. We can do strong metadata-level replay now. Deep semantic replay depends on richer telemetry and raises privacy concerns. + +The critical risk is false causality: users may confuse sequence with cause. We mitigate this by attaching confidence levels and explicit evidence for every claim. + +### Implementation roadmap +Phase 1 (2-3 weeks): metadata replay and “Top Cost Incidents” report. +Phase 2 (3-5 weeks): inflection detection and root-cause ranking with confidence scoring. +Phase 3 (4+ weeks): optional deep replay, privacy controls, and incident workflow integrations. + +### Decision rule +Proceed if incident reports result in measurable policy changes. Kill or narrow if they remain passive observability artifacts. + +--- + +## 8) Runaway Loop Quencher + +### The pitch +Most bad AI spend is not one bad call. It is a loop: repeated expensive behavior with little progress. If we only detect this after the session, we are too late. + +The Runaway Loop Quencher is an active safety layer. It watches live telemetry, identifies likely runaway patterns, and intervenes before the burn compounds. + +This is the direct path to cost containment at runtime. + +### Why we need it +Without active containment, scaling agent autonomy is financially unsafe. Teams become conservative. Leaders reduce usage. Innovation slows. + +If we can intervene mid-flight, we convert catastrophic sessions into manageable sessions. + +### How it works +The daemon computes rolling risk signals: +- accelerating cost per minute +- repetitive call signatures +- degrading cache performance +- high token growth with weak progress proxies + +A policy engine converts those signals into action tiers: +- Soft: alert and suggest a reset strategy +- Guarded: require confirmation before continuing expensive patterns +- Hard: stop execution for supported harnesses + +We start advisory-first, then move toward control where integrations allow. + +### Downstream effects +If successful: +- fewer runaway incidents +- lower variance in daily spend +- greater confidence in letting agents run longer on valuable tasks +- ability to define budget safety SLOs + +### Skeptical view +Is it actually useful? Yes, but only if precision is good. High false positives will cause immediate distrust and disablement. + +Is it feasible given harness reality? Detection and alerting are feasible now. Hard-stop control is integration-dependent and not universally available. + +The hard technical challenge is “progress.” We can estimate risk, but progress is not always machine-observable. That means we should not over-automate too early. + +### Implementation roadmap +Phase 1 (2-4 weeks): risk scoring, alerting, and daemon risk endpoint. +Phase 2 (4-6 weeks): human confirmation gates and cooldown policies. +Phase 3 (6+ weeks): optional hard-stop integrations and policy simulation. + +### Decision rule +Ship only if we can keep false positives low enough that teams keep it enabled. If intervention is frequently wrong, this product should remain advisory. + +--- + +## 9) Session Archeology Engine + +### The pitch +Right now we can tell teams they spent too much. We cannot tell them which recurring behavior patterns caused it. + +The Session Archeology Engine classifies sessions into behavioral archetypes and ties each archetype to practical intervention playbooks. + +This turns raw telemetry into behavior change. + +### Why we need it +People do not improve from aggregate numbers. They improve from named patterns and concrete alternatives. + +If we can say, “These two session archetypes account for most avoidable spend, and here is exactly how to run them differently,” we create durable cost literacy. + +### How it works +We extract session-level feature vectors: +- session shape and duration profile +- token composition and burstiness +- cache behavior +- model mix and switch behavior +- retry and repetition patterns + +We cluster sessions and assign human-readable archetypes, then connect each archetype to: +- likely waste mechanism +- recommended policy/routing pattern +- suggested prompt and workflow changes + +The output is both analytical and prescriptive. + +### Downstream effects +If this works: +- managers coach with evidence instead of intuition +- teams adopt archetype-specific best practices +- routing policies improve faster because they target behaviors, not averages +- executives get clean narrative reporting on spend dynamics + +### Skeptical view +Is it actually useful? It is useful only if archetypes stay stable and map to actions. Otherwise it becomes taxonomy theater. + +Is it feasible? Yes, baseline version is feasible with existing metadata. Advanced value improves with richer tool and outcome signals. + +Main risk: labels can drift as models and workflows change. We mitigate with periodic retraining, versioned labels, and strict “action attached” requirements. + +### Implementation roadmap +Phase 1 (2-3 weeks): clustering baseline and weekly archetype report. +Phase 2 (3-5 weeks): intervention playbooks and policy recommendations per archetype. +Phase 3 (4+ weeks): team benchmarking and archetype drift alerts. + +### Decision rule +Keep investing only if archetypes produce measurable behavior and cost improvements, not just better reporting. + +--- + +## 13) Latent Tool ROI Scanner + +### The pitch +Model choice is not the only cost lever. Tool behavior often dominates spend efficiency, and today that layer is mostly invisible. + +The Latent Tool ROI Scanner identifies which tools and workflows consume disproportionate cost relative to useful outcome, and recommends what to constrain, replace, or redesign. + +This is potentially the highest upside concept, but also the highest epistemic risk. + +### Why we need it +Optimization efforts usually target visible levers. Hidden tool-level waste can remain untouched for months. + +If we can reveal negative-ROI tool patterns, we unlock savings without reducing strategic AI adoption. + +### How it works +The scanner combines daemon telemetry with richer tool-event instrumentation: +- per-tool invocation frequency and cost footprint +- failure and retry signatures +- outcome proxies from delivery systems (tests, merges, ticket transitions) + +It then computes conservative ROI scores and counterfactual scenarios: +- “If we reduce this pattern by 30%, estimated impact is X with confidence band Y.” + +Recommendations are always evidence-backed and confidence-scored. + +### Downstream effects +If accurate: +- identifies hidden spend sinks +- informs platform/tooling investments +- enables high-leverage policy changes with limited developer friction +- strengthens unit economics of agent operations + +### Skeptical view +Is it actually useful today? Not fully. Without stronger outcome labeling, ROI claims can become fragile or misleading. + +Is it feasible with current harnesses? Partially. We can pilot scoring frameworks, but high-confidence production decisions require instrumentation we do not yet have. + +This is exactly where we should avoid overclaiming. + +### Implementation roadmap +Phase 0 (1-2 weeks): instrumentation gap audit and schema design. +Phase 1 (3-4 weeks): tool-event ingestion and normalization pipeline. +Phase 2 (4-6 weeks): conservative ROI scoring + confidence intervals. +Phase 3 (4+ weeks): recommendation engine and controlled experiments. + +### Decision rule +Treat as pilot until precision is validated against human review and external outcomes. If precision is weak, keep this as exploratory analytics. + +--- + +## Portfolio recommendation and sequencing + +If we prioritize for impact times feasibility: +1. Cognitive Flight Recorder +2. Session Archeology Engine +3. Runaway Loop Quencher (advisory first, control later) +4. Latent Tool ROI Scanner (pilot behind instrumentation gate) + +This sequencing gives us near-term value while building the telemetry foundation needed for the harder products. + +The overarching principle: every insight must be tied to an action, every action must be measurable, and every high-stakes claim must carry confidence. + +## Closing +The daemon turns our system from retrospective analytics into a live control surface. These four products are how we monetize and operationalize that shift. + +The question is not whether these ideas are interesting. The question is whether we can ship them with enough truthfulness that teams trust them. + +With staged delivery and strict kill criteria, we can.