gitlore/plans/asupersync-migration.md
teernisse ac5602e565 docs(plans): expand asupersync migration with decision gates, rollback, and invariants
Major additions to the migration plan based on review feedback:

Alternative analysis:
- Add "Why not tokio CancellationToken + JoinSet?" section explaining
  why obligation tracking and single-migration cost favor asupersync
  over incremental tokio fixes.

Error handling depth:
- Add NetworkErrorKind enum design for preserving error categories
  (timeout, DNS, TLS, connection refused) without coupling LoreError
  to any HTTP client.
- Add response body size guard (64 MiB) to prevent unbounded memory
  growth from misconfigured endpoints.

Adapter layer refinements:
- Expand append_query_params with URL fragment handling, edge case
  docs, and doc comments.
- Add contention constraint note for std::sync::Mutex rate limiter.

Cancellation invariants (INV-1 through INV-4):
- Atomic batch writes, no .await between tx open/commit,
  ShutdownSignal + region cancellation complementarity.
- Concrete test plan for each invariant.

Semantic ordering concerns:
- Document 4 behavioral differences when replacing join_all with
  region-spawned tasks (ordering, error aggregation, backpressure,
  late result loss on cancellation).

HTTP behavior parity:
- Replace informational table with concrete acceptance criteria and
  pass/fail tests for redirects, proxy, keep-alive, DNS, TLS, and
  Content-Length.

Phasing refinements:
- Add Cx threading sub-steps (orchestration path first, then
  command/embedding layer) for blast radius reduction.
- Add decision gate between Phase 0d and Phase 1 requiring compile +
  behavioral smoke tests before committing to runtime swap.

Rollback strategy:
- Per-phase rollback guidance with concrete escape hatch triggers
  (nightly breakage > 7d, TLS incompatibility, API instability,
  wiremock issues).

Testing depth:
- Adapter-layer test gap analysis with 5 specific asupersync-native
  integration tests.
- Cancellation integration test specifications.
- Coverage gap documentation for wiremock-on-tokio tests.

Risk register additions:
- Unbounded response body buffering, manual URL/header handling
  correctness.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 13:36:56 -05:00


Plan: Replace Tokio + Reqwest with Asupersync

Date: 2026-03-06
Status: Draft
Decisions: Adapter layer (yes), timeouts in adapter, deep Cx threading, reference doc only


Context

Gitlore uses tokio as its async runtime and reqwest as its HTTP client. Both work, but:

  • Ctrl+C during join_all silently drops in-flight HTTP requests with no cleanup
  • ShutdownSignal is a hand-rolled AtomicBool with no structured cancellation
  • No deterministic testing for concurrent ingestion patterns
  • tokio provides no structured concurrency guarantees

Asupersync is a cancel-correct async runtime with region-owned tasks, obligation tracking, and deterministic lab testing. Replacing tokio+reqwest gives us structured shutdown, cancel-correct ingestion, and testable concurrency.

Trade-offs accepted:

  • Nightly Rust required (asupersync dependency)
  • Pre-1.0 runtime dependency (mitigated by adapter layer + version pinning)
  • Deeper function signature changes for Cx threading

Why not tokio CancellationToken + JoinSet?

The core problems (Ctrl+C drops requests, no structured cancellation) can be fixed without replacing the runtime. Tokio's CancellationToken + JoinSet + explicit task tracking gives structured cancellation for fan-out patterns. This was considered and rejected for two reasons:

  1. Obligation tracking is the real win. CancellationToken/JoinSet fix the "cancel cleanly" problem but don't give us obligation tracking (compile-time proof that all spawned work is awaited) or deterministic lab testing. These are the features that prevent future concurrency bugs, not just the current Ctrl+C issue.
  2. Separation of concerns. Fixing Ctrl+C with tokio primitives first, then migrating the runtime second, doubles the migration effort (rewrite fan-out twice). Since we have no users and no backwards compatibility concerns, a single clean migration is lower total cost.

If asupersync proves unviable (nightly breakage, API instability), the fallback is exactly this: tokio + CancellationToken + JoinSet.


Current Tokio Usage Inventory

Production code (must migrate)

Location API Purpose
main.rs:53 #[tokio::main] Runtime entrypoint
main.rs (4 sites) tokio::spawn + tokio::signal::ctrl_c Ctrl+C signal handlers
gitlab/client.rs:9 tokio::sync::Mutex Rate limiter lock
gitlab/client.rs:10 tokio::time::sleep Rate limiter backoff
gitlab/client.rs:729,736 tokio::join! Parallel pagination

Production code (reqwest -- must replace)

Location Usage
gitlab/client.rs REST API: GET with headers/query, response status/headers/JSON, pagination via x-next-page and Link headers, retry on 429
gitlab/graphql.rs GraphQL: POST with Bearer auth + JSON body, response JSON parsing
embedding/ollama.rs Ollama: GET health check, POST JSON embedding requests

Test code (keep on tokio via dev-dep)

File Tests Uses wiremock?
gitlab/graphql_tests.rs 30 Yes
gitlab/client_tests.rs 4 Yes
embedding/pipeline_tests.rs 4 Yes
ingestion/surgical_tests.rs 4 async Yes

Test code (switch to asupersync)

File Tests Why safe
core/timeline_seed_tests.rs 13 Pure CPU/SQLite, no HTTP, no tokio APIs

Test code (already sync #[test] -- no changes)

~35 test files across documents/, core/, embedding/, gitlab/transformers/, ingestion/, cli/commands/, tests/


Phase 0: Preparation (no runtime change)

Goal: Reduce tokio surface area before the swap. Each step is independently valuable.

0a. Extract signal handler

The 4 identical Ctrl+C handlers in main.rs (lines 1020, 2341, 2493, 2524) become one function in core/shutdown.rs:

pub fn install_ctrl_c_handler(signal: ShutdownSignal) {
    tokio::spawn(async move {
        let _ = tokio::signal::ctrl_c().await;
        eprintln!("\nInterrupted, finishing current batch... (Ctrl+C again to force quit)");
        signal.cancel();
        let _ = tokio::signal::ctrl_c().await;
        std::process::exit(130);
    });
}

4 spawn sites -> 1 function. The function body changes in Phase 3.

0b. Replace tokio::sync::Mutex with std::sync::Mutex

In gitlab/client.rs, the rate limiter lock guards a tiny sync critical section (check Instant::now(), compute delay). No async work inside the lock. std::sync::Mutex is correct and removes a tokio dependency:

// Before
use tokio::sync::Mutex;
let delay = self.rate_limiter.lock().await.check_delay();

// After
use std::sync::Mutex;
let delay = self.rate_limiter.lock().expect("rate limiter poisoned").check_delay();

Note: .expect() over .unwrap() for clarity. Poisoning is near-impossible here (the critical section is a trivial Instant::now() check), but the explicit message aids debugging if it ever fires.

Contention constraint: std::sync::Mutex blocks the executor thread while held. This is safe only because the critical section is a single Instant::now() comparison with no I/O. If the rate limiter ever grows to include async work (HTTP calls, DB queries), it must move back to an async-aware lock. Document this constraint with a comment at the lock site.
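The constraint above can be made concrete with a sketch. RateLimiter and check_delay are assumed names based on the plan's description; the real struct in gitlab/client.rs may differ:

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

// Hypothetical rate limiter guarded by std::sync::Mutex. The entire
// critical section is an Instant::now() comparison: no I/O, no .await,
// which is the only reason a blocking lock is acceptable here.
struct RateLimiter {
    min_interval: Duration,
    last_request: Option<Instant>,
}

impl RateLimiter {
    fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_request: None }
    }

    // Returns how long the caller should sleep before the next request,
    // and records "now" as the latest request time.
    fn check_delay(&mut self) -> Duration {
        let now = Instant::now();
        let delay = match self.last_request {
            Some(last) => self.min_interval.saturating_sub(now - last),
            None => Duration::ZERO,
        };
        self.last_request = Some(now);
        delay
    }
}

fn main() {
    // CONSTRAINT: never add async work inside this lock (see note above).
    let limiter = Mutex::new(RateLimiter::new(Duration::from_millis(100)));
    let first = limiter.lock().expect("rate limiter poisoned").check_delay();
    assert_eq!(first, Duration::ZERO); // first request never waits
    let second = limiter.lock().expect("rate limiter poisoned").check_delay();
    assert!(second <= Duration::from_millis(100));
}
```

The caller then does `sleep(delay).await` outside the lock, so the executor thread is never blocked for more than the Instant comparison.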

0c. Replace tokio::join! with futures::join!

In gitlab/client.rs:729,736. futures::join! is runtime-agnostic and already in deps.

After Phase 0, remaining tokio in production code:

  • #[tokio::main] (1 site)
  • tokio::spawn + tokio::signal::ctrl_c (1 function)
  • tokio::time::sleep (1 import)

Phase 0d: Error Type Migration (must precede adapter layer)

The adapter layer (Phase 1) uses GitLabNetworkError { detail: Option<String> }, which requires this error type change before the adapter compiles. Placed here so Phases 1-3 compile as a unit.

src/core/error.rs

// Remove:
#[error("HTTP error: {0}")]
Http(#[from] reqwest::Error),

// Change:
#[error("Cannot connect to GitLab at {base_url}")]
GitLabNetworkError {
    base_url: String,
    // Before: source: Option<reqwest::Error>
    // After:
    detail: Option<String>,
},

The adapter layer stringifies HTTP client errors at the boundary so LoreError doesn't depend on any HTTP client's error types. This also means the existing reqwest call sites that construct GitLabNetworkError must be updated to pass detail: Some(format!("{e:?}")) instead of source: Some(e) -- but those sites are rewritten in Phase 2 anyway, so no extra work.

Note on error granularity: Flattening all HTTP errors to detail: Option<String> loses the distinction between timeouts, TLS failures, DNS resolution failures, and connection resets. To preserve actionable error categories without coupling LoreError to any HTTP client, add a lightweight NetworkErrorKind enum:

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NetworkErrorKind {
    Timeout,
    ConnectionRefused,
    DnsResolution,
    Tls,
    Other,
}

#[error("Cannot connect to GitLab at {base_url}")]
GitLabNetworkError {
    base_url: String,
    kind: NetworkErrorKind,
    detail: Option<String>,
},

The adapter's execute() method classifies errors at the boundary:

  • Timeout from asupersync::time::timeout → NetworkErrorKind::Timeout
  • Transport errors from the HTTP client → classified by error type into the appropriate kind
  • Unknown errors → NetworkErrorKind::Other

This keeps LoreError client-agnostic while preserving the ability to make retry decisions based on error type (e.g., retry on timeout but not on TLS). The adapter's execute() method is the single place where this mapping happens, so adding new kinds is localized.
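A minimal sketch of what that classifier could look like. Since asupersync's transport error type is not pinned down here, this version classifies by the error's Debug output; the real implementation should match on the client's error enum instead. The function name and heuristics are assumptions, and NetworkErrorKind is repeated only to keep the sketch self-contained:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NetworkErrorKind {
    Timeout,
    ConnectionRefused,
    DnsResolution,
    Tls,
    Other,
}

// Hypothetical boundary classifier called from the adapter's execute().
// Heuristic: inspect the Debug representation; a real version would match
// structurally on the HTTP client's error variants.
fn classify_transport_error<E: std::fmt::Debug>(e: &E) -> NetworkErrorKind {
    let msg = format!("{e:?}").to_ascii_lowercase();
    if msg.contains("timed out") || msg.contains("timeout") {
        NetworkErrorKind::Timeout
    } else if msg.contains("connection refused") {
        NetworkErrorKind::ConnectionRefused
    } else if msg.contains("dns") || msg.contains("resolve") {
        NetworkErrorKind::DnsResolution
    } else if msg.contains("tls") || msg.contains("certificate") {
        NetworkErrorKind::Tls
    } else {
        NetworkErrorKind::Other
    }
}

fn main() {
    let refused =
        std::io::Error::new(std::io::ErrorKind::ConnectionRefused, "connection refused");
    assert_eq!(classify_transport_error(&refused), NetworkErrorKind::ConnectionRefused);
    assert_eq!(
        classify_transport_error(&"handshake failed: bad tls certificate"),
        NetworkErrorKind::Tls
    );
    assert_eq!(classify_transport_error(&"something else"), NetworkErrorKind::Other);
}
```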


Phase 1: Build the HTTP Adapter Layer

Why

Asupersync's HttpClient is lower-level than reqwest:

  • Headers: Vec<(String, String)> not typed HeaderMap/HeaderValue
  • Body: Vec<u8> not a builder with .json()
  • Status: raw u16 not StatusCode enum
  • Response: body already buffered, no async .json().await
  • No per-request timeout

Without an adapter, every call site becomes 5-6 lines of boilerplate. The adapter also isolates gitlore from asupersync's pre-1.0 HTTP API.

New file: src/http.rs (~100 LOC)

use asupersync::http::h1::{HttpClient, HttpClientConfig, PoolConfig};
use asupersync::http::h1::types::Method;
use asupersync::time::timeout;
use serde::de::DeserializeOwned;
use serde::Serialize;
use std::time::Duration;

use crate::core::error::{LoreError, Result};

pub struct Client {
    inner: HttpClient,
    timeout: Duration,
}

pub struct Response {
    pub status: u16,
    pub reason: String,
    pub headers: Vec<(String, String)>,
    body: Vec<u8>,
}

impl Client {
    pub fn with_timeout(timeout: Duration) -> Self {
        Self {
            inner: HttpClient::with_config(HttpClientConfig {
                pool_config: PoolConfig::builder()
                    .max_connections_per_host(6)
                    .max_total_connections(100)
                    .idle_timeout(Duration::from_secs(90))
                    .build(),
                ..Default::default()
            }),
            timeout,
        }
    }

    pub async fn get(&self, url: &str, headers: &[(&str, &str)]) -> Result<Response> {
        self.execute(Method::Get, url, headers, Vec::new()).await
    }

    pub async fn get_with_query(
        &self,
        url: &str,
        params: &[(&str, String)],
        headers: &[(&str, &str)],
    ) -> Result<Response> {
        let full_url = append_query_params(url, params);
        self.execute(Method::Get, &full_url, headers, Vec::new()).await
    }

    pub async fn post_json<T: Serialize>(
        &self,
        url: &str,
        headers: &[(&str, &str)],
        body: &T,
    ) -> Result<Response> {
        let body_bytes = serde_json::to_vec(body)
            .map_err(|e| LoreError::Other(format!("JSON serialization failed: {e}")))?;
        let mut all_headers = headers.to_vec();
        all_headers.push(("Content-Type", "application/json"));
        self.execute(Method::Post, url, &all_headers, body_bytes).await
    }

    async fn execute(
        &self,
        method: Method,
        url: &str,
        headers: &[(&str, &str)],
        body: Vec<u8>,
    ) -> Result<Response> {
        let header_tuples: Vec<(String, String)> = headers
            .iter()
            .map(|(k, v)| ((*k).to_owned(), (*v).to_owned()))
            .collect();

        let raw = timeout(self.timeout, self.inner.request(method, url, header_tuples, body))
            .await
            .map_err(|_| LoreError::GitLabNetworkError {
                base_url: url.to_string(),
                kind: NetworkErrorKind::Timeout,
                detail: Some(format!("Request timed out after {:?}", self.timeout)),
            })?
            .map_err(|e| LoreError::GitLabNetworkError {
                base_url: url.to_string(),
                kind: classify_transport_error(&e),
                detail: Some(format!("{e:?}")),
            })?;

        Ok(Response {
            status: raw.status,
            reason: raw.reason,
            headers: raw.headers,
            body: raw.body,
        })
    }
}

impl Response {
    pub fn is_success(&self) -> bool {
        (200..300).contains(&self.status)
    }

    pub fn json<T: DeserializeOwned>(&self) -> Result<T> {
        serde_json::from_slice(&self.body)
            .map_err(|e| LoreError::Other(format!("JSON parse error: {e}")))
    }

    pub fn text(self) -> Result<String> {
        String::from_utf8(self.body)
            .map_err(|e| LoreError::Other(format!("UTF-8 decode error: {e}")))
    }

    pub fn header(&self, name: &str) -> Option<&str> {
        self.headers
            .iter()
            .find(|(k, _)| k.eq_ignore_ascii_case(name))
            .map(|(_, v)| v.as_str())
    }

    /// Returns all values for a header name (case-insensitive).
    /// Needed for multi-value headers like `Link` used in pagination.
    pub fn headers_all(&self, name: &str) -> Vec<&str> {
        self.headers
            .iter()
            .filter(|(k, _)| k.eq_ignore_ascii_case(name))
            .map(|(_, v)| v.as_str())
            .collect()
    }
}

/// Appends query parameters to a URL.
///
/// Edge cases handled:
/// - URLs with existing `?query` → appends with `&`
/// - URLs with `#fragment` → inserts query before fragment
/// - Empty params → returns URL unchanged
/// - Repeated keys → preserved as-is (GitLab API uses repeated `labels[]`)
fn append_query_params(url: &str, params: &[(&str, String)]) -> String {
    if params.is_empty() {
        return url.to_string();
    }
    let query: String = params
        .iter()
        .map(|(k, v)| format!("{}={}", urlencoding::encode(k), urlencoding::encode(v)))
        .collect::<Vec<_>>()
        .join("&");

    // Preserve URL fragments: split on '#', insert query, rejoin
    let (base, fragment) = match url.split_once('#') {
        Some((b, f)) => (b, Some(f)),
        None => (url, None),
    };
    let with_query = if base.contains('?') {
        format!("{base}&{query}")
    } else {
        format!("{base}?{query}")
    };
    match fragment {
        Some(f) => format!("{with_query}#{f}"),
        None => with_query,
    }
}
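A usage check for the edge cases listed above. The function is repeated here with a passthrough in place of urlencoding::encode so the joining logic runs standalone; the real adapter percent-encodes both keys and values:

```rust
// Same joining logic as the adapter's append_query_params, minus encoding.
fn append_query_params(url: &str, params: &[(&str, String)]) -> String {
    if params.is_empty() {
        return url.to_string();
    }
    let query: String = params
        .iter()
        .map(|(k, v)| format!("{k}={v}")) // real code: urlencoding::encode on both
        .collect::<Vec<_>>()
        .join("&");

    // Preserve URL fragments: split on '#', insert query, rejoin
    let (base, fragment) = match url.split_once('#') {
        Some((b, f)) => (b, Some(f)),
        None => (url, None),
    };
    let with_query = if base.contains('?') {
        format!("{base}&{query}")
    } else {
        format!("{base}?{query}")
    };
    match fragment {
        Some(f) => format!("{with_query}#{f}"),
        None => with_query,
    }
}

fn main() {
    // No existing query: appends with '?'
    assert_eq!(
        append_query_params("https://gitlab.example.com/api", &[("page", "2".to_string())]),
        "https://gitlab.example.com/api?page=2"
    );
    // Existing query plus fragment: appends with '&', fragment stays last
    assert_eq!(
        append_query_params(
            "https://gitlab.example.com/api?x=0#frag",
            &[("per_page", "50".to_string())]
        ),
        "https://gitlab.example.com/api?x=0&per_page=50#frag"
    );
    // Empty params: URL unchanged
    assert_eq!(append_query_params("https://gitlab.example.com/api", &[]),
               "https://gitlab.example.com/api");
}
```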

Response body size guard

The adapter buffers entire response bodies in memory (Vec<u8>). A misconfigured endpoint or unexpected redirect to a large file could cause unbounded memory growth. Add a max response body size check in execute():

const MAX_RESPONSE_BODY_BYTES: usize = 64 * 1024 * 1024; // 64 MiB — generous for JSON, catches runaways

// In execute(), after receiving raw response:
if raw.body.len() > MAX_RESPONSE_BODY_BYTES {
    return Err(LoreError::Other(format!(
        "Response body too large: {} bytes (max {})",
        raw.body.len(),
        MAX_RESPONSE_BODY_BYTES,
    )));
}

This is a safety net, not a tight constraint. GitLab JSON responses are typically < 1 MiB. Ollama embedding responses are < 100 KiB per batch. The 64 MiB limit catches runaways without interfering with normal operation.

Timeout behavior

Every request is wrapped with asupersync::time::timeout(self.timeout, ...). Default timeouts:

  • GitLab REST/GraphQL: 30s
  • Ollama: configurable (default 60s)
  • Ollama health check: 5s

Phase 2: Migrate the 3 HTTP Modules

2a. gitlab/client.rs (REST API)

Imports:

// Remove
use reqwest::header::{ACCEPT, HeaderMap, HeaderValue};
use reqwest::{Client, Response, StatusCode};

// Add
use crate::http::{Client, Response};

Client construction (lines 68-96):

// Before: reqwest::Client::builder().default_headers(h).timeout(d).build()
// After:
let client = Client::with_timeout(Duration::from_secs(30));

request() method (lines 129-170):

// Before
let response = self.client.get(&url)
    .header("PRIVATE-TOKEN", &self.token)
    .send().await
    .map_err(|e| LoreError::GitLabNetworkError { ... })?;

// After
let response = self.client.get(&url, &[
    ("PRIVATE-TOKEN", &self.token),
    ("Accept", "application/json"),
]).await?;

request_with_headers() method (lines 510-559):

// Before
let response = self.client.get(&url)
    .query(params)
    .header("PRIVATE-TOKEN", &self.token)
    .send().await?;
let headers = response.headers().clone();

// After
let response = self.client.get_with_query(&url, params, &[
    ("PRIVATE-TOKEN", &self.token),
    ("Accept", "application/json"),
]).await?;
// headers already owned in response.headers

handle_response() (lines 182-219):

// Before: async fn (consumed body with .text().await)
// After: sync fn (body already buffered in Response)
fn handle_response<T: DeserializeOwned>(&self, response: Response, path: &str) -> Result<T> {
    match response.status {
        401 => Err(LoreError::GitLabAuthFailed),
        404 => Err(LoreError::GitLabNotFound { resource: path.into() }),
        429 => {
            let retry_after = response.header("retry-after")
                .and_then(|v| v.parse().ok())
                .unwrap_or(60);
            Err(LoreError::GitLabRateLimited { retry_after })
        }
        s if (200..300).contains(&s) => response.json::<T>(),
        s => Err(LoreError::Other(format!("GitLab API error: {} {}", s, response.reason))),
    }
}

Pagination -- No structural changes. async_stream::stream! and header parsing stay the same. Only the response type changes:

// Before: headers.get("x-next-page").and_then(|v| v.to_str().ok())
// After:  response.header("x-next-page")

parse_link_header_next -- Change signature from (headers: &HeaderMap) to (headers: &[(String, String)]) and find by case-insensitive name.
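The post-migration signature can be sketched as follows, assuming GitLab's standard Link header format (`<URL>; rel="next", <URL>; rel="last"`). This is a sketch of the new shape, not the existing parser's exact code:

```rust
// parse_link_header_next over the adapter's owned header pairs: find the
// Link header case-insensitively and extract the rel="next" URL.
fn parse_link_header_next(headers: &[(String, String)]) -> Option<String> {
    headers
        .iter()
        .filter(|(k, _)| k.eq_ignore_ascii_case("link"))
        .flat_map(|(_, v)| v.split(','))
        .find_map(|part| {
            // Each part looks like: <URL>; rel="next"
            let (url, params) = part.trim().split_once(';')?;
            if params.contains("rel=\"next\"") {
                Some(url.trim().trim_start_matches('<').trim_end_matches('>').to_string())
            } else {
                None
            }
        })
}

fn main() {
    let headers = vec![(
        "Link".to_string(),
        "<https://gitlab.example.com/api/v4/issues?page=2>; rel=\"next\", \
         <https://gitlab.example.com/api/v4/issues?page=9>; rel=\"last\""
            .to_string(),
    )];
    assert_eq!(
        parse_link_header_next(&headers).as_deref(),
        Some("https://gitlab.example.com/api/v4/issues?page=2")
    );
    assert_eq!(parse_link_header_next(&[]), None);
}
```

Note the naive split on ',' assumes GitLab never puts commas inside the URLs themselves, which holds for its pagination links.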

2b. gitlab/graphql.rs

// Before
let response = self.http.post(&url)
    .header("Authorization", format!("Bearer {}", self.token))
    .header("Content-Type", "application/json")
    .json(&body).send().await?;
let json: Value = response.json().await?;

// After
let bearer = format!("Bearer {}", self.token);
let response = self.http.post_json(&url, &[
    ("Authorization", &bearer),
], &body).await?;
let json: Value = response.json()?;

Status matching changes from response.status().as_u16() to response.status (already u16).

2c. embedding/ollama.rs

// Health check
let response = self.client.get(&url, &[]).await?;
let tags: TagsResponse = response.json()?;

// Embed batch
let response = self.client.post_json(&url, &[], &request).await?;
if !response.is_success() {
    let status = response.status; // capture before .text() consumes response
    let body = response.text()?;
    return Err(LoreError::EmbeddingFailed { document_id: 0, reason: format!("HTTP {status}: {body}") });
}
let embed_response: EmbedResponse = response.json()?;

Standalone health check (check_ollama_health): Currently creates a temporary reqwest::Client. Replace with temporary crate::http::Client:

pub async fn check_ollama_health(base_url: &str) -> bool {
    let client = Client::with_timeout(Duration::from_secs(5));
    let url = format!("{base_url}/api/tags");
    client.get(&url, &[]).await.map_or(false, |r| r.is_success())
}

Phase 3: Swap the Runtime + Deep Cx Threading

3a. Cargo.toml

[dependencies]
# Remove:
# reqwest = { version = "0.12", features = ["json"] }
# tokio = { version = "1", features = ["rt-multi-thread", "macros", "time", "signal"] }

# Add:
asupersync = { version = "0.2", features = ["tls", "tls-native-roots"] }

# Keep unchanged:
async-stream = "0.3"
futures = { version = "0.3", default-features = false, features = ["alloc"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
urlencoding = "2"

[dev-dependencies]
tempfile = "3"
wiremock = "0.6"
tokio = { version = "1", features = ["rt", "macros"] }

3b. rust-toolchain.toml

[toolchain]
channel = "nightly-2026-03-01"  # Pin specific date to avoid surprise breakage

Update the date as needed when newer nightlies are verified. Never use bare "nightly" in production.

3c. Entrypoint (main.rs:53)

// Before
#[tokio::main]
async fn main() -> Result<()> { ... }

// After
#[asupersync::main]
async fn main(cx: &Cx) -> Outcome<()> { ... }

3d. Signal handler (core/shutdown.rs)

// After (Phase 0 extracted it; now rewrite for asupersync)
pub async fn install_ctrl_c_handler(cx: &Cx, signal: ShutdownSignal) {
    cx.spawn("ctrl-c-handler", async move |cx| {
        cx.shutdown_signal().await;
        eprintln!("\nInterrupted, finishing current batch... (Ctrl+C again to force quit)");
        signal.cancel();
        // Preserve hard-exit on second Ctrl+C (same behavior as Phase 0a)
        cx.shutdown_signal().await;
        std::process::exit(130);
    });
}

Cleanup concern: std::process::exit(130) on second Ctrl+C bypasses all drop guards, flush operations, and asupersync region cleanup. This is intentional (user demanded hard exit) but means any in-progress DB transaction will be abandoned mid-write. SQLite's journaling makes this safe (uncommitted transactions are rolled back on next open), but verify this holds for WAL mode if enabled. Consider logging a warning before exit so users understand incomplete operations may need re-sync.

3e. Rate limiter sleep

// Before
use tokio::time::sleep;

// After
use asupersync::time::sleep;

3f. Deep Cx threading

Thread Cx from main() through command dispatch into the orchestrator and ingestion modules. This enables region-scoped cancellation for join_all batches.

Function signatures that need cx: &Cx added:

Module Functions
main.rs Command dispatch match arms for sync, ingest, embed
cli/commands/sync.rs run_sync()
cli/commands/ingest.rs run_ingest_command(), run_ingest()
cli/commands/embed.rs run_embed()
cli/commands/sync_surgical.rs run_sync_surgical()
ingestion/orchestrator.rs ingest_issues(), ingest_merge_requests(), ingest_discussions(), etc.
ingestion/surgical.rs surgical_sync()
embedding/pipeline.rs embed_documents(), embed_batch_group()

Region wrapping for join_all batches (orchestrator.rs):

// Before
let prefetched_batch = join_all(prefetch_futures).await;

// After -- cancel-correct region with result collection
let (tx, rx) = std::sync::mpsc::channel();
cx.region(|scope| async {
    for future in prefetch_futures {
        let tx = tx.clone();
        scope.spawn(async move |_cx| {
            let result = future.await;
            let _ = tx.send(result);
        });
    }
    drop(tx);
}).await;
let prefetched_batch: Vec<_> = rx.into_iter().collect();

IMPORTANT: Semantic differences. Replacing join_all with region-spawned tasks changes four behaviors:

  1. Ordering: join_all preserves input order — results[i] corresponds to futures[i]. The std::sync::mpsc channel pattern does not preserve it (results arrive in completion order). If downstream logic assumes positional alignment (e.g., zipping results with input items by index), this is a silent correctness bug. Options:

    • Send (index, result) tuples through the channel and sort by index after collection.
    • If scope.spawn() returns a JoinHandle<T>, collect handles in order and await them sequentially.
  2. Error aggregation: join_all runs all futures to completion even if some fail, collecting all results. Region-spawned tasks with a channel will also run all tasks, but if the region is cancelled mid-flight (e.g., Ctrl+C), some results are lost. Decide per call site: should partial results be processed, or should the entire batch be retried?

  3. Backpressure: join_all with N futures creates N concurrent tasks. Region-spawned tasks behave similarly, but if the region has concurrency limits, backpressure semantics change. Verify asupersync's region API does not impose implicit concurrency caps.

  4. Late result loss on cancellation: When a region is cancelled, tasks that have completed but whose results haven't been received yet may have already sent to the channel. However, tasks that are mid-flight will be dropped, and their results never sent. The channel receiver must drain whatever was sent, but the caller must treat a cancelled region's results as incomplete — never assume all N results arrived. Document per call site whether partial results are safe to process or whether the entire batch should be discarded on cancellation.

Audit every join_all call site for all four assumptions before choosing the pattern.

Note: The exact result-collection pattern depends on asupersync's region API. If scope.spawn() returns a JoinHandle<T>, prefer collecting handles and awaiting them (preserves ordering and simplifies error handling).
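The indexed-send option can be sketched as follows. Plain std threads stand in for region-spawned tasks purely to show the collection pattern; in the migration the spawns would be scope.spawn calls inside cx.region, and the sleeps are only there to force out-of-order completion:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    let inputs = vec![30u64, 10, 20]; // stand-ins for prefetch jobs

    let (tx, rx) = mpsc::channel();
    for (i, ms) in inputs.iter().copied().enumerate() {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(ms)); // finishes out of order
            let _ = tx.send((i, ms * 2)); // tag each result with its input index
        });
    }
    drop(tx); // receiver sees end-of-stream once every sender is gone

    // Collect in completion order, then restore positional alignment.
    let mut results: Vec<(usize, u64)> = rx.into_iter().collect();
    results.sort_by_key(|(i, _)| *i);
    let ordered: Vec<u64> = results.into_iter().map(|(_, r)| r).collect();
    assert_eq!(ordered, vec![60, 20, 40]); // ordered[i] matches inputs[i] again
}
```

On cancellation, the same drain-then-sort code runs, but fewer than N tagged results may arrive, which is exactly the partial-result case item 4 above requires each call site to handle explicitly.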

This is the biggest payoff: if Ctrl+C fires during a prefetch batch, the region cancels all in-flight HTTP requests with bounded cleanup instead of silently dropping them.

Estimated signature changes: ~15 functions gain a cx: &Cx parameter.

Phasing the Cx threading (risk reduction): Rather than threading cx through all ~15 functions at once, split into two steps:

  • Step 1: Thread cx through the orchestration path only (main.rs dispatch → run_sync/run_ingest → orchestrator functions). This is where region-wrapping join_all batches happens — the actual cancellation payoff. Verify invariants pass.
  • Step 2: Widen to the command layer and embedding pipeline (run_embed, embed_documents, embed_batch_group, sync_surgical). These are lower-risk since they don't have the same fan-out patterns.

This reduces the blast radius of Step 1 and provides an earlier validation checkpoint. If Step 1 surfaces problems, Step 2 hasn't been started yet.


Phase 4: Test Migration

Keep on #[tokio::test] (wiremock tests -- 42 tests)

No changes. tokio is in [dev-dependencies] with features = ["rt", "macros"].

Coverage gap: These tests validate protocol correctness (request format, response parsing, status code handling, pagination) through the adapter layer, but they do NOT exercise asupersync's runtime behavior (timeouts, connection pooling, cancellation). This is acceptable because:

  1. Protocol correctness is the higher-value test target — it catches most regressions
  2. Runtime-specific behavior is covered by the new cancellation integration tests (below)
  3. The adapter layer is thin enough that runtime differences are unlikely to affect request/response semantics

Adapter-layer test gap: The 42 wiremock tests validate protocol correctness (request format, response parsing) but run on tokio, not asupersync. This means the adapter's actual behavior under the production runtime is untested by mocked-response tests. To close this gap, add five asupersync-native integration tests that exercise the adapter against a simple HTTP server (e.g., hyper or a raw TCP listener) rather than wiremock:

  1. GET with headers + JSON response — verify header passing and JSON deserialization through the adapter.
  2. POST with JSON body — verify Content-Type injection and body serialization.
  3. 429 + Retry-After — verify the adapter surfaces rate-limit responses correctly.
  4. Timeout — verify the adapter's asupersync::time::timeout wrapper fires.
  5. Large response rejection — verify the body size guard triggers.

These tests are cheap to write (~50 LOC each) and close the "works on tokio but does it work on asupersync?" gap that GPT 5.3 flagged.

File Tests
gitlab/graphql_tests.rs 30
gitlab/client_tests.rs 4
embedding/pipeline_tests.rs 4
ingestion/surgical_tests.rs 4

Switch to #[asupersync::test] (no wiremock -- 13 tests)

File Tests
core/timeline_seed_tests.rs 13

Already #[test] (sync -- ~35 files)

No changes needed.

New: Cancellation integration tests (asupersync-native)

Wiremock tests on tokio validate protocol/serialization correctness but cannot test asupersync's cancellation and region semantics. Add asupersync-native integration tests for:

  1. Ctrl+C during fan-out: Simulate cancellation mid-batch in orchestrator. Verify all in-flight tasks are drained, no task leaks, no obligation leaks.
  2. Region quiescence: Verify that after a region completes (normal or cancelled), no background tasks remain running.
  3. Transaction integrity under cancellation: Cancel during an ingestion batch that has fetched data but not yet written to DB. Verify no partial data is committed.

These tests use asupersync's deterministic lab runtime, which is one of the primary motivations for this migration.


Phase 5: Verify and Harden

Verification checklist

cargo check --all-targets
cargo clippy --all-targets -- -D warnings
cargo fmt --check
cargo test

Specific things to verify

  1. async-stream on nightly -- Does async_stream 0.3 compile on current nightly?
  2. TLS root certs on macOS -- Does tls-native-roots pick up system CA certs?
  3. Connection pool under concurrency -- Do join_all batches (4-8 concurrent requests to same host) work without pool deadlock?
  4. Pagination streams -- Do async_stream::stream! pagination generators work unchanged?
  5. Wiremock test isolation -- Do wiremock tests pass with tokio only in dev-deps?

HTTP behavior parity acceptance criteria

reqwest provides several implicit behaviors that asupersync's h1 client may not. Each must pass a concrete acceptance test before the migration is considered complete:

  • Automatic redirect following (up to 10)
    Criterion: if GitLab returns 3xx, gitlore must not silently lose the response; either follow the redirect or surface a clear error.
    Test: send a request to wiremock returning 301 → verify the adapter returns the redirect status (not an opaque failure).
  • Automatic gzip/deflate decompression
    Criterion: not required; JSON responses are small.
    Test: none needed.
  • Proxy from HTTP_PROXY/HTTPS_PROXY env
    Criterion: if HTTP_PROXY is set, requests must route through it; if asupersync lacks proxy support, document this as a known limitation.
    Test: set HTTP_PROXY=http://127.0.0.1:9999 → verify the connection attempt targets the proxy, or document that proxy is unsupported.
  • Connection keep-alive
    Criterion: pagination batches (4-8 sequential requests to the same host) must reuse connections.
    Test: measure with ss/netstat: 8 paginated requests should use ≤2 TCP connections.
  • System DNS resolution
    Criterion: hostnames must resolve via the OS resolver.
    Test: verify lore sync works against a hostname (not just an IP).
  • Request body Content-Length
    Criterion: POST requests must include a Content-Length header (some proxies/WAFs require it).
    Test: inspect outgoing request headers in a wiremock test.
  • TLS certificate validation
    Criterion: HTTPS requests must validate server certificates using the system CA store.
    Test: verify lore sync succeeds against production GitLab (valid cert) and fails against a self-signed cert.

Cancellation + DB transaction invariants

Region-based cancellation stops HTTP tasks cleanly, but partial ingestion can leave the database in an inconsistent state if cancellation fires between "fetched data" and "wrote to DB". The following invariants must hold and be tested:

INV-1: Atomic batch writes. Each ingestion batch (issues, MRs, discussions) writes to the DB inside a single unchecked_transaction(). If the transaction is not committed, no partial data from that batch is visible. This is already the case for most ingestion paths — audit all paths and fix any that write outside a transaction.

INV-2: Region cancellation cannot corrupt committed data. A cancelled region may abandon in-flight HTTP requests, but it must not interrupt a DB transaction mid-write. This holds naturally because SQLite transactions are synchronous (not async) — once tx.execute() starts, it runs to completion on the current thread regardless of task cancellation. Verify this assumption holds for WAL mode.

Hard rule: no .await between transaction open and commit/rollback. Cancellation can fire at any .await point. If an .await exists between unchecked_transaction() and tx.commit(), a cancelled region could drop the transaction guard mid-batch, rolling back partial writes silently. Audit all ingestion paths to confirm this invariant holds. If any path must do async work mid-transaction (e.g., fetching related data), restructure to fetch-then-write: complete all async work first, then open the transaction, write synchronously, and commit.
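The fetch-then-write rule can be illustrated with a stdlib-only sketch. FakeTx below is a stand-in for rusqlite's transaction guard (pending writes are discarded on drop, persisted on commit) — it is not the project's real API, but it shows why a cancelled drop between open and commit silently loses the batch, and why the transaction scope must contain no suspension points:

```rust
// FakeTx mimics a rollback-on-drop transaction guard. Hypothetical
// stand-in for illustration only; not rusqlite's actual types.
struct FakeTx<'a> {
    store: &'a mut Vec<String>, // the "database"
    pending: Vec<String>,       // uncommitted writes
    committed: bool,
}

impl<'a> FakeTx<'a> {
    fn new(store: &'a mut Vec<String>) -> Self {
        FakeTx { store, pending: Vec::new(), committed: false }
    }
    fn execute(&mut self, row: &str) {
        self.pending.push(row.to_string());
    }
    fn commit(mut self) {
        self.committed = true;
        let rows = std::mem::take(&mut self.pending);
        self.store.extend(rows); // all rows become visible at once
    }
}

impl<'a> Drop for FakeTx<'a> {
    fn drop(&mut self) {
        // Dropped uncommitted (what cancellation at an .await would do):
        // pending rows vanish silently.
        if !self.committed {
            self.pending.clear();
        }
    }
}

fn ingest_batch(db: &mut Vec<String>, fetched: &[String]) {
    // Fetch-then-write: all async work finished before this point, so
    // the open..commit span below has no .await and cannot be cancelled
    // mid-batch.
    let mut tx = FakeTx::new(db);
    for row in fetched {
        tx.execute(row);
    }
    tx.commit(); // atomic: either every row lands, or none do
}

fn main() {
    let mut db: Vec<String> = Vec::new();
    ingest_batch(&mut db, &["issue-1".to_string(), "issue-2".to_string()]);
    assert_eq!(db.len(), 2);

    // Simulate cancellation mid-batch: the guard drops uncommitted.
    {
        let mut tx = FakeTx::new(&mut db);
        tx.execute("issue-3");
        // no commit() -- rolled back on drop
    }
    assert_eq!(db.len(), 2); // no partial rows visible (INV-1)
}
```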

INV-3: No partial batch visibility. If cancellation fires after fetching N items but before the batch transaction commits, zero items from that batch are persisted. The next sync picks up where it left off using cursor-based pagination.

INV-4: ShutdownSignal + region cancellation are complementary. The existing ShutdownSignal check-before-write pattern in orchestrator loops (if signal.is_cancelled() { break; }) remains the first line of defense. Region cancellation is the second — it ensures in-flight HTTP tasks are drained even if the orchestrator loop has already moved past the signal check. Both mechanisms must remain active.
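A minimal sketch of the check-before-write half of INV-4, using a stdlib AtomicBool as a stand-in for the real ShutdownSignal in src/core/shutdown.rs. Region cancellation, the second mechanism, is not modeled here — this only shows that the loop stops at the next signal check:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Stand-in for the project's ShutdownSignal; mirrors only the
// is_cancelled() usage quoted above, not the real implementation.
struct ShutdownSignal(Arc<AtomicBool>);

impl ShutdownSignal {
    fn new() -> Self { ShutdownSignal(Arc::new(AtomicBool::new(false))) }
    fn cancel(&self) { self.0.store(true, Ordering::SeqCst); }
    fn is_cancelled(&self) -> bool { self.0.load(Ordering::SeqCst) }
}

fn run_batches(signal: &ShutdownSignal, batches: &[&str]) -> usize {
    let mut processed = 0;
    for _batch in batches {
        // First line of defense: check before starting the next batch.
        if signal.is_cancelled() {
            break;
        }
        // (Region cancellation would drain in-flight HTTP tasks here.)
        processed += 1;
        if processed == 2 {
            signal.cancel(); // simulate Ctrl+C arriving mid-run
        }
    }
    processed
}

fn main() {
    let signal = ShutdownSignal::new();
    let n = run_batches(&signal, &["a", "b", "c", "d"]);
    assert_eq!(n, 2); // loop exits at the first check after cancellation
}
```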

Test plan for invariants:

  • INV-1: Cancellation integration test — cancel mid-batch, verify DB has zero partial rows from that batch
  • INV-2: Verify unchecked_transaction() commit is not interruptible by task cancellation (lab runtime test)
  • INV-3: Cancel after fetch, re-run sync, verify no duplicates and no gaps
  • INV-4: Verify both ShutdownSignal and region cancellation are triggered on Ctrl+C
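The resume property behind the INV-3 test can be sketched with a cursor that advances only when a batch commits; the function and parameter names are illustrative, not the project's real pagination API:

```rust
// Cursor-based resume: a cancelled run leaves the cursor at the last
// committed batch, so the next run produces no duplicates and no gaps.
fn sync(source: &[u32], cursor: &mut usize, db: &mut Vec<u32>,
        batch_size: usize, cancel_after_batches: Option<usize>) {
    let mut batches_done = 0;
    while *cursor < source.len() {
        if cancel_after_batches == Some(batches_done) {
            return; // cancellation: cursor stays at last committed batch
        }
        let end = (*cursor + batch_size).min(source.len());
        db.extend_from_slice(&source[*cursor..end]); // "commit" the batch
        *cursor = end; // advances only after the commit
        batches_done += 1;
    }
}

fn main() {
    let source: Vec<u32> = (0..10).collect();
    let (mut cursor, mut db) = (0usize, Vec::new());

    // First run is cancelled after one batch of 4.
    sync(&source, &mut cursor, &mut db, 4, Some(1));
    assert_eq!(db, vec![0, 1, 2, 3]);

    // Re-running completes: no duplicates, no gaps.
    sync(&source, &mut cursor, &mut db, 4, None);
    assert_eq!(db, (0..10).collect::<Vec<u32>>());
}
```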

File Change Summary

| File | Change | LOC |
| --- | --- | --- |
| Cargo.toml | Swap deps | ~10 |
| rust-toolchain.toml | NEW -- set nightly | 3 |
| src/http.rs | NEW -- adapter layer | ~100 |
| src/main.rs | Entrypoint macro, Cx threading, remove 4 signal handlers | ~40 |
| src/core/shutdown.rs | Extract + rewrite signal handler | ~20 |
| src/core/error.rs | Remove reqwest::Error, change GitLabNetworkError (Phase 0d) | ~10 |
| src/gitlab/client.rs | Replace reqwest, remove tokio imports, adapt all methods | ~80 |
| src/gitlab/graphql.rs | Replace reqwest | ~20 |
| src/embedding/ollama.rs | Replace reqwest | ~20 |
| src/cli/commands/sync.rs | Add Cx param | ~5 |
| src/cli/commands/ingest.rs | Add Cx param | ~5 |
| src/cli/commands/embed.rs | Add Cx param | ~5 |
| src/cli/commands/sync_surgical.rs | Add Cx param | ~5 |
| src/ingestion/orchestrator.rs | Add Cx param, region-wrap join_all | ~30 |
| src/ingestion/surgical.rs | Add Cx param | ~10 |
| src/embedding/pipeline.rs | Add Cx param | ~10 |
| src/core/timeline_seed_tests.rs | Swap test macro | ~13 |

Total: 15 files modified, 2 new files (rust-toolchain.toml, src/http.rs), ~400-500 LOC changed.


Execution Order

Phase 0a-0c (prep, safe, independent)
  |
  v
Phase 0d (error type migration -- required before adapter compiles)
  |
  v
DECISION GATE: verify nightly + asupersync + tls-native-roots compile AND behavioral smoke tests pass
  |
  v
Phase 1 (adapter layer, compiles but unused) -----+
  |                                               |
  v                                               |  These 3 are one
Phase 2 (migrate 3 HTTP modules to adapter) ------+  atomic commit
  |                                               |
  v                                               |
Phase 3 (swap runtime, Cx threading) -------------+
  |
  v
Phase 4 (test migration)
  |
  v
Phase 5 (verify + harden)

Phase 0a-0c can be committed independently (good cleanup regardless). Phase 0d (error types) can also land independently, but MUST precede the adapter layer.

Decision gate: After Phase 0d, create rust-toolchain.toml with the nightly pin and verify asupersync = "0.2" compiles with tls-native-roots on macOS. Then run behavioral smoke tests in a throwaway binary or integration test:

  1. TLS validation: HTTPS GET to a public endpoint (e.g., https://gitlab.com/api/v4/version) succeeds with valid cert.
  2. DNS resolution: Request using hostname (not IP) resolves correctly.
  3. Redirect handling: GET to a 301/302 endpoint — verify the adapter returns the redirect status (not an opaque error) so call sites can decide whether to follow.
  4. Timeout behavior: Request to a slow/non-responsive endpoint times out within the configured duration.
  5. Connection pooling: 4 sequential requests to the same host reuse connections (verify via debug logging or ss/netstat).

If compilation fails or any behavioral test reveals a showstopper (e.g., TLS doesn't work on macOS, timeouts don't fire), stop and evaluate the tokio CancellationToken fallback before investing in Phases 1-3.

Compile-only gating is insufficient — this migration's failure modes are semantic (HTTP behavior parity), not just syntactic.

Phases 1-3 must land together (removing reqwest requires both the adapter AND the new runtime). Phases 4-5 are cleanup that can be incremental.


Rollback Strategy

If the migration stalls or asupersync proves unviable after partial completion:

  • Phase 0a-0c completed: No rollback needed. These are independently valuable cleanup regardless of runtime choice.
  • Phase 0d completed: GitLabNetworkError { detail } is runtime-agnostic. Keep it.
  • Phases 1-3 partially completed: These must land atomically. If any phase in 1-3 fails, revert the entire atomic commit. The adapter layer (Phase 1) imports asupersync types, so it cannot exist without the runtime.
  • Full rollback to tokio: If asupersync is abandoned entirely, the fallback path is tokio + CancellationToken + JoinSet (see "Why not tokio CancellationToken + JoinSet?" above). The adapter layer design is still valid — swap asupersync::http for reqwest behind the same crate::http::Client API.
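A sketch of that escape hatch: because call sites depend only on the adapter's public surface, the backend swap is confined to one module. The trait and stub backends below are illustrative stand-ins, not the real src/http.rs adapter:

```rust
// Hypothetical seam: the only thing call sites see is Client, so the
// backend can be swapped without touching src/gitlab or src/embedding.
trait HttpBackend {
    fn get(&self, url: &str) -> Result<String, String>;
}

struct AsupersyncBackend; // would wrap asupersync's HTTP client
struct ReqwestBackend;    // fallback: would wrap reqwest + tokio

impl HttpBackend for AsupersyncBackend {
    fn get(&self, url: &str) -> Result<String, String> {
        Ok(format!("asupersync GET {url}")) // stub response
    }
}

impl HttpBackend for ReqwestBackend {
    fn get(&self, url: &str) -> Result<String, String> {
        Ok(format!("reqwest GET {url}")) // stub response
    }
}

// The public Client type call sites use; swapping the backend is a
// one-line change here, invisible everywhere else.
struct Client(Box<dyn HttpBackend>);

impl Client {
    fn get(&self, url: &str) -> Result<String, String> {
        self.0.get(url)
    }
}

fn main() {
    let primary = Client(Box::new(AsupersyncBackend));
    let fallback = Client(Box::new(ReqwestBackend));
    assert!(primary.get("https://example.test").is_ok());
    assert!(fallback.get("https://example.test").is_ok());
}
```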

Decision point: the gate after Phase 0d (compile check + behavioral smoke tests, see Execution Order) is the cheapest abort point. If TLS or nightly issues surface there, stop and evaluate the tokio fallback before committing to Phases 1-3.

Concrete escape hatch triggers (abandon asupersync, fall back to tokio + CancellationToken + JoinSet):

  1. Nightly breakage > 7 days: If the pinned nightly breaks and no newer nightly restores compilation within 7 days, abort.
  2. TLS incompatibility: If tls-native-roots cannot validate certificates on macOS (system CA store) and tls-webpki-roots also fails, abort.
  3. API instability: If asupersync releases a breaking change to HttpClient, region(), or Cx APIs before our migration is complete, evaluate migration cost. If > 2 days of rework, abort.
  4. Wiremock incompatibility: If keeping wiremock tests on tokio while production runs asupersync causes test failures or flaky behavior that cannot be resolved in 1 day, abort.

Risks

| Risk | Severity | Mitigation |
| --- | --- | --- |
| asupersync pre-1.0 API changes | High | Adapter layer isolates call sites. Pin exact version. |
| Nightly Rust breakage | Medium-High | Pin nightly date in rust-toolchain.toml. CI tests on nightly. Coupling runtime + toolchain migration amplifies risk — escape hatch triggers defined in Rollback Strategy. |
| TLS cert issues on macOS | Medium | Test early in Phase 5. Fallback: tls-webpki-roots (Mozilla bundle). |
| Connection pool behavior under load | Medium | Stress test with join_all of 8+ concurrent requests in Phase 5. |
| async-stream nightly compat | Low | Widely used crate, likely fine. Fallback: manual Stream impl. |
| Build time increase | Low | Measure before/after. asupersync may be heavier than tokio. |
| Reqwest behavioral drift | Medium | reqwest has implicit redirect/proxy/compression handling. Audit each (see Phase 5 table). GitLab API doesn't redirect, so low actual risk. |
| Partial ingestion on cancel | Medium | Region cancellation can fire between HTTP fetch and DB write. Verify transaction boundaries align with region scope (see Phase 5). |
| Unbounded response body buffering | Low | Adapter buffers full response bodies. Mitigated by 64 MiB size guard in adapter execute(). |
| Manual URL/header handling correctness | Low-Medium | append_query_params and case-insensitive header scans replicate reqwest behavior manually. Mitigated by unit tests for edge cases (existing query params, fragments, repeated keys, case folding). |