docs(plans): expand asupersync migration with decision gates, rollback, and invariants
Major additions to the migration plan based on review feedback:

Alternative analysis:
- Add "Why not tokio CancellationToken + JoinSet?" section explaining why obligation tracking and single-migration cost favor asupersync over incremental tokio fixes.

Error handling depth:
- Add NetworkErrorKind enum design for preserving error categories (timeout, DNS, TLS, connection refused) without coupling LoreError to any HTTP client.
- Add response body size guard (64 MiB) to prevent unbounded memory growth from misconfigured endpoints.

Adapter layer refinements:
- Expand append_query_params with URL fragment handling, edge case docs, and doc comments.
- Add contention constraint note for std::sync::Mutex rate limiter.

Cancellation invariants (INV-1 through INV-4):
- Atomic batch writes, no .await between tx open/commit, ShutdownSignal + region cancellation complementarity.
- Concrete test plan for each invariant.

Semantic ordering concerns:
- Document 4 behavioral differences when replacing join_all with region-spawned tasks (ordering, error aggregation, backpressure, late result loss on cancellation).

HTTP behavior parity:
- Replace informational table with concrete acceptance criteria and pass/fail tests for redirects, proxy, keep-alive, DNS, TLS, and Content-Length.

Phasing refinements:
- Add Cx threading sub-steps (orchestration path first, then command/embedding layer) for blast radius reduction.
- Add decision gate between Phase 0d and Phase 1 requiring compile + behavioral smoke tests before committing to runtime swap.

Rollback strategy:
- Per-phase rollback guidance with concrete escape hatch triggers (nightly breakage > 7d, TLS incompatibility, API instability, wiremock issues).

Testing depth:
- Adapter-layer test gap analysis with 5 specific asupersync-native integration tests.
- Cancellation integration test specifications.
- Coverage gap documentation for wiremock-on-tokio tests.

Risk register additions:
- Unbounded response body buffering, manual URL/header handling correctness.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -22,6 +22,15 @@ Asupersync is a cancel-correct async runtime with region-owned tasks, obligation

- Pre-1.0 runtime dependency (mitigated by adapter layer + version pinning)
- Deeper function signature changes for Cx threading

### Why not tokio CancellationToken + JoinSet?

The core problems (Ctrl+C drops requests, no structured cancellation) *can* be fixed without replacing the runtime. Tokio's `CancellationToken` + `JoinSet` + explicit task tracking gives structured cancellation for fan-out patterns. This was considered and rejected for two reasons:

1. **Obligation tracking is the real win.** CancellationToken/JoinSet fix the "cancel cleanly" problem but don't give us obligation tracking (compile-time proof that all spawned work is awaited) or deterministic lab testing. These are the features that prevent *future* concurrency bugs, not just the current Ctrl+C issue.
2. **Separation of concerns.** Fixing Ctrl+C with tokio primitives first, then migrating the runtime second, doubles the migration effort (rewrite fan-out twice). Since we have no users and no backwards compatibility concerns, a single clean migration is lower total cost.

If asupersync proves unviable (nightly breakage, API instability), the fallback is exactly this: tokio + CancellationToken + JoinSet.

---

## Current Tokio Usage Inventory
@@ -103,6 +112,8 @@ let delay = self.rate_limiter.lock().expect("rate limiter poisoned").check_delay

Note: `.expect()` over `.unwrap()` for clarity. Poisoning is near-impossible here (the critical section is a trivial `Instant::now()` check), but the explicit message aids debugging if it ever fires.

**Contention constraint:** `std::sync::Mutex` blocks the executor thread while held. This is safe *only* because the critical section is a single `Instant::now()` comparison with no I/O. If the rate limiter ever grows to include async work (HTTP calls, DB queries), it must move back to an async-aware lock. Document this constraint with a comment at the lock site.
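A minimal sketch of the pattern this constraint describes (the type and method names are illustrative, not gitlore's actual rate limiter): the lock guards only a timestamp comparison, and any sleeping happens after the guard is dropped.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Illustrative sketch of the constraint above (not gitlore's actual limiter):
/// the critical section is a single timestamp comparison, nothing else.
struct RateLimiter {
    min_interval: Duration,
    // CONSTRAINT: never hold this lock across I/O or an `.await` point.
    // If async work is ever needed inside, switch back to an async-aware lock.
    last_request: Mutex<Option<Instant>>,
}

impl RateLimiter {
    fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_request: Mutex::new(None) }
    }

    /// Returns how long the caller should sleep before the next request.
    /// The guard is dropped before any sleeping happens.
    fn check_delay(&self) -> Duration {
        let mut last = self.last_request.lock().expect("rate limiter poisoned");
        let now = Instant::now();
        let delay = match *last {
            Some(prev) => self.min_interval.saturating_sub(now.duration_since(prev)),
            None => Duration::ZERO,
        };
        *last = Some(now);
        delay
    }
}
```

The caller does the actual sleeping (`asupersync::time::sleep(delay).await` in the migrated code) with the lock already released.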
### 0c. Replace tokio::join! with futures::join!

In `gitlab/client.rs:729,736`. `futures::join!` is runtime-agnostic and already in deps.
@@ -116,7 +127,7 @@ In `gitlab/client.rs:729,736`. `futures::join!` is runtime-agnostic and already

## Phase 0d: Error Type Migration (must precede adapter layer)

The adapter layer (Phase 1) uses `GitLabNetworkError { detail: Option<String> }`, which requires this error type change before the adapter compiles. Placed here so Phases 1-3 compile as a unit.
### `src/core/error.rs`

@@ -137,6 +148,33 @@ GitLabNetworkError {

The adapter layer stringifies HTTP client errors at the boundary so `LoreError` doesn't depend on any HTTP client's error types. This also means the existing reqwest call sites that construct `GitLabNetworkError` must be updated to pass `detail: Some(format!("{e:?}"))` instead of `source: Some(e)` -- but those sites are rewritten in Phase 2 anyway, so no extra work.
**Note on error granularity:** Flattening all HTTP errors to `detail: Option<String>` loses the distinction between timeouts, TLS failures, DNS resolution failures, and connection resets. To preserve actionable error categories without coupling `LoreError` to any HTTP client, add a lightweight `NetworkErrorKind` enum:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NetworkErrorKind {
    Timeout,
    ConnectionRefused,
    DnsResolution,
    Tls,
    Other,
}

#[error("Cannot connect to GitLab at {base_url}")]
GitLabNetworkError {
    base_url: String,
    kind: NetworkErrorKind,
    detail: Option<String>,
},
```

The adapter's `execute()` method classifies errors at the boundary:

- Timeout from `asupersync::time::timeout` → `NetworkErrorKind::Timeout`
- Transport errors from the HTTP client → classified by error type into the appropriate kind
- Unknown errors → `NetworkErrorKind::Other`

This keeps `LoreError` client-agnostic while preserving the ability to make retry decisions based on error *type* (e.g., retry on timeout but not on TLS). The adapter's `execute()` method is the single place where this mapping happens, so adding new kinds is localized.
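The real mapping depends on asupersync's transport error type, which this plan doesn't pin down. As an illustration only, here is the shape of the classifier and a retry-policy helper using `std::io::Error` as a stand-in input; `classify_io_error` and `should_retry` are hypothetical names, and the enum is repeated so the sketch is self-contained.

```rust
use std::io;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NetworkErrorKind {
    Timeout,
    ConnectionRefused,
    DnsResolution,
    Tls,
    Other,
}

/// Illustrative classifier over `std::io::Error`; the real adapter would
/// match on asupersync's transport error type instead.
fn classify_io_error(e: &io::Error) -> NetworkErrorKind {
    match e.kind() {
        io::ErrorKind::TimedOut => NetworkErrorKind::Timeout,
        io::ErrorKind::ConnectionRefused => NetworkErrorKind::ConnectionRefused,
        // Resolver and TLS failures often arrive as opaque `Other` errors;
        // message matching is a pragmatic (if brittle) fallback.
        _ if e.to_string().contains("failed to lookup address") => NetworkErrorKind::DnsResolution,
        _ if e.to_string().to_lowercase().contains("certificate") => NetworkErrorKind::Tls,
        _ => NetworkErrorKind::Other,
    }
}

/// Retry decision localized next to the classification, as described above:
/// transient network conditions retry, handshake failures do not.
fn should_retry(kind: NetworkErrorKind) -> bool {
    matches!(kind, NetworkErrorKind::Timeout | NetworkErrorKind::ConnectionRefused)
}
```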
---

## Phase 1: Build the HTTP Adapter Layer
@@ -232,9 +270,14 @@ impl Client {

let raw = timeout(self.timeout, self.inner.request(method, url, header_tuples, body))
    .await
    .map_err(|_| LoreError::GitLabNetworkError {
        base_url: url.to_string(),
        kind: NetworkErrorKind::Timeout,
        detail: Some(format!("Request timed out after {:?}", self.timeout)),
    })?
    .map_err(|e| LoreError::GitLabNetworkError {
        base_url: url.to_string(),
        kind: classify_transport_error(&e),
        detail: Some(format!("{e:?}")),
    })?;
@@ -280,6 +323,13 @@ impl Response {
    }
}

/// Appends query parameters to a URL.
///
/// Edge cases handled:
/// - URLs with existing `?query` → appends with `&`
/// - URLs with `#fragment` → inserts query before fragment
/// - Empty params → returns URL unchanged
/// - Repeated keys → preserved as-is (GitLab API uses repeated `labels[]`)
fn append_query_params(url: &str, params: &[(&str, String)]) -> String {
    if params.is_empty() {
        return url.to_string();
@@ -289,14 +339,43 @@ fn append_query_params(url: &str, params: &[(&str, String)]) -> String {
        .map(|(k, v)| format!("{}={}", urlencoding::encode(k), urlencoding::encode(v)))
        .collect::<Vec<_>>()
        .join("&");

    // Preserve URL fragments: split on '#', insert query, rejoin
    let (base, fragment) = match url.split_once('#') {
        Some((b, f)) => (b, Some(f)),
        None => (url, None),
    };
    let with_query = if base.contains('?') {
        format!("{base}&{query}")
    } else {
        format!("{base}?{query}")
    };
    match fragment {
        Some(f) => format!("{with_query}#{f}"),
        None => with_query,
    }
}
```
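The documented edge cases are cheap to pin down with unit tests. A self-contained sketch follows: the `encode` helper is a minimal stand-in for `urlencoding::encode` (adequate for these inputs only), and the function body mirrors the adapter code above.

```rust
/// Minimal stand-in for `urlencoding::encode`, good enough for ASCII inputs.
fn encode(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            c if c.is_ascii_alphanumeric() || "-_.~".contains(c) => c.to_string(),
            c => format!("%{:02X}", c as u32), // percent-encode everything else
        })
        .collect()
}

/// Copy of the adapter's helper, with the encoder swapped for the stand-in.
fn append_query_params(url: &str, params: &[(&str, String)]) -> String {
    if params.is_empty() {
        return url.to_string();
    }
    let query = params
        .iter()
        .map(|(k, v)| format!("{}={}", encode(k), encode(v)))
        .collect::<Vec<_>>()
        .join("&");
    // Preserve URL fragments: split on '#', insert query, rejoin
    let (base, fragment) = match url.split_once('#') {
        Some((b, f)) => (b, Some(f)),
        None => (url, None),
    };
    let with_query = if base.contains('?') {
        format!("{base}&{query}")
    } else {
        format!("{base}?{query}")
    };
    match fragment {
        Some(f) => format!("{with_query}#{f}"),
        None => with_query,
    }
}
```

Each assertion below exercises one documented edge case: empty params, existing query, fragment preservation, and repeated keys.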
### Response body size guard

The adapter buffers entire response bodies in memory (`Vec<u8>`). A misconfigured endpoint or unexpected redirect to a large file could cause unbounded memory growth. Add a max response body size check in `execute()`:

```rust
const MAX_RESPONSE_BODY_BYTES: usize = 64 * 1024 * 1024; // 64 MiB — generous for JSON, catches runaways

// In execute(), after receiving raw response:
if raw.body.len() > MAX_RESPONSE_BODY_BYTES {
    return Err(LoreError::Other(format!(
        "Response body too large: {} bytes (max {})",
        raw.body.len(),
        MAX_RESPONSE_BODY_BYTES,
    )));
}
```

This is a safety net, not a tight constraint. GitLab JSON responses are typically < 1 MiB. Ollama embedding responses are < 100 KiB per batch. The 64 MiB limit catches runaways without interfering with normal operation.
### Timeout behavior

Every request is wrapped with `asupersync::time::timeout(self.timeout, ...)`. Default timeouts:

@@ -498,6 +577,8 @@ pub async fn install_ctrl_c_handler(cx: &Cx, signal: ShutdownSignal) {
}
```

**Cleanup concern:** `std::process::exit(130)` on second Ctrl+C bypasses all drop guards, flush operations, and asupersync region cleanup. This is intentional (user demanded hard exit) but means any in-progress DB transaction will be abandoned mid-write. SQLite's journaling makes this safe (uncommitted transactions are rolled back on next open), but verify this holds for WAL mode if enabled. Consider logging a warning before exit so users understand incomplete operations may need re-sync.
### 3e. Rate limiter sleep

```rust

@@ -545,26 +626,56 @@ cx.region(|scope| async {
let prefetched_batch: Vec<_> = rx.into_iter().collect();
```

**IMPORTANT: Semantic differences beyond ordering.** Replacing `join_all` with region-spawned tasks changes four behaviors:
1. **Ordering:** `join_all` preserves input order — `results[i]` corresponds to `futures[i]`. The `std::sync::mpsc` channel pattern does NOT (results arrive in completion order). If downstream logic assumes positional alignment (e.g., zipping results with input items by index), this is a silent correctness bug. Options:
   - Send `(index, result)` tuples through the channel and sort by index after collection.
   - If `scope.spawn()` returns a `JoinHandle<T>`, collect handles in order and await them sequentially.

2. **Error aggregation:** `join_all` runs all futures to completion even if some fail, collecting all results. Region-spawned tasks with a channel will also run all tasks, but if the region is cancelled mid-flight (e.g., Ctrl+C), some results are lost. Decide per call site: should partial results be processed, or should the entire batch be retried?

3. **Backpressure:** `join_all` with N futures creates N concurrent tasks. Region-spawned tasks behave similarly, but if the region has concurrency limits, backpressure semantics change. Verify asupersync's region API does not impose implicit concurrency caps.

4. **Late result loss on cancellation:** When a region is cancelled, tasks that have completed but whose results haven't been received yet may have already sent to the channel. However, tasks that are mid-flight will be dropped, and their results never sent. The channel receiver must drain whatever was sent, but the caller must treat a cancelled region's results as incomplete — never assume all N results arrived. Document per call site whether partial results are safe to process or whether the entire batch should be discarded on cancellation.

Audit every `join_all` call site for all four assumptions before choosing the pattern.

Note: The exact result-collection pattern depends on asupersync's region API. If `scope.spawn()` returns a `JoinHandle<T>`, prefer collecting handles and awaiting them (preserves ordering and simplifies error handling).
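The index-tagging option from item 1 can be sketched with std primitives. `thread::spawn` here only simulates region-spawned tasks completing out of order; the real code would use `scope.spawn`, and `fan_out` is a hypothetical name.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Fan out work while preserving input order in the returned results.
/// Each worker sends an `(index, result)` tuple; the collector sorts by
/// index so `results[i]` corresponds to `inputs[i]` again.
fn fan_out(inputs: Vec<u64>) -> Vec<u64> {
    let (tx, rx) = mpsc::channel();
    let handles: Vec<_> = inputs
        .into_iter()
        .enumerate()
        .map(|(i, delay_ms)| {
            let tx = tx.clone();
            // Stand-in for `scope.spawn(...)`: tasks finish in arbitrary order.
            thread::spawn(move || {
                thread::sleep(Duration::from_millis(delay_ms));
                tx.send((i, delay_ms * 2)).expect("collector alive"); // (index, result)
            })
        })
        .collect();
    drop(tx); // close the channel so receiver iteration terminates
    for h in handles {
        h.join().expect("worker panicked");
    }
    // Results arrived in completion order; sort by index to restore input order.
    let mut tagged: Vec<(usize, u64)> = rx.into_iter().collect();
    tagged.sort_by_key(|&(i, _)| i);
    tagged.into_iter().map(|(_, r)| r).collect()
}
```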
This is the biggest payoff: if Ctrl+C fires during a prefetch batch, the region cancels all in-flight HTTP requests with bounded cleanup instead of silently dropping them.

**Estimated signature changes:** ~15 functions gain a `cx: &Cx` parameter.

---
**Phasing the Cx threading (risk reduction):** Rather than threading `cx` through all ~15 functions at once, split into two steps:

- **Step 1:** Thread `cx` through the orchestration path only (`main.rs` dispatch → `run_sync`/`run_ingest` → orchestrator functions). This is where region-wrapping `join_all` batches happens — the actual cancellation payoff. Verify invariants pass.
- **Step 2:** Widen to the command layer and embedding pipeline (`run_embed`, `embed_documents`, `embed_batch_group`, `sync_surgical`). These are lower-risk since they don't have the same fan-out patterns.

This reduces the blast radius of Step 1 and provides an earlier validation checkpoint. If Step 1 surfaces problems, Step 2 hasn't been started yet.
---

## Phase 4: Test Migration
### Keep on `#[tokio::test]` (wiremock tests -- 42 tests)

No changes. `tokio` is in `[dev-dependencies]` with `features = ["rt", "macros"]`.

**Coverage gap:** These tests validate protocol correctness (request format, response parsing, status code handling, pagination) through the adapter layer, but they do NOT exercise asupersync's runtime behavior (timeouts, connection pooling, cancellation). This is acceptable because:

1. Protocol correctness is the higher-value test target — it catches most regressions
2. Runtime-specific behavior is covered by the new cancellation integration tests (below)
3. The adapter layer is thin enough that runtime differences are unlikely to affect request/response semantics
**Adapter-layer test gap:** The 42 wiremock tests validate protocol correctness (request format, response parsing) but run on tokio, not asupersync. This means the adapter's actual behavior under the production runtime is untested by mocked-response tests. To close this gap, add 3-5 asupersync-native integration tests that exercise the adapter against a simple HTTP server (e.g., `hyper` or a raw TCP listener) rather than wiremock:

1. **GET with headers + JSON response** — verify header passing and JSON deserialization through the adapter.
2. **POST with JSON body** — verify Content-Type injection and body serialization.
3. **429 + Retry-After** — verify the adapter surfaces rate-limit responses correctly.
4. **Timeout** — verify the adapter's `asupersync::time::timeout` wrapper fires.
5. **Large response rejection** — verify the body size guard triggers.

These tests are cheap to write (~50 LOC each) and close the "works on tokio but does it work on asupersync?" gap that GPT 5.3 flagged.
| File | Tests |
|------|-------|
| `gitlab/graphql_tests.rs` | 30 |

@@ -582,9 +693,19 @@ No changes. `tokio` is in `[dev-dependencies]` with `features = ["rt", "macros"]

No changes needed.
### New: Cancellation integration tests (asupersync-native)

Wiremock tests on tokio validate protocol/serialization correctness but cannot test asupersync's cancellation and region semantics. Add asupersync-native integration tests for:

1. **Ctrl+C during fan-out:** Simulate cancellation mid-batch in orchestrator. Verify all in-flight tasks are drained, no task leaks, no obligation leaks.
2. **Region quiescence:** Verify that after a region completes (normal or cancelled), no background tasks remain running.
3. **Transaction integrity under cancellation:** Cancel during an ingestion batch that has fetched data but not yet written to DB. Verify no partial data is committed.

These tests use asupersync's deterministic lab runtime, which is one of the primary motivations for this migration.
---

## Phase 5: Verify and Harden

### Verification checklist

@@ -603,25 +724,39 @@ cargo test

4. **Pagination streams** -- Do `async_stream::stream!` pagination generators work unchanged?
5. **Wiremock test isolation** -- Do wiremock tests pass with tokio only in dev-deps?
### HTTP behavior parity acceptance criteria

reqwest provides several implicit behaviors that asupersync's h1 client may not. Each must pass a concrete acceptance test before the migration is considered complete:

| reqwest default | Acceptance criterion | Pass/Fail test |
|-----------------|---------------------|----------------|
| Automatic redirect following (up to 10) | If GitLab returns 3xx, gitlore must not silently lose the response. Either follow the redirect or surface a clear error. | Send a request to wiremock returning 301 → verify adapter returns the redirect status (not an opaque failure) |
| Automatic gzip/deflate decompression | Not required — JSON responses are small. | N/A (no test needed) |
| Proxy from `HTTP_PROXY`/`HTTPS_PROXY` env | If `HTTP_PROXY` is set, requests must route through it. If asupersync lacks proxy support, document this as a known limitation. | Set `HTTP_PROXY=http://127.0.0.1:9999` → verify connection attempt targets the proxy, or document that proxy is unsupported |
| Connection keep-alive | Pagination batches (4-8 sequential requests to same host) must reuse connections. | Measure with `ss`/`netstat`: 8 paginated requests should use ≤2 TCP connections |
| System DNS resolution | Hostnames must resolve via OS resolver. | Verify `lore sync` works against a hostname (not just IP) |
| Request body Content-Length | POST requests must include Content-Length header (some proxies/WAFs require it). | Inspect outgoing request headers in wiremock test |
| TLS certificate validation | HTTPS requests must validate server certificates using system CA store. | Verify `lore sync` succeeds against production GitLab (valid cert) and fails against self-signed cert |
### Cancellation + DB transaction invariants

Region-based cancellation stops HTTP tasks cleanly, but partial ingestion can leave the database in an inconsistent state if cancellation fires between "fetched data" and "wrote to DB". The following invariants must hold and be tested:

**INV-1: Atomic batch writes.** Each ingestion batch (issues, MRs, discussions) writes to the DB inside a single `unchecked_transaction()`. If the transaction is not committed, no partial data from that batch is visible. This is already the case for most ingestion paths — audit all paths and fix any that write outside a transaction.

**INV-2: Region cancellation cannot corrupt committed data.** A cancelled region may abandon in-flight HTTP requests, but it must not interrupt a DB transaction mid-write. This holds naturally because SQLite transactions are synchronous (not async) — once `tx.execute()` starts, it runs to completion on the current thread regardless of task cancellation. Verify this assumption holds for WAL mode.

**Hard rule: no `.await` between transaction open and commit/rollback.** Cancellation can fire at any `.await` point. If an `.await` exists between `unchecked_transaction()` and `tx.commit()`, a cancelled region could drop the transaction guard mid-batch, rolling back partial writes silently. Audit all ingestion paths to confirm this invariant holds. If any path must do async work mid-transaction (e.g., fetching related data), restructure to fetch-then-write: complete all async work first, then open the transaction, write synchronously, and commit.
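A sketch of the fetch-then-write restructure, using in-memory stand-ins (`Db`, `Tx`, and `ingest_batch` are illustrative names, not gitlore's actual types): all cancellable work finishes before the transaction opens, and the open-to-commit window is purely synchronous.

```rust
// Illustrative in-memory stand-ins for the real SQLite types; the point is
// the ordering, not the storage.
struct Db { rows: Vec<String> }

struct Tx<'a> { db: &'a mut Db, staged: Vec<String> }

impl Db {
    // Stand-in for `unchecked_transaction()`: stage writes, commit atomically.
    fn transaction(&mut self) -> Tx<'_> {
        Tx { db: self, staged: Vec::new() }
    }
}

impl<'a> Tx<'a> {
    fn execute(&mut self, row: &str) { self.staged.push(row.to_string()); }
    fn commit(self) { self.db.rows.extend(self.staged); } // synchronous, all or nothing
}

/// Fetch-then-write: `fetch_page` models the async HTTP phase (cancellable);
/// the write phase is purely synchronous (cancellation-safe).
fn ingest_batch(db: &mut Db, fetch_page: impl Fn(u32) -> Vec<String>) {
    // Phase 1: finish ALL async work first. A cancellation here loses nothing.
    let items: Vec<String> = (0..2).flat_map(|page| fetch_page(page)).collect();

    // Phase 2: open, write, commit with no `.await` in between (the hard rule).
    let mut tx = db.transaction();
    for item in &items {
        tx.execute(item);
    }
    tx.commit(); // either every row from the batch lands, or none (INV-1)
}
```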
**INV-3: No partial batch visibility.** If cancellation fires after fetching N items but before the batch transaction commits, zero items from that batch are persisted. The next sync picks up where it left off using cursor-based pagination.

**INV-4: ShutdownSignal + region cancellation are complementary.** The existing `ShutdownSignal` check-before-write pattern in orchestrator loops (`if signal.is_cancelled() { break; }`) remains the first line of defense. Region cancellation is the second — it ensures in-flight HTTP tasks are drained even if the orchestrator loop has already moved past the signal check. Both mechanisms must remain active.

**Test plan for invariants:**

- INV-1: Cancellation integration test — cancel mid-batch, verify DB has zero partial rows from that batch
- INV-2: Verify `unchecked_transaction()` commit is not interruptible by task cancellation (lab runtime test)
- INV-3: Cancel after fetch, re-run sync, verify no duplicates and no gaps
- INV-4: Verify both ShutdownSignal and region cancellation are triggered on Ctrl+C
---

@@ -660,6 +795,9 @@ Phase 0a-0c (prep, safe, independent)
Phase 0d (error type migration -- required before adapter compiles)
    |
    v
DECISION GATE: verify nightly + asupersync + tls-native-roots compile AND behavioral smoke tests pass
    |
    v
Phase 1 (adapter layer, compiles but unused) ----+
    |                                            |
    v                                            | These 3 are one
@@ -669,16 +807,47 @@ Phase 2 (migrate 3 HTTP modules to adapter) ------+ atomic commit
Phase 3 (swap runtime, Cx threading) ------------+
    |
    v
Phase 4 (test migration)
    |
    v
Phase 5 (verify + harden)
```
Phase 0a-0c can be committed independently (good cleanup regardless).
Phase 0d (error types) can also land independently, but MUST precede the adapter layer.

**Decision gate:** After Phase 0d, create `rust-toolchain.toml` with nightly pin and verify `asupersync = "0.2"` compiles with `tls-native-roots` on macOS. Then run behavioral smoke tests in a throwaway binary or integration test:

1. **TLS validation:** HTTPS GET to a public endpoint (e.g., `https://gitlab.com/api/v4/version`) succeeds with valid cert.
2. **DNS resolution:** Request using hostname (not IP) resolves correctly.
3. **Redirect handling:** GET to a 301/302 endpoint — verify the adapter returns the redirect status (not an opaque error) so call sites can decide whether to follow.
4. **Timeout behavior:** Request to a slow/non-responsive endpoint times out within the configured duration.
5. **Connection pooling:** 4 sequential requests to the same host reuse connections (verify via debug logging or `ss`/`netstat`).

If compilation fails or any behavioral test reveals a showstopper (e.g., TLS doesn't work on macOS, timeouts don't fire), stop and evaluate the tokio CancellationToken fallback before investing in Phases 1-3.

Compile-only gating is insufficient — this migration's failure modes are semantic (HTTP behavior parity), not just syntactic.

Phases 1-3 must land together (removing reqwest requires both the adapter AND the new runtime).
Phases 4-5 are cleanup that can be incremental.
---

## Rollback Strategy

If the migration stalls or asupersync proves unviable after partial completion:

- **Phase 0a-0c completed:** No rollback needed. These are independently valuable cleanup regardless of runtime choice.
- **Phase 0d completed:** `GitLabNetworkError { detail }` is runtime-agnostic. Keep it.
- **Phases 1-3 partially completed:** These must land atomically. If any phase in 1-3 fails, revert the entire atomic commit. The adapter layer (Phase 1) imports asupersync types, so it cannot exist without the runtime.
- **Full rollback to tokio:** If asupersync is abandoned entirely, the fallback path is tokio + `CancellationToken` + `JoinSet` (see "Why not tokio CancellationToken + JoinSet?" above). The adapter layer design is still valid — swap `asupersync::http` for `reqwest` behind the same `crate::http::Client` API.

**Decision point:** After Phase 0 is complete, verify asupersync compiles on the pinned nightly with `tls-native-roots` before committing to Phases 1-3. If TLS or nightly issues surface, stop and evaluate the tokio fallback.

**Concrete escape hatch triggers (abandon asupersync, fall back to tokio + CancellationToken + JoinSet):**

1. **Nightly breakage > 7 days:** If the pinned nightly breaks and no newer nightly restores compilation within 7 days, abort.
2. **TLS incompatibility:** If `tls-native-roots` cannot validate certificates on macOS (system CA store) and `tls-webpki-roots` also fails, abort.
3. **API instability:** If asupersync releases a breaking change to `HttpClient`, `region()`, or `Cx` APIs before our migration is complete, evaluate migration cost. If > 2 days of rework, abort.
4. **Wiremock incompatibility:** If keeping wiremock tests on tokio while production runs asupersync causes test failures or flaky behavior that cannot be resolved in 1 day, abort.
---
|
||||
|
||||
@@ -687,10 +856,12 @@ Phases 5-6 are cleanup that can be incremental.
|
||||
| Risk | Severity | Mitigation |
|
||||
|------|----------|------------|
|
||||
| asupersync pre-1.0 API changes | High | Adapter layer isolates call sites. Pin exact version. |
|
||||
| Nightly Rust breakage | Medium | Pin nightly date in rust-toolchain.toml. CI tests on nightly. |
|
||||
| TLS cert issues on macOS | Medium | Test early in Phase 6. Fallback: `tls-webpki-roots` (Mozilla bundle). |
|
||||
| Connection pool behavior under load | Medium | Stress test with `join_all` of 8+ concurrent requests in Phase 6. |
|
||||
| Nightly Rust breakage | Medium-High | Pin nightly date in rust-toolchain.toml. CI tests on nightly. Coupling runtime + toolchain migration amplifies risk — escape hatch triggers defined in Rollback Strategy. |
|
||||
| TLS cert issues on macOS | Medium | Test early in Phase 5. Fallback: `tls-webpki-roots` (Mozilla bundle). |
|
||||
| Connection pool behavior under load | Medium | Stress test with `join_all` of 8+ concurrent requests in Phase 5. |
|
||||
| async-stream nightly compat | Low | Widely used crate, likely fine. Fallback: manual Stream impl. |
|
||||
| Build time increase | Low | Measure before/after. asupersync may be heavier than tokio. |
|
||||
| Reqwest behavioral drift | Medium | reqwest has implicit redirect/proxy/compression handling. Audit each (see Phase 6 table). GitLab API doesn't redirect, so low actual risk. |
|
||||
| Partial ingestion on cancel | Medium | Region cancellation can fire between HTTP fetch and DB write. Verify transaction boundaries align with region scope (see Phase 6). |
|
||||
| Reqwest behavioral drift | Medium | reqwest has implicit redirect/proxy/compression handling. Audit each (see Phase 5 table). GitLab API doesn't redirect, so low actual risk. |
|
||||
| Partial ingestion on cancel | Medium | Region cancellation can fire between HTTP fetch and DB write. Verify transaction boundaries align with region scope (see Phase 5). |
|
||||
| Unbounded response body buffering | Low | Adapter buffers full response bodies. Mitigated by 64 MiB size guard in adapter `execute()`. |
|
||||
| Manual URL/header handling correctness | Low-Medium | `append_query_params` and case-insensitive header scans replicate reqwest behavior manually. Mitigated by unit tests for edge cases (existing query params, fragments, repeated keys, case folding). |
|
||||
|
||||