fix: Savepoint leak in embedding pipeline, atomic fail_job, RRF dedup

Three correctness fixes found during peer code review:

Embedding pipeline savepoint leak (HIGH severity):
The SAVEPOINT embed_page / RELEASE embed_page pattern had ~10 `?`
propagation points between them. Any error from record_embedding_error,
clear_document_embeddings, or store_embedding would exit the function
without rolling back, leaving the SQLite connection in a broken
transactional state and causing cascading failures for the rest of the
session. Fixed by extracting page processing into `embed_page()` and
wrapping with explicit rollback-on-error handling.
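The commit does not include the pipeline code itself, but the control-flow shape of the fix can be sketched with a stub connection (the `Conn` type, SQL strings, and `embed_page` body below are illustrative, not the project's actual code):

```rust
// Stub connection that just records the SQL it is asked to run.
struct Conn {
    log: Vec<String>,
}

impl Conn {
    fn execute(&mut self, sql: &str) -> Result<(), String> {
        self.log.push(sql.to_string());
        Ok(())
    }
}

// Fallible per-page work, standing in for clear_document_embeddings /
// store_embedding / record_embedding_error. Any `?` here exits early.
fn embed_page(conn: &mut Conn, fail: bool) -> Result<(), String> {
    conn.execute("DELETE FROM embeddings WHERE page_id = 1")?;
    if fail {
        return Err("model error".into());
    }
    conn.execute("INSERT INTO embeddings VALUES (1)")
}

// Wrapper: every early exit from embed_page still rolls the savepoint
// back AND releases it, so the connection leaves in a clean state.
fn process_page(conn: &mut Conn, fail: bool) -> Result<(), String> {
    conn.execute("SAVEPOINT embed_page")?;
    match embed_page(conn, fail) {
        Ok(()) => conn.execute("RELEASE embed_page"),
        Err(e) => {
            conn.execute("ROLLBACK TO embed_page")?;
            conn.execute("RELEASE embed_page")?;
            Err(e)
        }
    }
}
```

The key point is that the ~10 `?` propagation points all live inside `embed_page`, so the wrapper is the single place that has to pair every SAVEPOINT with a ROLLBACK/RELEASE.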

Dependent queue fail_job race (MEDIUM severity):
fail_job performed a SELECT followed by a separate UPDATE on the
attempts counter without a transaction. Under concurrent lock
reclamation, the attempts value could be read stale. Replaced with a
single atomic UPDATE that increments attempts and computes the exponential
backoff entirely in SQL, which also halves the DB round-trips. Added an
explicit error when the job no longer exists.

RRF duplicate document score inflation (MEDIUM severity):
If a retriever returned the same document_id multiple times, the RRF
score accumulated multiple rank contributions while the rank only
recorded the first occurrence. Moved the score accumulation inside the
`if is_none` guard so only the first occurrence per list contributes.
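A minimal sketch of the corrected fusion loop (function name, signature, and the `k` constant are assumptions; the commit's RRF code is not shown here):

```rust
use std::collections::{HashMap, HashSet};

/// Reciprocal Rank Fusion where only the FIRST occurrence of a
/// document_id within each ranked list contributes to its score.
fn rrf_fuse(lists: &[Vec<&str>], k: f64) -> HashMap<String, f64> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        let mut seen: HashSet<&str> = HashSet::new();
        for (rank, doc_id) in list.iter().enumerate() {
            // insert() returns false for a duplicate id, so a retriever
            // that repeats a document cannot inflate its fused score.
            if seen.insert(*doc_id) {
                *scores.entry((*doc_id).to_string()).or_insert(0.0) +=
                    1.0 / (k + rank as f64 + 1.0);
            }
        }
    }
    scores
}
```

Note that duplicates still occupy their rank position; they are simply skipped, matching a fix that records only the first occurrence per list.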

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Taylor Eernisse
2026-02-04 14:16:38 -05:00
parent 266ed78e73
commit 1fdc6d03cc
3 changed files with 279 additions and 235 deletions


@@ -122,28 +122,30 @@ pub fn complete_job(conn: &Connection, job_id: i64) -> Result<()> {
 /// Mark a job as failed. Increments attempts, sets next_retry_at with exponential
 /// backoff, clears locked_at, and records the error.
 ///
-/// Backoff: 30s * 2^(attempts-1), capped at 480s.
+/// Backoff: 30s * 2^(attempts), capped at 480s. Uses a single atomic UPDATE
+/// to avoid a read-then-write race on the `attempts` counter.
 pub fn fail_job(conn: &Connection, job_id: i64, error: &str) -> Result<()> {
     let now = now_ms();
-    // Get current attempts (propagate error if job no longer exists)
-    let current_attempts: i32 = conn.query_row(
-        "SELECT attempts FROM pending_dependent_fetches WHERE id = ?1",
-        rusqlite::params![job_id],
-        |row| row.get(0),
-    )?;
-    let new_attempts = current_attempts + 1;
-    let backoff_ms: i64 = (30_000i64 * (1i64 << (new_attempts - 1).min(4))).min(480_000);
-    let next_retry = now + backoff_ms;
-    conn.execute(
+    // Atomic increment + backoff calculation in one UPDATE.
+    // MIN(attempts, 4) caps the shift to prevent overflow; the overall
+    // backoff is clamped to 480 000 ms via MIN(..., 480000).
+    let changes = conn.execute(
         "UPDATE pending_dependent_fetches
-         SET attempts = ?1, next_retry_at = ?2, locked_at = NULL, last_error = ?3
-         WHERE id = ?4",
-        rusqlite::params![new_attempts, next_retry, error, job_id],
+         SET attempts = attempts + 1,
+             next_retry_at = ?1 + MIN(30000 * (1 << MIN(attempts, 4)), 480000),
+             locked_at = NULL,
+             last_error = ?2
+         WHERE id = ?3",
+        rusqlite::params![now, error, job_id],
     )?;
+    if changes == 0 {
+        return Err(crate::core::error::LoreError::Other(
+            "fail_job: job not found (may have been reclaimed or completed)".into(),
+        ));
+    }
     Ok(())
 }
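The backoff schedule produced by the SQL expression can be sanity-checked with a plain-Rust mirror (a sketch for verification only, not code from the commit; `attempts` here is the pre-increment value, which is what SQLite's UPDATE sees on the right-hand side):

```rust
// Mirror of: MIN(30000 * (1 << MIN(attempts, 4)), 480000)
// attempts is the value BEFORE the atomic increment.
fn backoff_ms(attempts: i64) -> i64 {
    (30_000 * (1i64 << attempts.min(4))).min(480_000)
}
```

This yields the documented 30s, 60s, 120s, 240s, 480s progression, flat at 480s thereafter; the inner MIN caps the shift so large `attempts` values cannot overflow.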