Files

Taylor Eernisse 8dc479e515 docs: add lore who command design plan with 8 iterations of review feedback

Design document for `lore who` — a people intelligence query layer over
existing GitLab data (280K notes, 210K discussions, 33K DiffNotes, 53
participants). Answers five collaboration questions: expert lookup by
file/path, workload summary, review pattern analysis, active discussion
tracking, and file overlap detection.

Key design decisions refined across 8 feedback iterations:
- All SQL is fully static (no format!()) with prepare_cached() throughout
- Exact vs prefix path matching via PathQuery struct (two static SQL variants)
- Self-review exclusion (author != reviewer) on all DiffNote branches
- Deterministic output: sorted GROUP_CONCAT results, stable tie-breakers
- Bounded payloads with *_total/*_truncated metadata for robot consumers
- Truncation transparency via LIMIT+1 overflow detection pattern
- Robot JSON includes resolved_input for reproducibility (since_mode tri-state)
- Multi-project correctness with project-qualified entity references
- Composite migration indexes designed for query selectivity on hot paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-07 21:35:05 -05:00

20 KiB

Raw Permalink Blame History

Below are the highest-leverage revisions I’d make for iteration 8, staying within your MVP constraints (static SQL, no scope creep into new data sources), but tightening correctness, index utilization predictability, debuggability, and output safety.

Fix the semantic bug in since_was_default (Workload mode) by introducing since_mode Why this is better

Right now since_was_default = args.since.is_none() is misleading for Workload, because Workload has no default window (it’s “unbounded unless explicitly filtered”). In robot mode, this creates incorrect intent replay and ambiguity.

Replace the boolean with a tri-state:

since_mode: "default" | "explicit" | "none"

Keep since_was_default only if you want backward compatibility, but compute it as since_mode == "default".

Patch diff Copy code diff --git a/who-command-design.md b/who-command-design.md --- a/who-command-design.md +++ b/who-command-design.md @@ -5. Robot-first reproducibility. Robot JSON output includes both a raw input object (echoing CLI args) and a resolved_input object (computed since_ms, since_iso, since_was_default, resolved project_id + project_path, effective mode, limit) so agents can trace exactly what ran and reproduce it precisely. +5. Robot-first reproducibility. Robot JSON output includes both a raw input object (echoing CLI args) and a resolved_input object (computed since_ms, since_iso, since_mode, resolved project_id + project_path, effective mode, limit) so agents can trace exactly what ran and reproduce it precisely. @@ pub struct WhoResolvedInput { pub mode: String, pub project_id: Option, pub project_path: Option, pub since_ms: Option, pub since_iso: Option,

pub since_was_default: bool,

/// "default" (mode default applied), "explicit" (user provided --since), "none" (no window)
pub since_mode: String, pub limit: usize, } @@

let since_was_default = args.since.is_none();

// since_mode semantics:
// - expert/reviews/active/overlap: default window applies if args.since is None
// - workload: no default window; args.since None => "none"
let since_mode_for_defaulted = if args.since.is_some() { "explicit" } else { "default" };
let since_mode_for_workload = if args.since.is_some() { "explicit" } else { "none" }; @@ WhoMode::Expert { path } => { let since_ms = resolve_since(args.since.as_deref(), "6m")?; let result = query_expert(&conn, path, project_id, since_ms, args.limit)?; Ok(WhoRun { resolved_input: WhoResolvedInput { mode: "expert".to_string(), project_id, project_path, since_ms: Some(since_ms), since_iso: Some(ms_to_iso(since_ms)),

```
               since_was_default,
```

               since_mode: since_mode_for_defaulted.to_string(),
               limit: args.limit,
           },
           result: WhoResult::Expert(result),
       })
   }

@@ WhoMode::Workload { username } => { let since_ms = args .since .as_deref() .map(|s| resolve_since_required(s)) .transpose()?; let result = query_workload(&conn, username, project_id, since_ms, args.limit)?; Ok(WhoRun { resolved_input: WhoResolvedInput { mode: "workload".to_string(), project_id, project_path, since_ms, since_iso: since_ms.map(ms_to_iso),

```
               since_was_default,
```

               since_mode: since_mode_for_workload.to_string(),
               limit: args.limit,
           },
           result: WhoResult::Workload(result),
       })
   }

@@ fn print_who_json(run: &WhoRun, args: &WhoArgs, elapsed_ms: u64) { @@ let resolved_input = serde_json::json!({ "mode": run.resolved_input.mode, "project_id": run.resolved_input.project_id, "project_path": run.resolved_input.project_path, "since_ms": run.resolved_input.since_ms, "since_iso": run.resolved_input.since_iso,

   "since_was_default": run.resolved_input.since_was_default,

   "since_mode": run.resolved_input.since_mode,
   "limit": run.resolved_input.limit,

}); }

Stop using nullable-OR ((? IS NULL OR col = ?)) where it determines the “right” index (Active is the big one) Why this is better

Your global vs project-scoped Active indexes are correct, but the nullable binding pattern undermines them because SQLite’s planner can’t assume whether ?2 is NULL at prepare time. Result: it can pick a “good enough for both” plan, which is often the wrong one for -p.

Fix: keep SQL static, but use two static statements selected at runtime (like you already do for exact vs prefix path matching).

Patch diff Copy code diff --git a/who-command-design.md b/who-command-design.md --- a/who-command-design.md +++ b/who-command-design.md @@ -1. Lean on existing infrastructure. Use (?N IS NULL OR ...) nullable binding pattern (already used in timeline_seed.rs) instead of dynamic SQL string assembly. +1. Lean on existing infrastructure. Prefer (?N IS NULL OR ...) nullable binding for optional filters unless it materially changes index choice. In those cases, select between two static SQL strings at runtime (no format!()), e.g. Active mode uses separate global vs project-scoped statements to ensure the intended index is used. @@ fn query_active( conn: &Connection, project_id: Option, since_ms: i64, limit: usize, ) -> Result { let limit_plus_one = (limit + 1) as i64;

// Total unresolved count
let total_sql =
```
   "SELECT COUNT(*) FROM discussions d
```

    WHERE d.resolvable = 1 AND d.resolved = 0

```
      AND d.last_note_at >= ?1
```

      AND (?2 IS NULL OR d.project_id = ?2)";

// Total unresolved count (two static variants to avoid nullable-OR planner ambiguity)
let total_sql_global =
```
   "SELECT COUNT(*) FROM discussions d
```

    WHERE d.resolvable = 1 AND d.resolved = 0

```
      AND d.last_note_at >= ?1";
```
let total_sql_scoped =
```
   "SELECT COUNT(*) FROM discussions d
```

    WHERE d.resolvable = 1 AND d.resolved = 0

```
      AND d.last_note_at >= ?1
```
```
      AND d.project_id = ?2";
```

let total_unresolved: u32 =

   conn.query_row(total_sql, rusqlite::params![since_ms, project_id], |row| row.get(0))?;

let total_unresolved: u32 = match project_id {

   None => conn.query_row(total_sql_global, rusqlite::params![since_ms], |row| row.get(0))?,

   Some(pid) => conn.query_row(total_sql_scoped, rusqlite::params![since_ms, pid], |row| row.get(0))?,

};

let sql = "

let sql_global = " WITH picked AS ( SELECT d.id, d.noteable_type, d.issue_id, d.merge_request_id, d.project_id, d.last_note_at FROM discussions d WHERE d.resolvable = 1 AND d.resolved = 0 AND d.last_note_at >= ?1

         AND (?2 IS NULL OR d.project_id = ?2)
       ORDER BY d.last_note_at DESC
       LIMIT ?2
   ),

@@ ORDER BY p.last_note_at DESC ";

let mut stmt = conn.prepare_cached(sql)?;
let discussions: Vec = stmt

   .query_map(rusqlite::params![since_ms, project_id, limit_plus_one], |row| {

let sql_scoped = "
```
   WITH picked AS (
```

       SELECT d.id, d.noteable_type, d.issue_id, d.merge_request_id,

              d.project_id, d.last_note_at

```
       FROM discussions d
```

       WHERE d.resolvable = 1 AND d.resolved = 0

```
         AND d.last_note_at >= ?1
```
```
         AND d.project_id = ?2
```
```
       ORDER BY d.last_note_at DESC
```
```
       LIMIT ?3
```
```
   ),
```
```
   note_counts AS (
```

       SELECT n.discussion_id, COUNT(*) AS note_count

```
       FROM notes n
```

       JOIN picked p ON p.id = n.discussion_id

```
       WHERE n.is_system = 0
```
```
       GROUP BY n.discussion_id
```
```
   ),
```
```
   participants AS (
```

       SELECT x.discussion_id, GROUP_CONCAT(x.author_username, X'1F') AS participants

```
       FROM (
```

           SELECT DISTINCT n.discussion_id, n.author_username

```
           FROM notes n
```

           JOIN picked p ON p.id = n.discussion_id

           WHERE n.is_system = 0 AND n.author_username IS NOT NULL

```
       ) x
```
```
       GROUP BY x.discussion_id
```
```
   )
```
```
   SELECT
```
```
       p.id AS discussion_id,
```
```
       p.noteable_type,
```

       COALESCE(i.iid, m.iid) AS entity_iid,

       COALESCE(i.title, m.title) AS entity_title,

```
       proj.path_with_namespace,
```
```
       p.last_note_at,
```

       COALESCE(nc.note_count, 0) AS note_count,

       COALESCE(pa.participants, '') AS participants

```
   FROM picked p
```

   JOIN projects proj ON p.project_id = proj.id

   LEFT JOIN issues i ON p.issue_id = i.id

   LEFT JOIN merge_requests m ON p.merge_request_id = m.id

   LEFT JOIN note_counts nc ON nc.discussion_id = p.id

   LEFT JOIN participants pa ON pa.discussion_id = p.id

```
   ORDER BY p.last_note_at DESC
```
";
let discussions: Vec = match project_id {
```
   None => {
```

       let mut stmt = conn.prepare_cached(sql_global)?;

       stmt.query_map(rusqlite::params![since_ms, limit_plus_one], |row| {

```
           /* unchanged row mapping */
```

       })?.collect::<std::result::Result<Vec<_>, _>>()?

```
   }
```
```
   Some(pid) => {
```

       let mut stmt = conn.prepare_cached(sql_scoped)?;

       stmt.query_map(rusqlite::params![since_ms, pid, limit_plus_one], |row| {

```
           /* unchanged row mapping */
```

       })?.collect::<std::result::Result<Vec<_>, _>>()?

```
   }
```
};

Also update Verification to explicitly check both variants:

diff Copy code @@

Performance verification (required before merge):

@@ sqlite3 path/to/db.sqlite " EXPLAIN QUERY PLAN SELECT d.id, d.last_note_at FROM discussions d WHERE d.resolvable = 1 AND d.resolved = 0 AND d.last_note_at >= 0 ORDER BY d.last_note_at DESC LIMIT 20; "

Expected: SEARCH discussions USING INDEX idx_discussions_unresolved_recent_global

+sqlite3 path/to/db.sqlite "

EXPLAIN QUERY PLAN
SELECT d.id, d.last_note_at
FROM discussions d
WHERE d.resolvable = 1 AND d.resolved = 0
AND d.project_id = 1
AND d.last_note_at >= 0
ORDER BY d.last_note_at DESC
LIMIT 20; +" +# Expected: SEARCH discussions USING INDEX idx_discussions_unresolved_recent

Add repo-path normalization (eliminate trivial “no results” footguns) Why this is better

People paste:

./src/foo/

/src/foo/

src\foo\bar.rs (Windows) These currently lead to silent misses.

Normalize only user input (not DB content):

trim whitespace

strip leading ./ and /

convert \ → / when present

collapse repeated //

Patch diff Copy code diff --git a/who-command-design.md b/who-command-design.md --- a/who-command-design.md +++ b/who-command-design.md @@ fn resolve_mode<'a>(args: &'a WhoArgs) -> Result<WhoMode<'a>> { @@

if let Some(p) = &args.path {

   return Ok(WhoMode::Expert { path: p });

if let Some(p) = &args.path {
```
   let norm = normalize_repo_path(p);
```

   return Ok(WhoMode::Expert { path: Box::leak(norm.into_boxed_str()) });

} @@

if let Some(path) = &args.overlap {

   return Ok(WhoMode::Overlap { path });

if let Some(path) = &args.overlap {

   let norm = normalize_repo_path(path);

   return Ok(WhoMode::Overlap { path: Box::leak(norm.into_boxed_str()) });

} @@

```
   if target.contains('/') {
```

       return Ok(WhoMode::Expert { path: target });

```
   if target.contains('/') {
```

       let norm = normalize_repo_path(target);

       return Ok(WhoMode::Expert { path: Box::leak(norm.into_boxed_str()) });
   }

@@ } + +/// Normalize user-supplied repo paths to match stored DiffNote paths. +/// - trims whitespace +/// - strips leading "./" and "/" (repo-relative) +/// - converts '' to '/' (Windows paste) +/// - collapses repeated slashes +fn normalize_repo_path(input: &str) -> String {

let mut s = input.trim().to_string();
if s.contains('\') && !s.contains('/') {
```
   s = s.replace('\\', "/");
```
}
while s.starts_with("./") {

   s = s.trim_start_matches("./").to_string();

}
while s.starts_with('/') {

   s = s.trim_start_matches('/').to_string();

}
while s.contains("//") {
```
   s = s.replace("//", "/");
```
}
s +}

(Add a small test block for normalization; even 2–3 asserts catch regressions.)

Make path matching observable: include path_match (exact vs prefix) in results/JSON Why this is better

You’ve made path classification smarter (heuristics + two-way probe). That’s great, but without visibility you’ll get “why did it treat this as a directory?” confusion. Exposing match metadata is low cost and hugely helps debugging.

Patch diff Copy code diff --git a/who-command-design.md b/who-command-design.md --- a/who-command-design.md +++ b/who-command-design.md @@ -struct PathQuery {

/// The parameter value to bind.
value: String,
/// If true: use LIKE value ESCAPE '\'. If false: use = value.
is_prefix: bool, -} +struct PathQuery {

/// User input after normalization (no trailing slash stripping yet).
input: String,
/// Trimmed path without trailing '/' used for exact/prefix construction.
normalized: String,
/// The SQL parameter bound to the statement (foo/bar or foo/bar/%).
sql_value: String,
/// If true: use LIKE sql_value ESCAPE '\'. If false: use = normalized.
is_prefix: bool, +} @@

let trimmed = path.trim_end_matches('/');

let input = normalize_repo_path(path);
let trimmed = input.trim_end_matches('/').to_string(); @@

```
   Ok(PathQuery {
```
```
       value: trimmed.to_string(),
```
```
       is_prefix: false,
```
```
   })
```

   Ok(PathQuery { input, normalized: trimmed.clone(), sql_value: trimmed, is_prefix: false })

} else {

```
   Ok(PathQuery {
```
```
       value: format!("{escaped}/%"),
```
```
       is_prefix: true,
```
```
   })
```

   Ok(PathQuery { input, normalized: trimmed.clone(), sql_value: format!("{escaped}/%"), is_prefix: true })

} @@ pub struct ExpertResult { pub path_query: String,

pub path_match: String, // "exact" or "prefix" pub experts: Vec, pub truncated: bool, } @@ pub struct OverlapResult { pub path_query: String,
pub path_match: String, // "exact" or "prefix" pub users: Vec, pub truncated: bool, } @@ fn query_expert(...) -> Result { let pq = build_path_query(conn, path, project_id)?; @@ Ok(ExpertResult { path_query: path.to_string(),
```
   path_match: if pq.is_prefix { "prefix".to_string() } else { "exact".to_string() },
   experts,
   truncated,
```
}) } @@ fn query_overlap(...) -> Result { let pq = build_path_query(conn, path, project_id)?; @@ Ok(OverlapResult { path_query: path.to_string(),

   path_match: if pq.is_prefix { "prefix".to_string() } else { "exact".to_string() },
   users,
   truncated,

}) } @@ fn expert_to_json(r: &ExpertResult) -> serde_json::Value { serde_json::json!({ "path_query": r.path_query,

```
   "path_match": r.path_match,
   "truncated": r.truncated,
   "experts": ...
```
}) } @@ fn overlap_to_json(r: &OverlapResult) -> serde_json::Value { serde_json::json!({ "path_query": r.path_query,

   "path_match": r.path_match,
   "truncated": r.truncated,
   "users": ...

}) }

Human output can add a single dim hint line:

(matching exact file) or (matching directory prefix)

Put a hard upper bound on --limit at the CLI boundary Why this is better

You already bounded nested arrays (participants, mr_refs), but top-level lists are still user-unbounded. A single --limit 50000 can:

generate huge JSON payloads

blow up downstream agent pipelines

create slow queries / memory spikes

Clamp it before execution. A max of 500 is usually plenty; even 200 is fine.

Patch diff Copy code diff --git a/who-command-design.md b/who-command-design.md --- a/who-command-design.md +++ b/who-command-design.md @@ pub struct WhoArgs { @@

/// Maximum results per section
#[arg(short = 'n', long = "limit", default_value = "20", help_heading = "Output")]

/// Maximum results per section (bounded for output safety)
#[arg(
```
   short = 'n',
```
```
   long = "limit",
```
```
   default_value = "20",
```

   value_parser = clap::value_parser!(u16).range(1..=500),

```
   help_heading = "Output"
```
)] pub limit: usize, } @@ -11. Bounded payloads. Robot JSON must never emit unbounded arrays ... +11. Bounded payloads. Robot JSON must never emit unbounded arrays ...
Top-level result set size is also bounded via --limit (1..=500) to prevent runaway payloads.

Clarify Active “unresolved count” semantics (window vs total) Why this is better

total_unresolved currently means “unresolved within the time window”. The human header prints “Active Discussions (X unresolved)” which can easily be misread as “total unresolved overall”.

Small rename avoids confusion, no new behavior.

Patch diff Copy code diff --git a/who-command-design.md b/who-command-design.md --- a/who-command-design.md +++ b/who-command-design.md @@ pub struct ActiveResult { pub discussions: Vec,

pub total_unresolved: u32,

pub total_unresolved_in_window: u32, pub truncated: bool, } @@

println!(
```
   "{}",
```
```
   style(format!(
```

       "Active Discussions ({} unresolved)",

```
       r.total_unresolved
```
```
   ))
```
```
   .bold()
```
);

println!("{}", style(format!(

   "Active Discussions ({} unresolved in window)",

```
   r.total_unresolved_in_window
```
)).bold());

(If you later want global total, add a second count query—but I’d keep MVP lean.)

Tighten statement cache behavior: avoid preparing both SQL variants when not needed Why this is better

You already use prepare_cached(), but as you add more “two static variants” (exact/prefix; scoped/unscoped), it’s easy to accidentally prepare multiple statements per invocation.

Codify: select variant first, then prepare exactly one.

This is mostly a plan hygiene change (helps future you keep perf predictable).

Patch (plan-level emphasis) diff Copy code diff --git a/who-command-design.md b/who-command-design.md --- a/who-command-design.md +++ b/who-command-design.md @@ -1. Lean on existing infrastructure. ... +1. Lean on existing infrastructure. ...

When multiple static SQL variants exist (exact/prefix; scoped/unscoped), always:
(a) resolve which variant applies, then (b) prepare_cached() exactly one statement.

Net effect (what you gain)

Correct robot semantics (since_mode) without breaking your static-SQL/agent-first contract.

Guaranteed intended index usage for Active global vs scoped queries (the nullable-OR planner pitfall is real).

Fewer “why no results?” surprises via path normalization.

Better debugging (path match introspection) with essentially no runtime cost.

Output safety even when users/agents misconfigure --limit.

Less ambiguous UX around “unresolved” counts.

If you want a single “most important” change to ship before iteration 8 locks: #2 (Active query variants) and #1 (since semantics) are the two that prevent the most painful, hard-to-diagnose failures.

20 KiB Raw Permalink Blame History Unescape Escape

Performance verification (required before merge):

Expected: SEARCH discussions USING INDEX idx_discussions_unresolved_recent_global

20 KiB

Raw Permalink Blame History