Design document for `lore who` — a people intelligence query layer over existing GitLab data (280K notes, 210K discussions, 33K DiffNotes, 53 participants). Answers five collaboration questions: expert lookup by file/path, workload summary, review pattern analysis, active discussion tracking, and file overlap detection. Key design decisions refined across 8 feedback iterations: - All SQL is fully static (no format!()) with prepare_cached() throughout - Exact vs prefix path matching via PathQuery struct (two static SQL variants) - Self-review exclusion (author != reviewer) on all DiffNote branches - Deterministic output: sorted GROUP_CONCAT results, stable tie-breakers - Bounded payloads with *_total/*_truncated metadata for robot consumers - Truncation transparency via LIMIT+1 overflow detection pattern - Robot JSON includes resolved_input for reproducibility (since_mode tri-state) - Multi-project correctness with project-qualified entity references - Composite migration indexes designed for query selectivity on hot paths Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
471 lines
15 KiB
Markdown
471 lines
15 KiB
Markdown
Proposed revisions (Iteration 6)
|
||
|
||
Below are the highest-leverage changes I’d make on top of your current Iteration 5 plan, with rationale and git-diff style edits to the plan text/snippets.
|
||
|
||
1) Fix a real edge case: dotless non-root files (src/Dockerfile, infra/Makefile, etc.)
|
||
Why
|
||
|
||
Your current build_path_query() treats dotless last segments as directories (prefix match) unless the path is root. That misclassifies legitimate dotless files inside directories and silently produces path/% (zero hits or wrong hits).
|
||
|
||
Best minimal fix: keep your static SQL approach, but add a DB existence probe (static SQL) for path queries:
|
||
|
||
If user didn’t force directory (/), and exact path exists in DiffNotes, treat as exact =.
|
||
|
||
Otherwise use prefix LIKE 'dir/%'.
|
||
|
||
This avoids new CLI flags, avoids heuristics lists, and uses your existing partial index (idx_notes_diffnote_path_created) efficiently.
|
||
|
||
Diff
|
||
diff
|
||
Copy code
|
||
diff --git a/Plan.md b/Plan.md
|
||
@@
|
||
-struct PathQuery {
|
||
+struct PathQuery {
|
||
/// The parameter value to bind.
|
||
value: String,
|
||
/// If true: use `LIKE value ESCAPE '\'`. If false: use `= value`.
|
||
is_prefix: bool,
|
||
}
|
||
|
||
-/// Build a path query from a user-supplied path.
|
||
+/// Build a path query from a user-supplied path, with a DB probe for dotless files.
|
||
@@
|
||
-fn build_path_query(path: &str) -> PathQuery {
|
||
+fn build_path_query(conn: &Connection, path: &str) -> Result<PathQuery> {
|
||
let trimmed = path.trim_end_matches('/');
|
||
let last_segment = trimmed.rsplit('/').next().unwrap_or(trimmed);
|
||
let is_root = !trimmed.contains('/');
|
||
- let is_file = !path.ends_with('/') && (is_root || last_segment.contains('.'));
|
||
+ let forced_dir = path.ends_with('/');
|
||
+ let looks_like_file = !forced_dir && (is_root || last_segment.contains('.'));
|
||
+
|
||
+ // If it doesn't "look like a file" but the exact path exists in DiffNotes,
|
||
+ // treat as exact (handles src/Dockerfile, infra/Makefile, etc.).
|
||
+ let exact_exists = if !looks_like_file && !forced_dir {
|
||
+ conn.query_row(
|
||
+ "SELECT 1
|
||
+ FROM notes
|
||
+ WHERE note_type = 'DiffNote'
|
||
+ AND is_system = 0
|
||
+ AND position_new_path = ?1
|
||
+ LIMIT 1",
|
||
+ rusqlite::params![trimmed],
|
||
+ |_| Ok(()),
|
||
+ ).is_ok()
|
||
+ } else {
|
||
+ false
|
||
+ };
|
||
+
|
||
+ let is_file = looks_like_file || exact_exists;
|
||
|
||
if is_file {
|
||
PathQuery {
|
||
value: trimmed.to_string(),
|
||
is_prefix: false,
|
||
}
|
||
} else {
|
||
let escaped = escape_like(trimmed);
|
||
PathQuery {
|
||
value: format!("{escaped}/%"),
|
||
is_prefix: true,
|
||
}
|
||
}
|
||
}
|
||
|
||
|
||
Also update callers:
|
||
|
||
diff
|
||
Copy code
|
||
@@
|
||
- let pq = build_path_query(path);
|
||
+ let pq = build_path_query(conn, path)?;
|
||
@@
|
||
- let pq = build_path_query(path);
|
||
+ let pq = build_path_query(conn, path)?;
|
||
|
||
|
||
And tests:
|
||
|
||
diff
|
||
Copy code
|
||
@@
|
||
- fn test_build_path_query() {
|
||
+ fn test_build_path_query() {
|
||
@@
|
||
- // Dotless root file -> exact match (root path without '/')
|
||
+ // Dotless root file -> exact match (root path without '/')
|
||
let pq = build_path_query("Makefile");
|
||
assert_eq!(pq.value, "Makefile");
|
||
assert!(!pq.is_prefix);
|
||
+
|
||
+ // Dotless file in subdir should become exact if DB contains it (probe)
|
||
+ // (set up: insert one DiffNote with position_new_path = "src/Dockerfile")
|
||
|
||
2) Make “reviewer” semantics correct: exclude MR authors commenting on their own diffs
|
||
Why
|
||
|
||
Right now, Overlap (and Expert reviewer branch) will count MR authors as “reviewers” if they leave DiffNotes in their own MR (clarifications / replies), inflating A+R and contaminating “who reviewed here” signals.
|
||
|
||
You already enforce this in --reviews mode (m.author_username != ?1). Apply the same principle consistently:
|
||
|
||
Reviewer branch: only count notes where n.author_username != m.author_username (when both non-NULL).
|
||
|
||
Diff (Overlap reviewer branch)
|
||
diff
|
||
Copy code
|
||
@@
|
||
- WHERE n.note_type = 'DiffNote'
|
||
+ WHERE n.note_type = 'DiffNote'
|
||
AND n.position_new_path LIKE ?1 ESCAPE '\\'
|
||
AND n.is_system = 0
|
||
AND n.author_username IS NOT NULL
|
||
+ AND (m.author_username IS NULL OR n.author_username != m.author_username)
|
||
AND n.created_at >= ?2
|
||
AND (?3 IS NULL OR n.project_id = ?3)
|
||
|
||
|
||
Same change for sql_exact.
|
||
|
||
3) Expert mode scoring: align units + reduce single-MR “comment storms”
|
||
Why
|
||
|
||
Expert currently mixes units:
|
||
|
||
reviewer side: DiffNote count
|
||
|
||
author side: distinct MR count
|
||
|
||
That makes score noisy and can crown “someone who wrote 30 comments on one MR” as top expert.
|
||
|
||
Fix: make both sides primarily MR-breadth:
|
||
|
||
reviewer: COUNT(DISTINCT m.id) as review_mr_count
|
||
|
||
author: COUNT(DISTINCT m.id) as author_mr_count
|
||
Optionally keep review_note_count as a secondary intensity signal (but not the main driver).
|
||
|
||
Diff (types + SQL)
|
||
diff
|
||
Copy code
|
||
@@
|
||
pub struct Expert {
|
||
pub username: String,
|
||
- pub score: f64,
|
||
- pub review_count: u32,
|
||
- pub author_count: u32,
|
||
+ pub score: i64,
|
||
+ pub review_mr_count: u32,
|
||
+ pub review_note_count: u32,
|
||
+ pub author_mr_count: u32,
|
||
pub last_active_ms: i64,
|
||
}
|
||
|
||
|
||
Reviewer branch now joins to MR so it can count distinct MRs and exclude self-comments:
|
||
|
||
diff
|
||
Copy code
|
||
@@
|
||
- SELECT
|
||
- n.author_username AS username,
|
||
- 'reviewer' AS role,
|
||
- COUNT(*) AS cnt,
|
||
- MAX(n.created_at) AS last_active_at
|
||
- FROM notes n
|
||
+ SELECT
|
||
+ n.author_username AS username,
|
||
+ 'reviewer' AS role,
|
||
+ COUNT(DISTINCT m.id) AS mr_cnt,
|
||
+ COUNT(*) AS note_cnt,
|
||
+ MAX(n.created_at) AS last_active_at
|
||
+ FROM notes n
|
||
+ JOIN discussions d ON n.discussion_id = d.id
|
||
+ JOIN merge_requests m ON d.merge_request_id = m.id
|
||
WHERE n.note_type = 'DiffNote'
|
||
AND n.is_system = 0
|
||
AND n.author_username IS NOT NULL
|
||
+ AND (m.author_username IS NULL OR n.author_username != m.author_username)
|
||
AND n.position_new_path LIKE ?1 ESCAPE '\\'
|
||
AND n.created_at >= ?2
|
||
AND (?3 IS NULL OR n.project_id = ?3)
|
||
GROUP BY n.author_username
|
||
|
||
|
||
Update author branch payload to match shape:
|
||
|
||
diff
|
||
Copy code
|
||
@@
|
||
SELECT
|
||
m.author_username AS username,
|
||
'author' AS role,
|
||
- COUNT(DISTINCT m.id) AS cnt,
|
||
+ COUNT(DISTINCT m.id) AS mr_cnt,
|
||
+ 0 AS note_cnt,
|
||
MAX(n.created_at) AS last_active_at
|
||
|
||
|
||
Aggregate:
|
||
|
||
diff
|
||
Copy code
|
||
@@
|
||
SELECT
|
||
username,
|
||
- SUM(CASE WHEN role = 'reviewer' THEN cnt ELSE 0 END) AS review_count,
|
||
- SUM(CASE WHEN role = 'author' THEN cnt ELSE 0 END) AS author_count,
|
||
+ SUM(CASE WHEN role = 'reviewer' THEN mr_cnt ELSE 0 END) AS review_mr_count,
|
||
+ SUM(CASE WHEN role = 'reviewer' THEN note_cnt ELSE 0 END) AS review_note_count,
|
||
+ SUM(CASE WHEN role = 'author' THEN mr_cnt ELSE 0 END) AS author_mr_count,
|
||
MAX(last_active_at) AS last_active_at,
|
||
- (SUM(CASE WHEN role = 'reviewer' THEN cnt ELSE 0 END) * 3.0) +
|
||
- (SUM(CASE WHEN role = 'author' THEN cnt ELSE 0 END) * 2.0) AS score
|
||
+ (
|
||
+ (SUM(CASE WHEN role = 'reviewer' THEN mr_cnt ELSE 0 END) * 20) +
|
||
+ (SUM(CASE WHEN role = 'author' THEN mr_cnt ELSE 0 END) * 12) +
|
||
+ (SUM(CASE WHEN role = 'reviewer' THEN note_cnt ELSE 0 END) * 1)
|
||
+ ) AS score
|
||
|
||
|
||
Human header:
|
||
|
||
diff
|
||
Copy code
|
||
@@
|
||
- style("Reviews").bold(),
|
||
- style("Authored").bold(),
|
||
+ style("Reviewed(MRs)").bold(),
|
||
+ style("Notes").bold(),
|
||
+ style("Authored(MRs)").bold(),
|
||
|
||
4) Deterministic output: participants + MR refs + tie-breakers
|
||
Why
|
||
|
||
You’ve correctly focused on reproducibility (resolved_input), but you still have nondeterministic lists:
|
||
|
||
participants: GROUP_CONCAT order is undefined → vector order changes run-to-run.
|
||
|
||
mr_refs: you dedup via HashSet then iterate → undefined order.
|
||
|
||
user sorting in overlap is missing stable tie-breakers.
|
||
|
||
This is a real “robot mode flake” source.
|
||
|
||
Diff (Active participants sort)
|
||
diff
|
||
Copy code
|
||
@@
|
||
- let participants: Vec<String> = participants_csv
|
||
+ let mut participants: Vec<String> = participants_csv
|
||
.as_deref()
|
||
.filter(|s| !s.is_empty())
|
||
.map(|csv| csv.split('\x1F').map(String::from).collect())
|
||
.unwrap_or_default();
|
||
+ participants.sort(); // stable, deterministic
|
||
|
||
Diff (Overlap MR refs sort + stable user sort)
|
||
diff
|
||
Copy code
|
||
@@
|
||
- users.sort_by(|a, b| b.touch_count.cmp(&a.touch_count));
|
||
+ users.sort_by(|a, b| {
|
||
+ b.touch_count.cmp(&a.touch_count)
|
||
+ .then_with(|| b.last_touch_at.cmp(&a.last_touch_at))
|
||
+ .then_with(|| a.username.cmp(&b.username))
|
||
+ });
|
||
@@
|
||
- entry.mr_refs = set.into_iter().collect();
|
||
+ let mut v: Vec<String> = set.into_iter().collect();
|
||
+ v.sort();
|
||
+ entry.mr_refs = v;
|
||
|
||
5) Make --limit actionable: surface truncation explicitly (human + robot)
|
||
Why
|
||
|
||
Agents (and humans) need to know if results were cut off so they can rerun with a bigger -n.
|
||
Right now there’s no signal.
|
||
|
||
Minimal pattern: query limit + 1, set truncated = true if you got > limit, then truncate.
|
||
|
||
Diff (result types)
|
||
diff
|
||
Copy code
|
||
@@
|
||
pub struct ExpertResult {
|
||
pub path_query: String,
|
||
pub experts: Vec<Expert>,
|
||
+ pub truncated: bool,
|
||
}
|
||
@@
|
||
pub struct ActiveResult {
|
||
pub discussions: Vec<ActiveDiscussion>,
|
||
pub total_unresolved: u32,
|
||
+ pub truncated: bool,
|
||
}
|
||
@@
|
||
pub struct OverlapResult {
|
||
pub path_query: String,
|
||
pub users: Vec<OverlapUser>,
|
||
+ pub truncated: bool,
|
||
}
|
||
|
||
Diff (query pattern example)
|
||
diff
|
||
Copy code
|
||
@@
|
||
- let limit_i64 = limit as i64;
|
||
+ let limit_plus_one = (limit + 1) as i64;
|
||
@@
|
||
- LIMIT ?4
|
||
+ LIMIT ?4
|
||
@@
|
||
- rusqlite::params![pq.value, since_ms, project_id, limit_i64],
|
||
+ rusqlite::params![pq.value, since_ms, project_id, limit_plus_one],
|
||
@@
|
||
- Ok(ExpertResult {
|
||
+ let truncated = experts.len() > limit;
|
||
+ let experts = experts.into_iter().take(limit).collect();
|
||
+ Ok(ExpertResult {
|
||
path_query: path.to_string(),
|
||
experts,
|
||
+ truncated,
|
||
})
|
||
|
||
|
||
Human output hint:
|
||
|
||
diff
|
||
Copy code
|
||
@@
|
||
if r.experts.is_empty() { ... }
|
||
+ if r.truncated {
|
||
+ println!(" {}", style("(showing first -n; rerun with a higher --limit)").dim());
|
||
+ }
|
||
|
||
|
||
Robot output field:
|
||
|
||
diff
|
||
Copy code
|
||
@@
|
||
fn expert_to_json(r: &ExpertResult) -> serde_json::Value {
|
||
serde_json::json!({
|
||
"path_query": r.path_query,
|
||
+ "truncated": r.truncated,
|
||
"experts": ...
|
||
})
|
||
}
|
||
|
||
6) Overlap merge hot loop: avoid repeated HashSet rebuild per row
|
||
Why
|
||
|
||
This line is expensive in a UNION result with many rows:
|
||
|
||
rust
|
||
Copy code
|
||
let mut set: HashSet<String> = entry.mr_refs.drain(..).collect();
|
||
|
||
|
||
It reallocates and rehashes every time.
|
||
|
||
Fix: store an accumulator with HashSet during merge, convert once at end.
|
||
|
||
Diff (internal accumulator)
|
||
diff
|
||
Copy code
|
||
@@
|
||
- let mut user_map: HashMap<String, OverlapUser> = HashMap::new();
|
||
+ struct OverlapAcc {
|
||
+ username: String,
|
||
+ author_touch_count: u32,
|
||
+ review_touch_count: u32,
|
||
+ touch_count: u32,
|
||
+ last_touch_at: i64,
|
||
+ mr_refs: HashSet<String>,
|
||
+ }
|
||
+ let mut user_map: HashMap<String, OverlapAcc> = HashMap::new();
|
||
@@
|
||
- let entry = user_map.entry(username.clone()).or_insert_with(|| OverlapUser {
|
||
+ let entry = user_map.entry(username.clone()).or_insert_with(|| OverlapAcc {
|
||
username: username.clone(),
|
||
author_touch_count: 0,
|
||
review_touch_count: 0,
|
||
touch_count: 0,
|
||
last_touch_at: 0,
|
||
- mr_refs: Vec::new(),
|
||
+ mr_refs: HashSet::new(),
|
||
});
|
||
@@
|
||
- let mut set: HashSet<String> = entry.mr_refs.drain(..).collect();
|
||
- for r in mr_refs { set.insert(r); }
|
||
- entry.mr_refs = set.into_iter().collect();
|
||
+ for r in mr_refs { entry.mr_refs.insert(r); }
|
||
@@
|
||
- let mut users: Vec<OverlapUser> = user_map.into_values().collect();
|
||
+ let mut users: Vec<OverlapUser> = user_map.into_values().map(|a| {
|
||
+ let mut mr_refs: Vec<String> = a.mr_refs.into_iter().collect();
|
||
+ mr_refs.sort();
|
||
+ OverlapUser {
|
||
+ username: a.username,
|
||
+ author_touch_count: a.author_touch_count,
|
||
+ review_touch_count: a.review_touch_count,
|
||
+ touch_count: a.touch_count,
|
||
+ last_touch_at: a.last_touch_at,
|
||
+ mr_refs,
|
||
+ }
|
||
+ }).collect();
|
||
|
||
7) Tests to lock these behaviors
|
||
Add tests (high value)
|
||
|
||
dotless subdir file uses DB probe → exact match
|
||
|
||
self-review exclusion prevents MR author showing up as reviewer
|
||
|
||
deterministic ordering for participants and mr_refs (sort)
|
||
|
||
Diff (test additions outline)
|
||
diff
|
||
Copy code
|
||
@@
|
||
#[test]
|
||
+ fn test_build_path_query_dotless_subdir_file_uses_probe() {
|
||
+ let conn = setup_test_db();
|
||
+ insert_project(&conn, 1, "team/backend");
|
||
+ insert_mr(&conn, 1, 1, 100, "author_a", "opened");
|
||
+ insert_discussion(&conn, 1, 1, Some(1), None, true, false);
|
||
+ insert_diffnote(&conn, 1, 1, 1, "reviewer_b", "src/Dockerfile", "note");
|
||
+
|
||
+ let pq = build_path_query(&conn, "src/Dockerfile").unwrap();
|
||
+ assert_eq!(pq.value, "src/Dockerfile");
|
||
+ assert!(!pq.is_prefix);
|
||
+ }
|
||
+
|
||
+ #[test]
|
||
+ fn test_overlap_excludes_self_review_notes() {
|
||
+ let conn = setup_test_db();
|
||
+ insert_project(&conn, 1, "team/backend");
|
||
+ insert_mr(&conn, 1, 1, 100, "author_a", "opened");
|
||
+ insert_discussion(&conn, 1, 1, Some(1), None, true, false);
|
||
+ // author_a comments on their own MR diff
|
||
+ insert_diffnote(&conn, 1, 1, 1, "author_a", "src/auth/login.rs", "clarification");
|
||
+
|
||
+ let result = query_overlap(&conn, "src/auth/", None, 0, 20).unwrap();
|
||
+ let u = result.users.iter().find(|u| u.username == "author_a");
|
||
+ // should not be credited as reviewer touch
|
||
+ assert!(u.map(|x| x.review_touch_count).unwrap_or(0) == 0);
|
||
+ }
|
||
|
||
Net effect
|
||
|
||
Correctness: fixes dotless subdir files + self-review pollution.
|
||
|
||
Signal quality: Expert ranking becomes harder to game by comment volume.
|
||
|
||
Robot reproducibility: deterministic ordering + explicit truncation.
|
||
|
||
Performance: avoids rehash loops in overlap merges; path probe uses indexed equality.
|
||
|
||
If you want one “single best” change: #1 (DB probe exact-match) is the most likely to prevent confusing “why is this empty?” behavior without adding any user-facing complexity. |