Design document for `lore who` — a people intelligence query layer over existing GitLab data (280K notes, 210K discussions, 33K DiffNotes, 53 participants). Answers five collaboration questions: expert lookup by file/path, workload summary, review pattern analysis, active discussion tracking, and file overlap detection. Key design decisions refined across 8 feedback iterations: - All SQL is fully static (no format!()) with prepare_cached() throughout - Exact vs prefix path matching via PathQuery struct (two static SQL variants) - Self-review exclusion (author != reviewer) on all DiffNote branches - Deterministic output: sorted GROUP_CONCAT results, stable tie-breakers - Bounded payloads with *_total/*_truncated metadata for robot consumers - Truncation transparency via LIMIT+1 overflow detection pattern - Robot JSON includes resolved_input for reproducibility (since_mode tri-state) - Multi-project correctness with project-qualified entity references - Composite migration indexes designed for query selectivity on hot paths Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
356 lines
12 KiB
Markdown
356 lines
12 KiB
Markdown
Below are the highest-leverage revisions I’d make. They’re tightly scoped (no new tables/APIs), but fix a few real correctness issues and make the outputs more actionable.
|
||
|
||
1) Fix a correctness bug in PathQuery: don’t escape for =, and make --path Makefile actually work
|
||
Why
|
||
|
||
Bug: build_path_query() currently runs escape_like() even when is_prefix = false (exact match). That will break exact matches for paths containing _, %, or \ because = does not treat those as metacharacters (so the escaped string won’t equal the stored path).
|
||
|
||
UX mismatch: The plan says --path handles dotless root files (Makefile/LICENSE), but the current logic still treats them as directory prefixes (Makefile/%) → zero results.
|
||
|
||
Change
|
||
|
||
Only escape for LIKE.
|
||
|
||
Treat root paths (no /) passed via --path as exact matches by default (unless they end with /).
|
||
|
||
diff
|
||
Copy code
|
||
diff --git a/plan.md b/plan.md
|
||
@@
|
||
-/// Build a path query from a user-supplied path.
|
||
-///
|
||
-/// Rules:
|
||
-/// - If the path ends with `/`, it's a directory prefix -> `escaped_path%` (LIKE)
|
||
-/// - If the last path segment contains `.`, it's a file -> exact match (=)
|
||
-/// - Otherwise, it's a directory prefix -> `escaped_path/%` (LIKE)
|
||
+/// Build a path query from a user-supplied path.
|
||
+///
|
||
+/// Rules:
|
||
+/// - If the path ends with `/`, it's a directory prefix -> `escaped_path/%` (LIKE)
|
||
+/// - If the path is a root path (no `/`) and does NOT end with `/`, treat as exact (=)
|
||
+/// (this makes `--path Makefile` and `--path LICENSE` work as intended)
|
||
+/// - Else if the last path segment contains `.`, treat as exact (=)
|
||
+/// - Otherwise, treat as directory prefix -> `escaped_path/%` (LIKE)
|
||
@@
|
||
-fn build_path_query(path: &str) -> PathQuery {
|
||
+fn build_path_query(path: &str) -> PathQuery {
|
||
let trimmed = path.trim_end_matches('/');
|
||
let last_segment = trimmed.rsplit('/').next().unwrap_or(trimmed);
|
||
- let is_file = !path.ends_with('/') && last_segment.contains('.');
|
||
- let escaped = escape_like(trimmed);
|
||
+ let is_root = !trimmed.contains('/');
|
||
+ let is_file = !path.ends_with('/') && (is_root || last_segment.contains('.'));
|
||
|
||
if is_file {
|
||
PathQuery {
|
||
- value: escaped,
|
||
+ // IMPORTANT: do NOT escape for exact match (=)
|
||
+ value: trimmed.to_string(),
|
||
is_prefix: false,
|
||
}
|
||
} else {
|
||
+ let escaped = escape_like(trimmed);
|
||
PathQuery {
|
||
value: format!("{escaped}/%"),
|
||
is_prefix: true,
|
||
}
|
||
}
|
||
}
|
||
@@
|
||
-/// **Known limitation:** Dotless root files (LICENSE, Makefile, Dockerfile)
|
||
-/// without a trailing `/` will be treated as directory prefixes. Use `--path`
|
||
-/// for these — the `--path` flag passes through to Expert mode directly,
|
||
-/// and the `build_path_query` output for "LICENSE" is a prefix `LICENSE/%`
|
||
-/// which will simply return zero results (a safe, obvious failure mode that the
|
||
-/// help text addresses).
|
||
+/// Note: Root file paths passed via `--path` (including dotless files like Makefile/LICENSE)
|
||
+/// are treated as exact matches unless they end with `/`.
|
||
|
||
|
||
Also update the --path help text to be explicit:
|
||
|
||
diff
|
||
Copy code
|
||
diff --git a/plan.md b/plan.md
|
||
@@
|
||
- /// Force expert mode for a file/directory path (handles root files like
|
||
- /// README.md, LICENSE, Makefile that lack a / and can't be auto-detected)
|
||
+ /// Force expert mode for a file/directory path.
|
||
+ /// Root files (README.md, LICENSE, Makefile) are treated as exact matches.
|
||
+ /// Use a trailing `/` to force directory-prefix matching.
|
||
|
||
2) Fix Active mode: your note_count is currently counting participants, and the CTE scans too broadly
|
||
Why
|
||
|
||
In note_agg, you do SELECT DISTINCT discussion_id, author_username and then COUNT(*) AS note_count. That’s participant count, not note count.
|
||
|
||
The current note_agg also builds the DISTINCT set from all notes then joins to picked. It’s avoidable work.
|
||
|
||
Change
|
||
|
||
Split into two aggregations scoped to picked:
|
||
|
||
note_counts: counts non-system notes per picked discussion.
|
||
|
||
participants: distinct usernames per picked discussion, then GROUP_CONCAT.
|
||
|
||
diff
|
||
Copy code
|
||
diff --git a/plan.md b/plan.md
|
||
@@
|
||
- note_agg AS (
|
||
- SELECT
|
||
- n.discussion_id,
|
||
- COUNT(*) AS note_count,
|
||
- GROUP_CONCAT(n.author_username, X'1F') AS participants
|
||
- FROM (
|
||
- SELECT DISTINCT discussion_id, author_username
|
||
- FROM notes
|
||
- WHERE is_system = 0 AND author_username IS NOT NULL
|
||
- ) n
|
||
- JOIN picked p ON p.id = n.discussion_id
|
||
- GROUP BY n.discussion_id
|
||
- )
|
||
+ note_counts AS (
|
||
+ SELECT
|
||
+ n.discussion_id,
|
||
+ COUNT(*) AS note_count
|
||
+ FROM notes n
|
||
+ JOIN picked p ON p.id = n.discussion_id
|
||
+ WHERE n.is_system = 0
|
||
+ GROUP BY n.discussion_id
|
||
+ ),
|
||
+ participants AS (
|
||
+ SELECT
|
||
+ x.discussion_id,
|
||
+ GROUP_CONCAT(x.author_username, X'1F') AS participants
|
||
+ FROM (
|
||
+ SELECT DISTINCT n.discussion_id, n.author_username
|
||
+ FROM notes n
|
||
+ JOIN picked p ON p.id = n.discussion_id
|
||
+ WHERE n.is_system = 0 AND n.author_username IS NOT NULL
|
||
+ ) x
|
||
+ GROUP BY x.discussion_id
|
||
+ )
|
||
@@
|
||
- LEFT JOIN note_agg na ON na.discussion_id = p.id
|
||
+ LEFT JOIN note_counts nc ON nc.discussion_id = p.id
|
||
+ LEFT JOIN participants pa ON pa.discussion_id = p.id
|
||
@@
|
||
- COALESCE(na.note_count, 0) AS note_count,
|
||
- COALESCE(na.participants, '') AS participants
|
||
+ COALESCE(nc.note_count, 0) AS note_count,
|
||
+ COALESCE(pa.participants, '') AS participants
|
||
|
||
|
||
Net effect: correctness fix + more predictable perf.
|
||
|
||
Add a test that would have failed before:
|
||
|
||
diff
|
||
Copy code
|
||
diff --git a/plan.md b/plan.md
|
||
@@
|
||
#[test]
|
||
fn test_active_query() {
|
||
@@
|
||
- insert_diffnote(&conn, 1, 1, 1, "reviewer_b", "src/foo.rs", "needs work");
|
||
+ insert_diffnote(&conn, 1, 1, 1, "reviewer_b", "src/foo.rs", "needs work");
|
||
+ insert_diffnote(&conn, 2, 1, 1, "reviewer_b", "src/foo.rs", "follow-up");
|
||
@@
|
||
- assert_eq!(result.discussions[0].participants, vec!["reviewer_b"]);
|
||
+ assert_eq!(result.discussions[0].participants, vec!["reviewer_b"]);
|
||
+ assert_eq!(result.discussions[0].note_count, 2);
|
||
|
||
3) Index fix: idx_discussions_unresolved_recent won’t help global --active ordering
|
||
Why
|
||
|
||
Your index is (project_id, last_note_at) with WHERE resolvable=1 AND resolved=0.
|
||
|
||
When --active is not project-scoped (common default), SQLite can’t use (project_id, last_note_at) to satisfy ORDER BY last_note_at DESC efficiently because project_id isn’t constrained.
|
||
|
||
This can turn into a scan+sort over potentially large unresolved sets.
|
||
|
||
Change
|
||
|
||
Keep the project-scoped index, but add a global ordering index (partial, still small):
|
||
|
||
diff
|
||
Copy code
|
||
diff --git a/plan.md b/plan.md
|
||
@@
|
||
CREATE INDEX IF NOT EXISTS idx_discussions_unresolved_recent
|
||
ON discussions(project_id, last_note_at)
|
||
WHERE resolvable = 1 AND resolved = 0;
|
||
+
|
||
+-- Active (global): unresolved discussions by recency (no project scope).
|
||
+-- Supports ORDER BY last_note_at DESC LIMIT N when project_id is unconstrained.
|
||
+CREATE INDEX IF NOT EXISTS idx_discussions_unresolved_recent_global
|
||
+ ON discussions(last_note_at)
|
||
+ WHERE resolvable = 1 AND resolved = 0;
|
||
|
||
4) Make Overlap “touches” coherent: count MRs for reviewers, not DiffNotes
|
||
Why
|
||
|
||
Overlap’s question is “Who else has MRs touching my files?” but:
|
||
|
||
reviewer branch uses COUNT(*) (DiffNotes)
|
||
|
||
author branch uses COUNT(DISTINCT m.id) (MRs)
|
||
|
||
Those are different units; summing them into touch_count is misleading.
|
||
|
||
Change
|
||
|
||
Count distinct MRs on the reviewer branch too:
|
||
|
||
diff
|
||
Copy code
|
||
diff --git a/plan.md b/plan.md
|
||
@@
|
||
- COUNT(*) AS touch_count,
|
||
+ COUNT(DISTINCT m.id) AS touch_count,
|
||
MAX(n.created_at) AS last_touch_at,
|
||
GROUP_CONCAT(DISTINCT (p.path_with_namespace || '!' || m.iid)) AS mr_refs
|
||
|
||
|
||
Also update human output labeling:
|
||
|
||
diff
|
||
Copy code
|
||
diff --git a/plan.md b/plan.md
|
||
@@
|
||
- style("Touches").bold(),
|
||
+ style("MRs").bold(),
|
||
|
||
|
||
(You still preserve “strength” via mr_refs and last_touch_at.)
|
||
|
||
5) Make outputs more actionable: add a canonical ref field (group/project!iid, group/project#iid)
|
||
Why
|
||
|
||
You already do this for Overlap (mr_refs). Doing the same for Workload and Active reduces friction for both humans and agents:
|
||
|
||
humans can copy/paste a single token
|
||
|
||
robots don’t need to stitch project_path + iid + prefix
|
||
|
||
Change (Workload structs + SQL)
|
||
diff
|
||
Copy code
|
||
diff --git a/plan.md b/plan.md
|
||
@@
|
||
pub struct WorkloadIssue {
|
||
pub iid: i64,
|
||
+ pub ref_: String,
|
||
pub title: String,
|
||
pub project_path: String,
|
||
pub updated_at: i64,
|
||
}
|
||
@@
|
||
pub struct WorkloadMr {
|
||
pub iid: i64,
|
||
+ pub ref_: String,
|
||
pub title: String,
|
||
pub draft: bool,
|
||
pub project_path: String,
|
||
@@
|
||
- let issues_sql =
|
||
- "SELECT i.iid, i.title, p.path_with_namespace, i.updated_at
|
||
+ let issues_sql =
|
||
+ "SELECT i.iid,
|
||
+ (p.path_with_namespace || '#' || i.iid) AS ref,
|
||
+ i.title, p.path_with_namespace, i.updated_at
|
||
@@
|
||
- iid: row.get(0)?,
|
||
- title: row.get(1)?,
|
||
- project_path: row.get(2)?,
|
||
- updated_at: row.get(3)?,
|
||
+ iid: row.get(0)?,
|
||
+ ref_: row.get(1)?,
|
||
+ title: row.get(2)?,
|
||
+ project_path: row.get(3)?,
|
||
+ updated_at: row.get(4)?,
|
||
})
|
||
@@
|
||
- let authored_sql =
|
||
- "SELECT m.iid, m.title, m.draft, p.path_with_namespace, m.updated_at
|
||
+ let authored_sql =
|
||
+ "SELECT m.iid,
|
||
+ (p.path_with_namespace || '!' || m.iid) AS ref,
|
||
+ m.title, m.draft, p.path_with_namespace, m.updated_at
|
||
@@
|
||
- iid: row.get(0)?,
|
||
- title: row.get(1)?,
|
||
- draft: row.get::<_, i32>(2)? != 0,
|
||
- project_path: row.get(3)?,
|
||
+ iid: row.get(0)?,
|
||
+ ref_: row.get(1)?,
|
||
+ title: row.get(2)?,
|
||
+ draft: row.get::<_, i32>(3)? != 0,
|
||
+ project_path: row.get(4)?,
|
||
author_username: None,
|
||
- updated_at: row.get(4)?,
|
||
+ updated_at: row.get(5)?,
|
||
})
|
||
|
||
|
||
Then use ref_ in human output + robot JSON.
|
||
|
||
6) Reviews mode: tolerate leading whitespace before **prefix**
|
||
Why
|
||
|
||
Many people write " **suggestion**: ...". Current LIKE '**%**%' misses that.
|
||
|
||
Change
|
||
|
||
Use ltrim(n.body) consistently:
|
||
|
||
diff
|
||
Copy code
|
||
diff --git a/plan.md b/plan.md
|
||
@@
|
||
- AND n.body LIKE '**%**%'
|
||
+ AND ltrim(n.body) LIKE '**%**%'
|
||
@@
|
||
- SUBSTR(n.body, 3, INSTR(SUBSTR(n.body, 3), '**') - 1) AS raw_prefix,
|
||
+ SUBSTR(ltrim(n.body), 3, INSTR(SUBSTR(ltrim(n.body), 3), '**') - 1) AS raw_prefix,
|
||
|
||
7) Add two small tests that catch the above regressions
|
||
Why
|
||
|
||
These are exactly the kind of issues that slip through without targeted tests.
|
||
|
||
diff
|
||
Copy code
|
||
diff --git a/plan.md b/plan.md
|
||
@@
|
||
#[test]
|
||
fn test_escape_like() {
|
||
@@
|
||
}
|
||
+
|
||
+ #[test]
|
||
+ fn test_build_path_query_exact_does_not_escape() {
|
||
+ // '_' must not be escaped for '='
|
||
+ let pq = build_path_query("README_with_underscore.md");
|
||
+ assert_eq!(pq.value, "README_with_underscore.md");
|
||
+ assert!(!pq.is_prefix);
|
||
+ }
|
||
+
|
||
+ #[test]
|
||
+ fn test_path_flag_dotless_root_file_is_exact() {
|
||
+ let pq = build_path_query("Makefile");
|
||
+ assert_eq!(pq.value, "Makefile");
|
||
+ assert!(!pq.is_prefix);
|
||
+ }
|
||
|
||
Summary of net effect
|
||
|
||
Correctness fixes: exact-path escaping bug; Active.note_count bug.
|
||
|
||
Perf fixes: global --active index; avoid broad note scans in Active.
|
||
|
||
Usefulness upgrades: coherent overlap “touch” metric; canonical refs everywhere; reviews prefix more robust.
|
||
|
||
If you want one extra “stretch” that still isn’t scope creep: add an unscoped warning line in human output when project_id == None (e.g., “Aggregated across projects; use -p to scope”) for Expert/Overlap/Active. That’s pure presentation, but prevents misinterpretation in multi-project DBs. |