Ideas catalog (docs/ideas/): 25 feature concept documents covering future lore capabilities including bottleneck detection, churn analysis, expert scoring, collaboration patterns, milestone risk, knowledge silos, and more. Each doc includes motivation, implementation sketch, data requirements, and dependencies on existing infrastructure. README.md provides an overview and SYSTEM-PROPOSAL.md presents the unified analytics vision. Plans (plans/): Time-decay expert scoring design with four rounds of review feedback exploring decay functions, scoring algebra, and integration points with the existing who-expert pipeline. Issue doc (docs/issues/001): Documents the timeline pipeline bug where EntityRef was missing project context, causing ambiguous cross-project references during the EXPAND stage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3.0 KiB
3.0 KiB
Recurring Bug Pattern Detector
- Command:
lore recurring-patterns [--min-cluster <N>] - Confidence: 76%
- Tier: 3
- Status: proposed
- Effort: high — vector clustering, threshold tuning
What
Cluster closed issues by embedding similarity. Identify clusters of 3+ issues that are semantically similar — these represent recurring problems that need a systemic fix rather than one-off patches.
Why
Finding the same bug filed 5 different ways is one of the most impactful things you can surface. This is a sophisticated use of the embedding pipeline that no competing tool offers. It turns "we keep having auth issues" from a gut feeling into data.
Data Required
All exists today:
documents(source_type='issue', content_text)embeddings(768-dim vectors)issues(state='closed' for filtering)
Implementation Sketch
1. Collect all embeddings for closed issue documents
2. For each issue, find K nearest neighbors (K=10)
3. Build adjacency graph: edge exists if similarity > threshold (e.g., 0.80)
4. Find connected components (simple DFS/BFS)
5. Filter to components with >= min-cluster members (default 3)
6. For each cluster:
a. Extract common terms (TF-IDF or simple word frequency)
b. Sort by recency (most recent issue first)
c. Report cluster with: theme, member issues, time span
Similarity Threshold Tuning
This is the critical parameter. Too low = noise, too high = misses.
- Start at 0.80 cosine similarity
- Expose as
--thresholdflag for user tuning - Report cluster cohesion score for transparency
Human Output
Recurring Patterns (3+ similar closed issues)
Cluster 1: "Authentication timeout errors" (5 issues, spanning 6 months)
#89 Login timeout on slow networks (closed 3d ago)
#72 Auth flow hangs on cellular (closed 2mo ago)
#58 Token refresh timeout (closed 3mo ago)
#45 SSO login timeout for remote users (closed 5mo ago)
#31 Connection timeout in auth middleware (closed 6mo ago)
Avg similarity: 0.87 | Suggested: systemic fix for auth timeout handling
Cluster 2: "Cache invalidation issues" (3 issues, spanning 2 months)
#85 Stale cache after deploy (closed 2w ago)
#77 Cache headers not updated (closed 1mo ago)
#69 Dashboard shows old data after settings change (closed 2mo ago)
Avg similarity: 0.82 | Suggested: review cache invalidation strategy
Downsides
- Clustering quality depends on embedding quality and threshold tuning
- May produce false clusters (issues that mention similar terms but are different problems)
- Computationally expensive for large issue counts (N^2 comparisons)
- Need to handle multi-chunk documents (aggregate embeddings)
Extensions
lore recurring-patterns --open— find clusters in open issues (duplicates to merge)lore recurring-patterns --cross-project— patterns across repos- Trend detection: are cluster sizes growing? (escalating problem)
- Export as report for engineering retrospectives