# Recurring Bug Pattern Detector

- **Command:** `lore recurring-patterns [--min-cluster ]`
- **Confidence:** 76%
- **Tier:** 3
- **Status:** proposed
- **Effort:** high (vector clustering, threshold tuning)

## What

Cluster closed issues by embedding similarity. Identify clusters of 3+ issues that are semantically similar; these represent recurring problems that need a systemic fix rather than one-off patches.

## Why

Finding the same bug filed 5 different ways is one of the most impactful things you can surface. This is a sophisticated use of the embedding pipeline that no competing tool offers. It turns "we keep having auth issues" from a gut feeling into data.

## Data Required

All exists today:

- `documents` (source_type='issue', content_text)
- `embeddings` (768-dim vectors)
- `issues` (state='closed' for filtering)

## Implementation Sketch

```
1. Collect all embeddings for closed issue documents
2. For each issue, find K nearest neighbors (K=10)
3. Build adjacency graph: edge exists if similarity > threshold (e.g., 0.80)
4. Find connected components (simple DFS/BFS)
5. Filter to components with >= min-cluster members (default 3)
6. For each cluster:
   a. Extract common terms (TF-IDF or simple word frequency)
   b. Sort by recency (most recent issue first)
   c. Report cluster with: theme, member issues, time span
```

### Similarity Threshold Tuning

This is the critical parameter. Too low = noise, too high = misses.

- Start at 0.80 cosine similarity
- Expose as `--threshold` flag for user tuning
- Report cluster cohesion score for transparency

## Human Output

```
Recurring Patterns (3+ similar closed issues)

Cluster 1: "Authentication timeout errors" (5 issues, spanning 6 months)
  #89  Login timeout on slow networks (closed 3d ago)
  #72  Auth flow hangs on cellular (closed 2mo ago)
  #58  Token refresh timeout (closed 3mo ago)
  #45  SSO login timeout for remote users (closed 5mo ago)
  #31  Connection timeout in auth middleware (closed 6mo ago)
  Avg similarity: 0.87 | Suggested: systemic fix for auth timeout handling

Cluster 2: "Cache invalidation issues" (3 issues, spanning 2 months)
  #85  Stale cache after deploy (closed 2w ago)
  #77  Cache headers not updated (closed 1mo ago)
  #69  Dashboard shows old data after settings change (closed 2mo ago)
  Avg similarity: 0.82 | Suggested: review cache invalidation strategy
```

## Downsides

- Clustering quality depends on embedding quality and threshold tuning
- May produce false clusters (issues that mention similar terms but are different problems)
- Computationally expensive for large issue counts (N^2 comparisons)
- Need to handle multi-chunk documents (aggregate embeddings)

## Extensions

- `lore recurring-patterns --open`: find clusters in open issues (duplicates to merge)
- `lore recurring-patterns --cross-project`: patterns across repos
- Trend detection: are cluster sizes growing? (escalating problem)
- Export as report for engineering retrospectives
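
Steps 2 through 5 of the Implementation Sketch, plus the cohesion score from Similarity Threshold Tuning, could look roughly like the Python below. This is a minimal sketch under stated assumptions, not the project's implementation: `find_clusters`, its parameter names, and the in-memory dict of embeddings are hypothetical. It assumes one vector per issue (multi-chunk documents already aggregated) and uses brute-force cosine similarity, which is the N^2 cost noted under Downsides.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_clusters(embeddings, threshold=0.80, k=10, min_cluster=3):
    """Group issues into clusters of similar embeddings.

    embeddings: dict mapping a (hypothetical) issue id to its vector.
    Returns a list of (sorted member ids, average pairwise similarity).
    """
    ids = list(embeddings)

    # Steps 2-3: for each issue, take its k nearest neighbors and add an
    # undirected edge for every neighbor whose similarity exceeds the threshold.
    adjacency = defaultdict(set)
    for a in ids:
        scored = [(cosine(embeddings[a], embeddings[b]), b) for b in ids if b != a]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        for similarity, b in scored[:k]:
            if similarity > threshold:
                adjacency[a].add(b)
                adjacency[b].add(a)

    # Step 4: connected components via iterative DFS.
    seen = set()
    clusters = []
    for start in ids:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)

        # Step 5: keep only components with at least min_cluster members.
        if len(component) < min_cluster:
            continue

        # Cohesion score: mean similarity over all member pairs.
        pair_sims = [
            cosine(embeddings[x], embeddings[y])
            for i, x in enumerate(component)
            for y in component[i + 1:]
        ]
        clusters.append((sorted(component), sum(pair_sims) / len(pair_sims)))
    return clusters


# Toy 3-dim vectors standing in for the 768-dim embeddings.
toy = {
    89: [1.0, 0.1, 0.0],
    72: [0.9, 0.2, 0.0],
    58: [1.0, 0.0, 0.1],
    85: [0.0, 1.0, 0.1],
    77: [0.1, 0.0, 1.0],
}
clusters = find_clusters(toy)
for members, cohesion in clusters:
    print(members, round(cohesion, 2))
```

Connected components are transitive, so a chain of pairwise-similar issues can merge into one cluster even when its endpoints are not similar to each other. Reporting the average pairwise similarity per cluster makes that kind of loose cluster visible to the user.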