# Recurring Bug Pattern Detector

- **Command:** `lore recurring-patterns [--min-cluster <N>]`
- **Confidence:** 76%
- **Tier:** 3
- **Status:** proposed
- **Effort:** high — vector clustering, threshold tuning

## What

Cluster closed issues by embedding similarity. Identify clusters of 3+ issues that
are semantically similar — these represent recurring problems that need a systemic
fix rather than one-off patches.

## Why

Finding the same bug filed five different ways is among the most valuable signals
the tool can surface. It is a sophisticated use of the embedding pipeline that, as
far as we know, no competing tool offers. It turns "we keep having auth issues"
from a gut feeling into data.

## Data Required

All of this exists today:
- `documents` (source_type='issue', content_text)
- `embeddings` (768-dim vectors)
- `issues` (state='closed' for filtering)

## Implementation Sketch

```
1. Collect all embeddings for closed issue documents
   (aggregate multi-chunk documents into one vector per issue)
2. For each issue, find K nearest neighbors (K=10)
3. Build adjacency graph: edge between an issue and a neighbor if their
   cosine similarity > threshold (e.g., 0.80)
4. Find connected components (simple DFS/BFS)
5. Filter to components with >= min-cluster members (default 3)
6. For each cluster:
   a. Extract common terms (TF-IDF or simple word frequency)
   b. Sort member issues by recency (most recent first)
   c. Report cluster with: theme, member issues, time span, avg similarity
```
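Steps 2 through 5 can be sketched in a few lines. This is a minimal illustration, not the planned implementation: it assumes one vector per issue in a hypothetical `{issue_id: vector}` mapping, and it finds neighbors by brute force, where a real implementation would likely query the existing vector index instead.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_clusters(embeddings, k=10, threshold=0.80, min_cluster=3):
    """Return sorted lists of issue ids, one list per qualifying cluster.

    `embeddings` is a hypothetical {issue_id: vector} mapping.
    """
    ids = list(embeddings)
    graph = defaultdict(set)

    # Steps 2-3: link each issue to its top-K neighbors above the threshold.
    for i in ids:
        scored = sorted(
            ((cosine(embeddings[i], embeddings[j]), j) for j in ids if j != i),
            reverse=True,
        )
        for sim, j in scored[:k]:
            if sim > threshold:
                graph[i].add(j)
                graph[j].add(i)

    # Step 4: connected components via iterative DFS.
    seen, clusters = set(), []
    for start in ids:
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        # Step 5: keep only components with enough members.
        if len(component) >= min_cluster:
            clusters.append(sorted(component))
    return clusters
```

Edges are added in both directions, so an issue joins a cluster if it is in a neighbor's top K even when that neighbor is not in its own.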

### Similarity Threshold Tuning

This is the critical parameter. Too low a threshold merges unrelated issues into
noisy clusters; too high a threshold misses genuine recurrences.
- Start at 0.80 cosine similarity
- Expose as `--threshold` flag for user tuning
- Report a cluster cohesion score (e.g., the average similarity shown in the sample output) for transparency
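One plausible definition of the cohesion score is the mean pairwise cosine similarity of a cluster's members. The sketch below assumes that definition; the name `cohesion` and the choice to score a single-member cluster as 1.0 are illustrative, not settled design.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cohesion(vectors):
    """Mean pairwise cosine similarity across a cluster's member vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0  # assumption: a single-member cluster is trivially cohesive
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

A cluster whose cohesion sits well below the edge threshold was probably held together by a chain of intermediate links, which is a useful hint when tuning `--threshold`.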

## Human Output

```
Recurring Patterns (3+ similar closed issues)

Cluster 1: "Authentication timeout errors" (5 issues, spanning 6 months)
  #89  Login timeout on slow networks           (closed 3d ago)
  #72  Auth flow hangs on cellular               (closed 2mo ago)
  #58  Token refresh timeout                     (closed 3mo ago)
  #45  SSO login timeout for remote users        (closed 5mo ago)
  #31  Connection timeout in auth middleware      (closed 6mo ago)
  Avg similarity: 0.87 | Suggested: systemic fix for auth timeout handling

Cluster 2: "Cache invalidation issues" (3 issues, spanning 2 months)
  #85  Stale cache after deploy                  (closed 2w ago)
  #77  Cache headers not updated                 (closed 1mo ago)
  #69  Dashboard shows old data after settings change (closed 2mo ago)
  Avg similarity: 0.82 | Suggested: review cache invalidation strategy
```

## Downsides

- Clustering quality depends on embedding quality and threshold tuning
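- May produce false clusters: issues that mention similar terms but describe different problems, or unrelated issues chained into one connected component through intermediate neighbors
- Computationally expensive for large issue counts (brute-force pairwise comparison is O(N^2))
- Multi-chunk documents need their chunk embeddings aggregated into a single vector per issue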
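For the multi-chunk case, one common approach is to mean-pool the chunk vectors and re-normalize the result. The sketch below assumes that approach and a hypothetical `aggregate_chunks` helper; other strategies, such as using only the first chunk or weighting by chunk length, may work better in practice.

```python
import math


def aggregate_chunks(chunk_vectors):
    """Mean-pool chunk embeddings into one unit-length document vector."""
    if not chunk_vectors:
        raise ValueError("document has no chunk embeddings")
    dim = len(chunk_vectors[0])
    count = len(chunk_vectors)
    mean = [sum(vec[d] for vec in chunk_vectors) / count for d in range(dim)]
    # Re-normalize so downstream cosine comparisons stay on the same scale.
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm else mean
```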

## Extensions

- `lore recurring-patterns --open` — find clusters in open issues (duplicates to merge)
- `lore recurring-patterns --cross-project` — patterns across repos
- Trend detection: are cluster sizes growing? (escalating problem)
- Export as report for engineering retrospectives