gitlore/docs/ideas/recurring-patterns.md
Taylor Eernisse 4185abe05d docs: add feature ideas catalog, time-decay scoring plan, and timeline issue doc
2026-02-09 10:16:48 -05:00


Recurring Bug Pattern Detector

  • Command: lore recurring-patterns [--min-cluster <N>]
  • Confidence: 76%
  • Tier: 3
  • Status: proposed
  • Effort: high — vector clustering, threshold tuning

What

Cluster closed issues by embedding similarity. Identify clusters of 3+ issues that are semantically similar — these represent recurring problems that need a systemic fix rather than one-off patches.

Why

Finding the same bug filed 5 different ways is one of the most impactful things you can surface. This is a sophisticated use of the embedding pipeline that no competing tool offers. It turns "we keep having auth issues" from a gut feeling into data.

Data Required

All of this exists today:

  • documents (source_type='issue', content_text)
  • embeddings (768-dim vectors)
  • issues (state='closed' for filtering)

Implementation Sketch

1. Collect all embeddings for closed issue documents
2. For each issue, find K nearest neighbors (K=10)
3. Build adjacency graph: edge exists if similarity > threshold (e.g., 0.80)
4. Find connected components (simple DFS/BFS)
5. Filter to components with >= min-cluster members (default 3)
6. For each cluster:
   a. Extract common terms (TF-IDF or simple word frequency)
   b. Sort member issues by recency (most recent first)
   c. Report cluster with: theme, member issues, time span
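
Steps 1 through 5 can be sketched in a few dozen lines. This is a minimal illustration, not lore's implementation: it is written in Python for brevity, uses a brute-force neighbor search, and assumes the embeddings have already been loaded into a dict keyed by issue number. The names `cosine` and `find_clusters` are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_clusters(embeddings, k=10, threshold=0.80, min_cluster=3):
    """Hypothetical helper: embeddings maps issue number -> vector.

    Returns a list of clusters, each a sorted list of issue numbers.
    """
    ids = list(embeddings)
    adjacency = defaultdict(set)

    # Steps 2-3: brute-force K nearest neighbors, keeping only edges
    # whose similarity exceeds the threshold.
    for i in ids:
        scored = sorted(
            ((cosine(embeddings[i], embeddings[j]), j) for j in ids if j != i),
            key=lambda pair: pair[0],
            reverse=True,
        )
        for sim, j in scored[:k]:
            if sim > threshold:
                adjacency[i].add(j)
                adjacency[j].add(i)  # treat the graph as undirected

    # Step 4: connected components via iterative DFS.
    seen, clusters = set(), []
    for start in ids:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        # Step 5: drop components smaller than min_cluster.
        if len(component) >= min_cluster:
            clusters.append(sorted(component))
    return clusters
```

Connected components behave like single-linkage clustering: two issues can land in the same cluster through a chain of intermediate neighbors even if their direct similarity is below the threshold.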

Similarity Threshold Tuning

This is the critical parameter. Too low and unrelated issues merge into noisy clusters; too high and genuine recurrences are missed.

  • Start at 0.80 cosine similarity
  • Expose as --threshold flag for user tuning
  • Report cluster cohesion score for transparency
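
One plausible definition of the cohesion score is the mean pairwise cosine similarity among a cluster's members. The sketch below assumes that definition; the doc does not specify one, and the `cohesion` helper is hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cohesion(vectors):
    """Mean cosine similarity over all unordered pairs of member vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("cohesion needs at least two vectors")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

Because clusters are connected components, a cluster's cohesion can fall below the edge threshold when members are linked only through intermediaries. Reporting the score makes such loosely chained clusters visible.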

Human Output

Recurring Patterns (3+ similar closed issues)

Cluster 1: "Authentication timeout errors" (5 issues, spanning 6 months)
  #89  Login timeout on slow networks                  (closed 3d ago)
  #72  Auth flow hangs on cellular                     (closed 2mo ago)
  #58  Token refresh timeout                           (closed 3mo ago)
  #45  SSO login timeout for remote users              (closed 5mo ago)
  #31  Connection timeout in auth middleware           (closed 6mo ago)
  Avg similarity: 0.87 | Suggested: systemic fix for auth timeout handling

Cluster 2: "Cache invalidation issues" (3 issues, spanning 2 months)
  #85  Stale cache after deploy                        (closed 2w ago)
  #77  Cache headers not updated                       (closed 1mo ago)
  #69  Dashboard shows old data after settings change  (closed 2mo ago)
  Avg similarity: 0.82 | Suggested: review cache invalidation strategy
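
The per-cluster fields in a report like the one above (theme terms, members ordered by recency, time span) could be derived roughly as follows. This sketch uses simple word frequency over titles, one of the two options named in step 6a. The `(issue_number, title, closed_on)` tuple shape, the stopword list, and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter
from datetime import date

# Illustrative stopword list only; a real one would be larger.
STOPWORDS = {"a", "after", "for", "in", "not", "on", "the", "to"}


def common_terms(titles, top_n=3):
    """Rank terms by how many titles contain them (ties broken alphabetically)."""
    counts = Counter()
    for title in titles:
        words = set(re.findall(r"[a-z]+", title.lower())) - STOPWORDS
        counts.update(words)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


def summarize(members):
    """members: list of (issue_number, title, closed_on) tuples (assumed shape)."""
    ordered = sorted(members, key=lambda m: m[2], reverse=True)  # most recent first
    return {
        "theme_terms": common_terms([m[1] for m in members]),
        "issues": [m[0] for m in ordered],
        "span_days": (ordered[0][2] - ordered[-1][2]).days,
    }
```

Turning the top terms into a readable theme such as "Authentication timeout errors" would need an extra step (for example a template or a summarization pass) that this sketch leaves out.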

Downsides

  • Clustering quality depends on embedding quality and threshold tuning
  • May produce false clusters (issues that mention similar terms but are different problems)
  • Computationally expensive for large issue counts (brute-force neighbor search needs O(N^2) similarity comparisons)
  • Multi-chunk documents need special handling: each issue needs a single vector, so chunk embeddings must be aggregated
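
For the multi-chunk case, one common option is to mean-pool the chunk vectors and rescale the result to unit length, so each issue contributes a single vector to the similarity search. The sketch below assumes that approach; the doc does not prescribe an aggregation method, and `aggregate_chunks` is a hypothetical name.

```python
import math


def aggregate_chunks(chunk_vectors):
    """Mean-pool chunk vectors into one unit-length document vector."""
    if not chunk_vectors:
        raise ValueError("need at least one chunk vector")
    dim = len(chunk_vectors[0])
    count = len(chunk_vectors)
    mean = [sum(v[i] for v in chunk_vectors) / count for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # An all-zero mean cannot be normalized; return it unchanged.
    return [x / norm for x in mean] if norm else mean
```

Mean pooling weights every chunk equally, so a long issue with many boilerplate chunks (logs, stack traces) may drift away from its core topic. Weighting by chunk position or length is a possible refinement.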

Extensions

  • lore recurring-patterns --open — find clusters in open issues (duplicates to merge)
  • lore recurring-patterns --cross-project — patterns across repos
  • Trend detection: are cluster sizes growing? (escalating problem)
  • Export as report for engineering retrospectives