Taylor Eernisse 4185abe05d docs: add feature ideas catalog, time-decay scoring plan, and timeline issue doc

Ideas catalog (docs/ideas/): 25 feature concept documents covering future
lore capabilities including bottleneck detection, churn analysis, expert
scoring, collaboration patterns, milestone risk, knowledge silos, and more.
Each doc includes motivation, implementation sketch, data requirements, and
dependencies on existing infrastructure. README.md provides an overview and
SYSTEM-PROPOSAL.md presents the unified analytics vision.

Plans (plans/): Time-decay expert scoring design with four rounds of review
feedback exploring decay functions, scoring algebra, and integration points
with the existing who-expert pipeline.

Issue doc (docs/issues/001): Documents the timeline pipeline bug where
EntityRef was missing project context, causing ambiguous cross-project
references during the EXPAND stage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-09 10:16:48 -05:00

3.0 KiB

Recurring Bug Pattern Detector

Command: lore recurring-patterns [--min-cluster <N>]
Confidence: 76%
Tier: 3
Status: proposed
Effort: high — vector clustering, threshold tuning

What

Cluster closed issues by embedding similarity. Identify clusters of 3+ issues that are semantically similar — these represent recurring problems that need a systemic fix rather than one-off patches.

Why

Finding the same bug filed 5 different ways is one of the most impactful things you can surface. This is a sophisticated use of the embedding pipeline that no competing tool offers. It turns "we keep having auth issues" from a gut feeling into data.

Data Required

All of this exists today:

documents (source_type='issue', content_text)
embeddings (768-dim vectors)
issues (state='closed' for filtering)

Implementation Sketch

1. Collect all embeddings for closed issue documents
2. For each issue, find K nearest neighbors (K=10)
3. Build adjacency graph: edge exists if similarity > threshold (e.g., 0.80)
4. Find connected components (simple DFS/BFS)
5. Filter to components with >= min-cluster members (default 3)
6. For each cluster:
   a. Extract common terms (TF-IDF or simple word frequency)
   b. Sort by recency (most recent issue first)
   c. Report cluster with: theme, member issues, time span
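Steps 2 through 5 above can be sketched in a few lines. This is a minimal illustration, not lore's implementation: the function names, the `dict` of issue id to vector input shape, and the brute-force neighbor search are all assumptions made for the example.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_clusters(embeddings, k=10, threshold=0.80, min_cluster=3):
    """embeddings: dict of issue id -> vector (hypothetical input shape)."""
    ids = list(embeddings)
    adjacency = {i: set() for i in ids}

    # Steps 2-3: brute-force K nearest neighbors, keeping only edges
    # whose similarity exceeds the threshold.
    for i in ids:
        scored = sorted(
            ((cosine(embeddings[i], embeddings[j]), j) for j in ids if j != i),
            reverse=True,
        )
        for sim, j in scored[:k]:
            if sim > threshold:
                adjacency[i].add(j)
                adjacency[j].add(i)  # treat edges as undirected

    # Step 4: connected components via BFS.
    seen, clusters = set(), []
    for start in ids:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency[node] - seen:
                seen.add(neighbor)
                queue.append(neighbor)
        # Step 5: drop components below the minimum cluster size.
        if len(component) >= min_cluster:
            clusters.append(sorted(component))
    return clusters
```

Connected components are transitive: if A links to B and B links to C, all three land in one cluster even when A and C fall below the threshold. That makes the per-cluster cohesion score worth reporting alongside the members.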

Similarity Threshold Tuning

This is the critical parameter. Too low = noise, too high = misses.

Start at 0.80 cosine similarity
Expose as --threshold flag for user tuning
Report a cluster cohesion score (e.g., average pairwise similarity) for transparency
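One plausible cohesion score is the mean pairwise cosine similarity of a cluster's members. The sketch below assumes that definition; the function names are hypothetical and the document does not specify how the score is computed.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cohesion(vectors):
    """Mean pairwise cosine similarity across a cluster's member vectors.

    Returns 1.0 for fewer than two vectors (a convention chosen for this sketch).
    """
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

A cluster whose cohesion sits well below the edge threshold was probably stitched together by chained links, which is a useful signal when tuning `--threshold`.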

Human Output

Recurring Patterns (3+ similar closed issues)

Cluster 1: "Authentication timeout errors" (5 issues, spanning 6 months)
  #89  Login timeout on slow networks           (closed 3d ago)
  #72  Auth flow hangs on cellular               (closed 2mo ago)
  #58  Token refresh timeout                     (closed 3mo ago)
  #45  SSO login timeout for remote users        (closed 5mo ago)
  #31  Connection timeout in auth middleware      (closed 6mo ago)
  Avg similarity: 0.87 | Suggested: systemic fix for auth timeout handling

Cluster 2: "Cache invalidation issues" (3 issues, spanning 2 months)
  #85  Stale cache after deploy                  (closed 2w ago)
  #77  Cache headers not updated                 (closed 1mo ago)
  #69  Dashboard shows old data after settings change (closed 2mo ago)
  Avg similarity: 0.82 | Suggested: review cache invalidation strategy
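The quoted theme for each cluster could be seeded from the "simple word frequency" option in step 6a. The sketch below ranks terms by how many issue titles in the cluster contain them. The stopword list and function name are illustrative assumptions, and turning the top terms into a readable phrase such as "Authentication timeout errors" would still need a further step.

```python
import re
from collections import Counter

# Illustrative stopword list; a real one would be larger.
STOPWORDS = {"a", "an", "the", "on", "in", "for", "of", "to", "after", "not", "and", "is"}


def common_terms(titles, top_n=3):
    """Return the terms that appear in the most titles, ties broken alphabetically."""
    counts = Counter()
    for title in titles:
        # Count each term once per title so one verbose issue cannot dominate.
        words = set(re.findall(r"[a-z0-9]+", title.lower())) - STOPWORDS
        counts.update(words)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


titles = [
    "Login timeout on slow networks",
    "Auth flow hangs on cellular",
    "Token refresh timeout",
    "SSO login timeout for remote users",
    "Connection timeout in auth middleware",
]
print(common_terms(titles))  # ['timeout', 'auth', 'login']
```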

Downsides

Clustering quality depends on embedding quality and threshold tuning
May produce false clusters (issues that mention similar terms but are different problems)
Computationally expensive for large issue counts (brute-force neighbor search needs N^2 comparisons)
Need to handle multi-chunk documents (aggregate embeddings)
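For the multi-chunk case, one common approach is to mean-pool a document's chunk embeddings and re-normalize, so each issue contributes a single vector to clustering. This is a sketch of that approach only; the function name is hypothetical, and other aggregations (for example, using only the first chunk) may suit the data better.

```python
import math


def aggregate_chunks(chunk_vectors):
    """Mean-pool chunk embeddings into one unit-length document vector."""
    if not chunk_vectors:
        raise ValueError("document has no chunk embeddings")
    dim = len(chunk_vectors[0])
    count = len(chunk_vectors)
    mean = [sum(v[d] for v in chunk_vectors) / count for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # Leave an all-zero mean untouched rather than dividing by zero.
    return [x / norm for x in mean] if norm else mean
```

Re-normalizing keeps cosine similarity well behaved regardless of how many chunks a document has.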

Extensions

lore recurring-patterns --open — find clusters in open issues (duplicates to merge)
lore recurring-patterns --cross-project — patterns across repos
Trend detection: are cluster sizes growing? (escalating problem)
Export as report for engineering retrospectives
