docs: add feature ideas catalog, time-decay scoring plan, and timeline issue doc

Ideas catalog (docs/ideas/): 25 feature concept documents covering future
lore capabilities including bottleneck detection, churn analysis, expert
scoring, collaboration patterns, milestone risk, knowledge silos, and more.
Each doc includes motivation, implementation sketch, data requirements, and
dependencies on existing infrastructure. README.md provides an overview and
SYSTEM-PROPOSAL.md presents the unified analytics vision.

Plans (plans/): Time-decay expert scoring design with four rounds of review
feedback exploring decay functions, scoring algebra, and integration points
with the existing who-expert pipeline.

Issue doc (docs/issues/001): Documents the timeline pipeline bug where
EntityRef was missing project context, causing ambiguous cross-project
references during the EXPAND stage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Taylor Eernisse
2026-02-09 10:16:48 -05:00
parent d54f669c5e
commit 4185abe05d
32 changed files with 4170 additions and 0 deletions

View File

@@ -0,0 +1,95 @@
# Similar Issues Finder
- **Command:** `lore similar <iid>`
- **Confidence:** 95%
- **Tier:** 1
- **Status:** proposed
- **Effort:** low — infrastructure exists, needs one new query path
## What
Given an issue IID, find the N most semantically similar issues using the existing
vector embeddings. Show similarity score and overlapping keywords.
Can also work with MRs: `lore similar --mr <iid>`.
## Why
Duplicate detection is a constant problem on active projects. "Is this bug already
filed?" becomes a one-liner. This is the most natural use of the embedding pipeline
and the feature people expect when they hear "semantic search."
## Data Required
All exists today:
- `documents` table (source_type, source_id, content_text)
- `embeddings` virtual table (768-dim vectors via sqlite-vec)
- `embedding_metadata` (document_hash for staleness check)
## Implementation Sketch
```
1. Resolve IID → issue.id → document.id (via source_type='issue', source_id)
2. Look up embedding vector(s) for that document
3. Query sqlite-vec for K nearest neighbors (K = limit * 2 for headroom)
4. Filter to source_type='issue' (or 'merge_request' if --include-mrs)
5. Exclude self
6. Rank by cosine similarity
7. Return top N with: iid, title, project, similarity_score, url
```
### SQL Core
```sql
-- Get the embedding for target document (chunk 0 = representative)
SELECT embedding FROM embeddings WHERE rowid = ?1 * 1000;
-- Find nearest neighbors
SELECT
rowid,
distance
FROM embeddings
WHERE embedding MATCH ?1
AND k = ?2
ORDER BY distance;
-- Resolve back to entities
SELECT d.source_type, d.source_id, d.title, d.url, i.iid, i.state
FROM documents d
JOIN issues i ON d.source_id = i.id AND d.source_type = 'issue'
WHERE d.id = ?;
```
## Robot Mode Output
```json
{
"ok": true,
"data": {
"query_issue": { "iid": 42, "title": "Login timeout on slow networks" },
"similar": [
{
"iid": 38,
"title": "Connection timeout in auth flow",
"project": "group/backend",
"similarity": 0.87,
"state": "closed",
"url": "https://gitlab.com/group/backend/-/issues/38"
}
]
},
"meta": { "elapsed_ms": 45, "candidates_scanned": 200 }
}
```
## Downsides
- Embedding quality depends on description quality; short issues may not match well
- Multi-chunk documents need aggregation strategy (use chunk 0 or average?)
- Requires embeddings to be generated first (`lore embed`)
## Extensions
- `lore similar --open-only` to filter to unresolved issues (duplicate triage)
- `lore similar --text "free text query"` to find issues similar to arbitrary text
- Batch mode: find all potential duplicate clusters across the entire database