# Similar Issues Finder

- **Command:** `lore similar <iid>`
- **Confidence:** 95%
- **Tier:** 1
- **Status:** proposed
- **Effort:** low (infrastructure exists; needs one new query path)

## What

Given an issue IID, find the N most semantically similar issues using the existing
vector embeddings. Show similarity score and overlapping keywords.

Can also work with MRs: `lore similar --mr <iid>`.

## Why

Duplicate detection is a constant problem on active projects. "Is this bug already
filed?" becomes a one-liner. This is the most natural use of the embedding pipeline
and the feature people expect when they hear "semantic search."

## Data Required

All exists today:
- `documents` table (source_type, source_id, content_text)
- `embeddings` virtual table (768-dim vectors via sqlite-vec)
- `embedding_metadata` (document_hash for staleness check)
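
The staleness check mentioned for `embedding_metadata` could look roughly like the sketch below. It assumes `document_hash` is a SHA-256 hex digest of `content_text`; the source does not say which hash is used, and the helper names `content_hash` and `is_stale` are hypothetical.

```python
import hashlib


def content_hash(content_text):
    """Hex digest of the document text (SHA-256 is an assumption)."""
    return hashlib.sha256(content_text.encode("utf-8")).hexdigest()


def is_stale(content_text, stored_document_hash):
    """True when the stored embedding was computed from different text."""
    return content_hash(content_text) != stored_document_hash


stored = content_hash("Login timeout on slow networks")
print(is_stale("Login timeout on slow networks", stored))    # False
print(is_stale("Login timeout on slow networks!", stored))   # True
```

A stale target document would presumably need re-embedding before its neighbors can be trusted.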

## Implementation Sketch

```
1. Resolve IID → issue.id → document.id (via source_type='issue', source_id)
2. Look up embedding vector(s) for that document
3. Query sqlite-vec for K nearest neighbors (K = limit * 2 for headroom)
4. Filter to source_type='issue' (or 'merge_request' if --include-mrs)
5. Exclude self
6. Rank by cosine similarity
7. Return top N with: iid, title, project, similarity_score, url
```
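Steps 4 through 7 are plain post-processing once the nearest neighbors are in hand. A minimal sketch, assuming the hits have already been joined to their documents; the `Candidate` shape and the `rank_similar` name are hypothetical, not part of lore:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One KNN hit joined to its document (hypothetical shape)."""
    document_id: int
    source_type: str   # 'issue' or 'merge_request'
    iid: int
    title: str
    similarity: float  # higher means more similar


def rank_similar(target_document_id, candidates, limit, include_mrs=False):
    """Steps 4-7: filter by type, drop the query document, rank, keep top N."""
    allowed = {"issue", "merge_request"} if include_mrs else {"issue"}
    kept = [
        c for c in candidates
        if c.source_type in allowed and c.document_id != target_document_id
    ]
    kept.sort(key=lambda c: c.similarity, reverse=True)
    return kept[:limit]


hits = [
    Candidate(1, "issue", 42, "Login timeout on slow networks", 1.00),
    Candidate(2, "issue", 38, "Connection timeout in auth flow", 0.87),
    Candidate(3, "merge_request", 7, "Raise auth timeout", 0.91),
    Candidate(4, "issue", 51, "Dark mode toggle", 0.12),
]
top = rank_similar(target_document_id=1, candidates=hits, limit=2)
print([c.iid for c in top])  # [38, 51]
```

Because filtering happens after the KNN query, fewer than N results can survive when many neighbors are MRs; that is what the `limit * 2` headroom in step 3 is meant to absorb.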

### SQL Core

```sql
-- Get the embedding for the target document (chunk 0 = representative).
-- ?1 = document id; the `* 1000` implies rowid = document_id * 1000 + chunk_index.
SELECT embedding FROM embeddings WHERE rowid = ?1 * 1000;

-- Find nearest neighbors (?1 = target embedding, ?2 = K)
SELECT
    rowid,
    distance
FROM embeddings
WHERE embedding MATCH ?1
  AND k = ?2
ORDER BY distance;

-- Resolve back to entities.
-- ?1 = document id, i.e. neighbor rowid / 1000 under the same rowid encoding.
SELECT d.source_type, d.source_id, d.title, d.url, i.iid, i.state
FROM documents d
JOIN issues i ON d.source_id = i.id AND d.source_type = 'issue'
WHERE d.id = ?1;
```
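The queries above return chunk rowids and distances, while the output is per document and reports a similarity. The sketch below shows one way to bridge that gap. It assumes the rowid encoding `document_id * 1000 + chunk_index` (inferred from the `?1 * 1000` lookup) and, for the conversion, Euclidean distance over unit-length vectors; neither is confirmed by the source, and all function names are hypothetical.

```python
CHUNKS_PER_DOCUMENT = 1000  # assumed from the `?1 * 1000` in the first query


def chunk_rowid(document_id, chunk_index=0):
    """Rowid of one chunk under the assumed encoding."""
    if not 0 <= chunk_index < CHUNKS_PER_DOCUMENT:
        raise ValueError("chunk_index out of range")
    return document_id * CHUNKS_PER_DOCUMENT + chunk_index


def document_id_of(rowid):
    """Invert chunk_rowid: recover the document id from an embeddings rowid."""
    return rowid // CHUNKS_PER_DOCUMENT


def best_distance_per_document(knn_rows):
    """Collapse (rowid, distance) chunk hits to the closest chunk per document."""
    best = {}
    for rowid, distance in knn_rows:
        doc_id = document_id_of(rowid)
        if doc_id not in best or distance < best[doc_id]:
            best[doc_id] = distance
    return best


def cosine_from_l2(distance):
    """Cosine similarity from Euclidean distance; valid only for unit vectors."""
    return 1.0 - (distance * distance) / 2.0


rows = [(42000, 0.0), (38000, 0.6), (38001, 0.4), (51002, 1.2)]
print(best_distance_per_document(rows))  # {42: 0.0, 38: 0.4, 51: 1.2}
```

If the `embeddings` table is instead configured for cosine distance, the conversion would simply be `1 - distance`.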

## Robot Mode Output

```json
{
  "ok": true,
  "data": {
    "query_issue": { "iid": 42, "title": "Login timeout on slow networks" },
    "similar": [
      {
        "iid": 38,
        "title": "Connection timeout in auth flow",
        "project": "group/backend",
        "similarity": 0.87,
        "state": "closed",
        "url": "https://gitlab.com/group/backend/-/issues/38"
      }
    ]
  },
  "meta": { "elapsed_ms": 45, "candidates_scanned": 200 }
}
```
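A script consuming this output might pick likely duplicates by thresholding the `similarity` field. The sketch below uses only the fields shown in the sample; the `0.8` cutoff, the `likely_duplicates` name, and the handling of `"ok": false` are assumptions, since the source does not describe the failure shape.

```python
import json

SAMPLE = """
{
  "ok": true,
  "data": {
    "query_issue": {"iid": 42, "title": "Login timeout on slow networks"},
    "similar": [
      {"iid": 38, "title": "Connection timeout in auth flow",
       "project": "group/backend", "similarity": 0.87, "state": "closed",
       "url": "https://gitlab.com/group/backend/-/issues/38"}
    ]
  },
  "meta": {"elapsed_ms": 45, "candidates_scanned": 200}
}
"""


def likely_duplicates(raw_json, threshold=0.8):
    """Return (iid, similarity, state) for matches at or above the threshold."""
    payload = json.loads(raw_json)
    if not payload.get("ok"):
        raise RuntimeError("lore reported a failure")
    return [
        (hit["iid"], hit["similarity"], hit["state"])
        for hit in payload["data"]["similar"]
        if hit["similarity"] >= threshold
    ]


print(likely_duplicates(SAMPLE))  # [(38, 0.87, 'closed')]
```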

## Downsides

- Embedding quality depends on description quality; short issues may not match well
- Multi-chunk documents need an aggregation strategy (use chunk 0 or average?)
- Requires embeddings to be generated first (`lore embed`)
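
For the multi-chunk question, the "average" option could be sketched as mean pooling followed by renormalization, so the pooled vector stays comparable to unit-length chunk vectors. This is one possible strategy, not a decision the source has made; `mean_pooled` is a hypothetical helper.

```python
import math


def mean_pooled(chunks):
    """Average equal-length chunk vectors, then rescale to unit length."""
    if not chunks:
        raise ValueError("need at least one chunk vector")
    dim = len(chunks[0])
    if any(len(c) != dim for c in chunks):
        raise ValueError("chunk vectors must share one dimension")
    mean = [sum(c[i] for c in chunks) / len(chunks) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    if norm == 0.0:
        return mean  # all-zero average; nothing to rescale
    return [x / norm for x in mean]


pooled = mean_pooled([[1.0, 0.0], [0.0, 1.0]])
print(pooled)  # two equal components, unit length overall
```

Using chunk 0 alone is cheaper and matches the SQL above, but it ignores anything said later in a long description; pooling trades one extra pass over the chunks for coverage of the whole document.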

## Extensions

- `lore similar --open-only` to filter to unresolved issues (duplicate triage)
- `lore similar --text "free text query"` to find issues similar to arbitrary text
- Batch mode: find all potential duplicate clusters across the entire database