# Similar Issues Finder

- **Command:** `lore similar `
- **Confidence:** 95%
- **Tier:** 1
- **Status:** proposed
- **Effort:** low; infrastructure exists, needs one new query path

## What

Given an issue IID, find the N most semantically similar issues using the existing vector embeddings. Show similarity score and overlapping keywords.

Can also work with MRs: `lore similar --mr `.

## Why

Duplicate detection is a constant problem on active projects. "Is this bug already filed?" becomes a one-liner. This is the most natural use of the embedding pipeline and the feature people expect when they hear "semantic search."

## Data Required

All exists today:

- `documents` table (source_type, source_id, content_text)
- `embeddings` virtual table (768-dim vectors via sqlite-vec)
- `embedding_metadata` (document_hash for staleness check)

## Implementation Sketch

```
1. Resolve IID → issue.id → document.id (via source_type='issue', source_id)
2. Look up embedding vector(s) for that document
3. Query sqlite-vec for K nearest neighbors (K = limit * 2 for headroom)
4. Filter to source_type='issue' (or 'merge_request' if --include-mrs)
5. Exclude self
6. Rank by cosine similarity
7. Return top N with: iid, title, project, similarity_score, url
```

### SQL Core

```sql
-- Get the embedding for target document (chunk 0 = representative)
SELECT embedding
FROM embeddings
WHERE rowid = ?1 * 1000;

-- Find nearest neighbors
SELECT rowid, distance
FROM embeddings
WHERE embedding MATCH ?1
  AND k = ?2
ORDER BY distance;

-- Resolve back to entities
SELECT d.source_type, d.source_id, d.title, d.url, i.iid, i.state
FROM documents d
JOIN issues i ON d.source_id = i.id AND d.source_type = 'issue'
WHERE d.id = ?;
```

## Robot Mode Output

```json
{
  "ok": true,
  "data": {
    "query_issue": {
      "iid": 42,
      "title": "Login timeout on slow networks"
    },
    "similar": [
      {
        "iid": 38,
        "title": "Connection timeout in auth flow",
        "project": "group/backend",
        "similarity": 0.87,
        "state": "closed",
        "url": "https://gitlab.com/group/backend/-/issues/38"
      }
    ]
  },
  "meta": {
    "elapsed_ms": 45,
    "candidates_scanned": 200
  }
}
```

## Downsides

- Embedding quality depends on description quality; short issues may not match well
- Multi-chunk documents need aggregation strategy (use chunk 0 or average?)
- Requires embeddings to be generated first (`lore embed`)

## Extensions

- `lore similar --open-only` to filter to unresolved issues (duplicate triage)
- `lore similar --text "free text query"` to find issues similar to arbitrary text
- Batch mode: find all potential duplicate clusters across the entire database
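
## Ranking Sketch

Steps 3 to 7 of the Implementation Sketch (over-fetch, filter, exclude self, rank, truncate) could be sketched roughly as below. This is an illustration, not the project's implementation. It rests on two assumptions that the proposal does not confirm: that an `embeddings` rowid encodes `document_id * 1000 + chunk_index` (inferred from the `?1 * 1000` lookup in SQL Core), and that the `distance` column holds cosine distance, so similarity can be taken as `1 - distance`. The names `rank_similar`, `Similar`, and `CHUNK_STRIDE` are hypothetical.

```python
from dataclasses import dataclass

# Assumption: rowid = document_id * CHUNK_STRIDE + chunk_index,
# inferred from the `?1 * 1000` lookup in SQL Core.
CHUNK_STRIDE = 1000


@dataclass
class Similar:
    """One ranked result (hypothetical shape, before resolving iid/title/url)."""
    document_id: int
    similarity: float


def rank_similar(neighbors, target_doc_id, doc_types, limit,
                 allowed_types=("issue",)):
    """Turn raw KNN rows into the top `limit` similar documents.

    neighbors:  iterable of (rowid, distance) pairs from the KNN query
    doc_types:  dict mapping document_id -> source_type
    """
    best = {}  # document_id -> smallest distance seen across its chunks
    for rowid, distance in neighbors:
        doc_id = rowid // CHUNK_STRIDE
        if doc_id == target_doc_id:
            continue  # step 5: exclude self
        if doc_types.get(doc_id) not in allowed_types:
            continue  # step 4: filter by source type
        if doc_id not in best or distance < best[doc_id]:
            best[doc_id] = distance

    # Step 6: smallest distance first, i.e. highest similarity first.
    ranked = sorted(best.items(), key=lambda item: item[1])

    # Step 7: keep the top N. Assumes cosine distance, so
    # similarity = 1 - distance.
    return [Similar(doc_id, round(1.0 - dist, 4))
            for doc_id, dist in ranked[:limit]]
```

Keeping the best-scoring chunk per document is only one possible answer to the multi-chunk aggregation question raised under Downsides; averaging chunk distances would slot into the same loop. Over-fetching with `K = limit * 2` matters here because the type filter, the self-exclusion, and the per-document collapse each discard rows after the KNN query has already returned.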