# Similar Issues Finder

- **Command:** `lore similar <iid>`
- **Confidence:** 95%
- **Tier:** 1
- **Status:** proposed
- **Effort:** low — infrastructure exists, needs one new query path

## What

Given an issue IID, find the N most semantically similar issues using the existing
vector embeddings. Show similarity score and overlapping keywords.

Can also work with MRs: `lore similar --mr <iid>`.

## Why

Duplicate detection is a constant problem on active projects. "Is this bug already
filed?" becomes a one-liner. This is the most natural use of the embedding pipeline
and the feature people expect when they hear "semantic search."

## Data Required

Everything needed exists today:
- `documents` table (source_type, source_id, content_text)
- `embeddings` virtual table (768-dim vectors via sqlite-vec)
- `embedding_metadata` (document_hash for staleness check)

## Implementation Sketch

```
1. Resolve IID → issue.id → document.id (via source_type='issue', source_id)
2. Look up embedding vector(s) for that document
3. Query sqlite-vec for K nearest neighbors (K = limit * 2, headroom for rows dropped in steps 4-5)
4. Filter to source_type='issue' (or 'merge_request' if --include-mrs)
5. Exclude self
6. Rank by cosine similarity (ascending distance from the KNN query)
7. Return top N with: iid, title, project, similarity_score, url
```
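
Steps 4 through 7 can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the `Candidate` shape and the `rank_similar` function are hypothetical, `include_mrs` is read as "allow merge requests in addition to issues", and the distance is assumed to be a cosine distance so that `similarity = 1 - distance`.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One KNN hit already resolved to its source entity (hypothetical shape)."""
    source_type: str  # 'issue' or 'merge_request'
    source_id: int
    iid: int
    title: str
    distance: float   # assumed cosine distance: smaller means more similar


def rank_similar(candidates, self_source_id, limit, include_mrs=False):
    """Steps 4-7: filter by type, drop the query issue, rank, truncate."""
    allowed = {"issue", "merge_request"} if include_mrs else {"issue"}
    kept = [
        c for c in candidates
        if c.source_type in allowed
        and not (c.source_type == "issue" and c.source_id == self_source_id)
    ]
    # Ascending distance is the same order as descending similarity.
    kept.sort(key=lambda c: c.distance)
    return [
        {"iid": c.iid, "title": c.title, "similarity": round(1.0 - c.distance, 4)}
        for c in kept[:limit]
    ]
```

Because filtering happens after the nearest-neighbor query, asking for more than `limit` neighbors up front is what keeps the final list full once self and non-matching types are removed.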

### SQL Core

```sql
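-- Get the embedding for the target document.
-- ?1 = document id. The rowid is assumed to encode document_id * 1000 + chunk_index,
-- so `?1 * 1000` selects chunk 0, used here as the representative chunk.
SELECT embedding FROM embeddings WHERE rowid = ?1 * 1000;

-- Find nearest neighbors (?1 = query vector, ?2 = K)
SELECT
    rowid,
    distance
FROM embeddings
WHERE embedding MATCH ?1
    AND k = ?2
ORDER BY distance;

-- Resolve back to entities
-- ? = document id, i.e. neighbor rowid / 1000 under the same encoding assumption
SELECT d.source_type, d.source_id, d.title, d.url, i.iid, i.state
FROM documents d
JOIN issues i ON d.source_id = i.id AND d.source_type = 'issue'
WHERE d.id = ?;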
```
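
The glue between the second and third queries is the rowid decoding, which the SQL leaves implicit. The sketch below assumes the encoding suggested by `?1 * 1000` (rowid = document id * 1000 + chunk index) and keeps the best-scoring chunk per document. The helper names `decode_rowid` and `resolve_hits` are hypothetical, and the KNN hits are passed in as plain `(rowid, distance)` tuples so the example runs with only the standard-library `sqlite3` module.

```python
import sqlite3

CHUNKS_PER_DOC = 1000  # assumption inferred from `?1 * 1000` in the lookup query


def decode_rowid(rowid):
    """Split an embeddings rowid into (document_id, chunk_index)."""
    return divmod(rowid, CHUNKS_PER_DOC)


def resolve_hits(conn, hits, self_document_id, limit):
    """Map (rowid, distance) KNN hits to issue rows, one entry per document."""
    best = {}
    for rowid, distance in hits:
        doc_id, _chunk = decode_rowid(rowid)
        if doc_id == self_document_id:
            continue  # exclude the query document itself
        if doc_id not in best or distance < best[doc_id]:
            best[doc_id] = distance  # keep the closest chunk per document

    results = []
    for doc_id, distance in sorted(best.items(), key=lambda kv: kv[1]):
        row = conn.execute(
            "SELECT d.title, d.url, i.iid, i.state "
            "FROM documents d "
            "JOIN issues i ON d.source_id = i.id AND d.source_type = 'issue' "
            "WHERE d.id = ?",
            (doc_id,),
        ).fetchone()
        if row is not None:  # non-issue documents drop out of the join
            results.append({"iid": row[2], "title": row[0], "state": row[3],
                            "url": row[1], "distance": distance})
        if len(results) == limit:
            break
    return results
```

Issuing one lookup per document is acceptable at this scale (at most K rows); a single `WHERE d.id IN (...)` query would be the obvious optimization.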

## Robot Mode Output

```json
{
  "ok": true,
  "data": {
    "query_issue": { "iid": 42, "title": "Login timeout on slow networks" },
    "similar": [
      {
        "iid": 38,
        "title": "Connection timeout in auth flow",
        "project": "group/backend",
        "similarity": 0.87,
        "state": "closed",
        "url": "https://gitlab.com/group/backend/-/issues/38"
      }
    ]
  },
  "meta": { "elapsed_ms": 45, "candidates_scanned": 200 }
}
```
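
A script consuming this output might look like the following sketch. It relies only on the fields shown in the example above; the `likely_duplicates` helper and the 0.85 cutoff are hypothetical and would need tuning against real data.

```python
import json


def likely_duplicates(robot_json, min_similarity=0.85):
    """Return (iid, similarity, url) for matches at or above the cutoff."""
    payload = json.loads(robot_json)
    if not payload.get("ok"):
        return []  # treat a failed envelope as "no matches"
    return [
        (item["iid"], item["similarity"], item["url"])
        for item in payload["data"]["similar"]
        if item["similarity"] >= min_similarity
    ]
```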

## Downsides

- Embedding quality depends on description quality; short issues may not match well
- Multi-chunk documents need an aggregation strategy (use chunk 0 or average?)
- Requires embeddings to be generated first (`lore embed`)
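
For the aggregation question, the "average" option can be illustrated with mean pooling: average the chunk vectors element-wise and re-normalize, then compare documents by cosine similarity. This is a pure-Python sketch of one possible strategy, not a recommendation; `mean_pool` and `cosine_similarity` are hypothetical helper names.

```python
import math


def mean_pool(chunks):
    """Average chunk vectors element-wise, then L2-normalize the result."""
    if not chunks:
        raise ValueError("document has no embedded chunks")
    dim = len(chunks[0])
    summed = [0.0] * dim
    for vec in chunks:
        if len(vec) != dim:
            raise ValueError("chunk vectors must share one dimension")
        for i, x in enumerate(vec):
            summed[i] += x
    mean = [s / len(chunks) for s in summed]
    norm = math.sqrt(sum(x * x for x in mean))
    return mean if norm == 0.0 else [x / norm for x in mean]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Using chunk 0 alone is cheaper and needs no extra storage, while pooling represents long descriptions more evenly at the cost of computing (or storing) one extra vector per document.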

## Extensions

- `lore similar --open-only` to filter to unresolved issues (duplicate triage)
- `lore similar --text "free text query"` to find issues similar to arbitrary text
- Batch mode: find all potential duplicate clusters across the entire database
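
The batch-mode idea could be approached by linking every pair of issues whose similarity clears a threshold and reading off the connected groups. The sketch below uses a small union-find for that; `duplicate_clusters` is a hypothetical name, the input is assumed to be `(iid_a, iid_b, similarity)` tuples produced by earlier similarity queries, and the 0.85 cutoff is illustrative.

```python
def duplicate_clusters(pairs, min_similarity=0.85):
    """Group IIDs into clusters connected by pairs at or above the cutoff."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, similarity in pairs:
        if similarity >= min_similarity:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two groups

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    # Only groups with at least two members are potential duplicates.
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)
```

Linking is transitive, so a chain of pairwise matches can pull loosely related issues into one cluster; a stricter threshold or a review step would help keep clusters meaningful.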