docs: Overhaul AGENTS.md, update README, add pipeline spec and Phase B plan

AGENTS.md: Comprehensive rewrite adding file deletion safeguards,
destructive git command protocol, Rust toolchain conventions, code
editing discipline rules, compiler check requirements, TDD mandate,
MCP Agent Mail coordination protocol, beads/bv/ubs/ast-grep/cass
tool documentation, and session completion workflow.

README.md: Document NO_COLOR/CLICOLOR env vars, --since 1m duration,
project resolution cascading match logic, lore health and robot-docs
commands, exit codes 17 (not found) and 18 (ambiguous match),
--color/--quiet global flags, dirty_sources and
pending_discussion_fetches tables, and version command git hash output.

docs/embedding-pipeline-hardening.md: Detailed spec covering the three
problems from the chunk size reduction (broken --full wiring, mixed
chunk sizes in vector space, static dedup multiplier) with decision
records, implementation plan, and acceptance criteria.

docs/phase-b-temporal-intelligence.md: Draft planning document for
transforming gitlore from a search engine into a temporal code
intelligence system by ingesting structured event data from GitLab.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
Taylor Eernisse
2026-02-03 09:35:51 -05:00
parent f560e6bc00
commit a417640faa
4 changed files with 1891 additions and 4 deletions

# Embedding Pipeline Hardening: Chunk Config Drift, Adaptive Dedup, Full Flag Wiring
> **Status:** Proposed
> **Date:** 2026-02-02
> **Context:** Reduced CHUNK_MAX_BYTES from 32KB to 6KB to prevent Ollama context window overflow. This plan addresses the downstream consequences of that change.
## Problem Statement
Three issues stem from the chunk size reduction:
1. **Broken `--full` wiring**: `handle_embed` in main.rs ignores `args.full` (calls `run_embed` instead of `run_embed_full`). `run_sync` hardcodes `false` for retry_failed and never passes `options.full` to embed. Users running `lore sync --full` or `lore embed --full` don't get a full re-embed.
2. **Mixed chunk sizes in vector space**: Existing embeddings (32KB chunks) coexist with new embeddings (6KB chunks). These are semantically incomparable -- different granularity vectors in the same KNN space degrade search quality. No mechanism detects this drift.
3. **Static dedup multiplier**: `search_vector` uses `limit * 8` to over-fetch for dedup. With smaller chunks producing 5-6 chunks per document, clustered search results can exhaust slots before reaching `limit` unique documents. The multiplier should adapt to actual data.
## Decision Record
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Detect chunk config drift | Store `chunk_max_bytes` in `embedding_metadata` | Allows automatic invalidation without user intervention. Self-heals on next sync. |
| Dedup multiplier strategy | Adaptive from DB with static floor | One cheap aggregate query per search. Self-adjusts as data grows. No wasted KNN budget. |
| `--full` propagation | `sync --full` passes full to embed step | Matches user expectation: "start fresh" means everything, not just ingest+docs. |
| Migration strategy | New migration 010 for `chunk_max_bytes` column | Non-breaking additive change. NULL values = "unknown config" treated as needing re-embed. |
---
## Changes
### Change 1: Wire `--full` flag through to embed
**Files:**
- `src/main.rs` (line 1116)
- `src/cli/commands/sync.rs` (line 105)
**main.rs `handle_embed`** (line 1116):
```rust
// BEFORE:
let result = run_embed(&config, retry_failed).await?;
// AFTER:
let result = run_embed_full(&config, args.full, retry_failed).await?;
```
Update the import at top of main.rs from `run_embed` to `run_embed_full`.
**sync.rs `run_sync`** (line 105):
```rust
// BEFORE:
match run_embed(config, false).await {
// AFTER:
match run_embed_full(config, options.full, false).await {
```
Update the import at line 11 from `run_embed` to `run_embed_full`.
**Cleanup `embed.rs`**: Remove `run_embed` (the wrapper that hardcodes `full: false`). All callers should use `run_embed_full` directly. Rename `run_embed_full` to `run_embed` with the 3-arg signature `(config, full, retry_failed)`.
Final signature:
```rust
pub async fn run_embed(
config: &Config,
full: bool,
retry_failed: bool,
) -> Result<EmbedCommandResult>
```
---
### Change 2: Migration 010 -- add `chunk_max_bytes` to `embedding_metadata`
**New file:** `migrations/010_chunk_config.sql`
```sql
-- Migration 010: Chunk config tracking
-- Schema version: 10
-- Adds chunk_max_bytes to embedding_metadata for drift detection.
-- Existing rows get NULL, which the change detector treats as "needs re-embed".
ALTER TABLE embedding_metadata ADD COLUMN chunk_max_bytes INTEGER;
UPDATE schema_version SET version = 10
WHERE version = (SELECT MAX(version) FROM schema_version);
-- Or if using INSERT pattern:
INSERT INTO schema_version (version, applied_at, description)
VALUES (10, strftime('%s', 'now') * 1000, 'Add chunk_max_bytes to embedding_metadata for config drift detection');
```
Check existing migration pattern in `src/core/db.rs` for how migrations are applied -- follow that exact pattern for consistency.
---
### Change 3: Store `chunk_max_bytes` when writing embeddings
**File:** `src/embedding/pipeline.rs`
**`store_embedding`** (lines 238-266): Add `chunk_max_bytes` to the INSERT:
```rust
// Add import at top:
use crate::embedding::chunking::CHUNK_MAX_BYTES;
// In store_embedding, update SQL:
conn.execute(
"INSERT OR REPLACE INTO embedding_metadata
(document_id, chunk_index, model, dims, document_hash, chunk_hash,
created_at, attempt_count, last_error, chunk_max_bytes)
VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, NULL, ?8)",
rusqlite::params![
doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
doc_hash, chunk_hash, now, CHUNK_MAX_BYTES as i64
],
)?;
```
**`record_embedding_error`** (lines 269-291): Also store `chunk_max_bytes` so error rows track which config they failed under:
```rust
conn.execute(
"INSERT INTO embedding_metadata
(document_id, chunk_index, model, dims, document_hash, chunk_hash,
created_at, attempt_count, last_error, last_attempt_at, chunk_max_bytes)
VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, ?8, ?7, ?9)
ON CONFLICT(document_id, chunk_index) DO UPDATE SET
attempt_count = embedding_metadata.attempt_count + 1,
last_error = ?8,
last_attempt_at = ?7,
chunk_max_bytes = ?9",
rusqlite::params![
doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
doc_hash, chunk_hash, now, error, CHUNK_MAX_BYTES as i64
],
)?;
```
---
### Change 4: Detect chunk config drift in change detector
**File:** `src/embedding/change_detector.rs`
Add a third condition to the pending detection: embeddings where `chunk_max_bytes` differs from the current `CHUNK_MAX_BYTES` constant (or is NULL, meaning pre-migration embeddings).
```rust
use crate::embedding::chunking::CHUNK_MAX_BYTES;
pub fn find_pending_documents(
conn: &Connection,
page_size: usize,
last_id: i64,
) -> Result<Vec<PendingDocument>> {
let sql = r#"
SELECT d.id, d.content_text, d.content_hash
FROM documents d
WHERE d.id > ?1
AND (
-- Case 1: No embedding metadata (new document)
NOT EXISTS (
SELECT 1 FROM embedding_metadata em
WHERE em.document_id = d.id AND em.chunk_index = 0
)
-- Case 2: Document content changed
OR EXISTS (
SELECT 1 FROM embedding_metadata em
WHERE em.document_id = d.id AND em.chunk_index = 0
AND em.document_hash != d.content_hash
)
-- Case 3: Chunk config drift (different chunk size or pre-migration NULL)
OR EXISTS (
SELECT 1 FROM embedding_metadata em
WHERE em.document_id = d.id AND em.chunk_index = 0
AND (em.chunk_max_bytes IS NULL OR em.chunk_max_bytes != ?3)
)
)
ORDER BY d.id
LIMIT ?2
"#;
let mut stmt = conn.prepare(sql)?;
let rows = stmt
.query_map(
rusqlite::params![last_id, page_size as i64, CHUNK_MAX_BYTES as i64],
|row| {
Ok(PendingDocument {
document_id: row.get(0)?,
content_text: row.get(1)?,
content_hash: row.get(2)?,
})
},
)?
.collect::<std::result::Result<Vec<_>, _>>()?;
Ok(rows)
}
```
Apply the same change to `count_pending_documents` -- add the third OR clause and the `?3` parameter.
---
### Change 5: Adaptive dedup multiplier in vector search
**File:** `src/search/vector.rs`
Replace the static `limit * 8` with an adaptive multiplier based on the actual max chunks-per-document in the database.
```rust
/// Query the max chunks any single document has in the embedding table.
/// Returns the max chunk count, or a default floor if no data exists.
fn max_chunks_per_document(conn: &Connection) -> i64 {
conn.query_row(
"SELECT COALESCE(MAX(cnt), 1) FROM (
SELECT COUNT(*) as cnt FROM embedding_metadata
WHERE last_error IS NULL
GROUP BY document_id
)",
[],
|row| row.get(0),
)
.unwrap_or(1)
}
pub fn search_vector(
conn: &Connection,
query_embedding: &[f32],
limit: usize,
) -> Result<Vec<VectorResult>> {
if query_embedding.is_empty() || limit == 0 {
return Ok(Vec::new());
}
let embedding_bytes: Vec<u8> = query_embedding
.iter()
.flat_map(|f| f.to_le_bytes())
.collect();
// Adaptive over-fetch: use actual max chunks per doc, with floor of 8x
// The 1.5x safety margin handles clustering in KNN results
let max_chunks = max_chunks_per_document(conn);
let multiplier = (max_chunks as usize * 3 / 2).max(8);
let k = limit * multiplier;
// ... rest unchanged ...
}
```
**Why `max_chunks * 1.5` with floor of 8**:
- `max_chunks` is the worst case for a single document dominating results
- `* 1.5` adds margin for multiple clustered documents
- Floor of `8` ensures reasonable over-fetch even with single-chunk documents
- This is a single aggregate query on an indexed column -- sub-millisecond
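The multiplier arithmetic above can be sketched as a pure function (helper name is illustrative; the real code feeds `max_chunks_per_document` from SQLite):

```rust
// Adaptive over-fetch: 1.5x margin over the worst-case chunks-per-document,
// floored at 8x, as described in the bullets above.
fn overfetch_k(limit: usize, max_chunks: usize) -> usize {
    let multiplier = (max_chunks * 3 / 2).max(8);
    limit * multiplier
}

fn main() {
    // Single-chunk corpus: the floor keeps over-fetch at 8x.
    assert_eq!(overfetch_k(20, 1), 160);
    // 6KB chunks, worst case 6 chunks/doc: 6 * 1.5 = 9x.
    assert_eq!(overfetch_k(20, 6), 180);
}
```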
---
### Change 6: Update chunk_ids.rs comment
**File:** `src/embedding/chunk_ids.rs` (lines 1-3)
Update the comment to reflect current reality:
```rust
/// Multiplier for encoding (document_id, chunk_index) into a single rowid.
/// Supports up to 1000 chunks per document. At CHUNK_MAX_BYTES=6000,
/// a 2MB document (MAX_DOCUMENT_BYTES_HARD) produces ~333 chunks.
pub const CHUNK_ROWID_MULTIPLIER: i64 = 1000;
```
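A minimal sketch of the encoding the multiplier supports, plus the headroom arithmetic from the comment (function names are illustrative, not the actual `chunk_ids.rs` API):

```rust
// Encode (document_id, chunk_index) into one rowid with 1000 slots per doc.
const CHUNK_ROWID_MULTIPLIER: i64 = 1000;

fn encode_rowid(document_id: i64, chunk_index: i64) -> i64 {
    debug_assert!(chunk_index < CHUNK_ROWID_MULTIPLIER);
    document_id * CHUNK_ROWID_MULTIPLIER + chunk_index
}

fn decode_rowid(rowid: i64) -> (i64, i64) {
    (rowid / CHUNK_ROWID_MULTIPLIER, rowid % CHUNK_ROWID_MULTIPLIER)
}

fn main() {
    assert_eq!(encode_rowid(42, 7), 42_007);
    assert_eq!(decode_rowid(42_007), (42, 7));
    // Headroom check: a 2MB document at 6KB chunks stays well under 1000 slots.
    assert!(2_000_000 / 6_000 < 1000); // ~333 chunks
}
```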
---
## Files Modified (Summary)
| File | Change |
|------|--------|
| `migrations/010_chunk_config.sql` | **NEW** -- Add `chunk_max_bytes` column |
| `src/embedding/pipeline.rs` | Store `CHUNK_MAX_BYTES` in metadata writes |
| `src/embedding/change_detector.rs` | Detect chunk config drift (3rd OR clause) |
| `src/search/vector.rs` | Adaptive dedup multiplier from DB |
| `src/cli/commands/embed.rs` | Consolidate to single `run_embed(config, full, retry_failed)` |
| `src/cli/commands/sync.rs` | Pass `options.full` to embed, update import |
| `src/main.rs` | Call `run_embed` with `args.full`, update import |
| `src/embedding/chunk_ids.rs` | Comment update only |
## Verification
1. **Compile check**: `cargo build` -- no errors
2. **Unit tests**: `cargo test` -- all existing tests pass
3. **Migration test**: Run `lore doctor` or `lore migrate` -- migration 010 applies cleanly
4. **Full flag wiring**: `lore embed --full` should clear all embeddings and re-embed. Verify by checking `lore --robot stats` before and after (embedded count should reset then rebuild).
5. **Chunk config drift**: After migration, existing embeddings have `chunk_max_bytes = NULL`. Running `lore embed` (without --full) should detect all existing embeddings as stale and re-embed them automatically.
6. **Sync propagation**: `lore sync --full` should produce the same embed behavior as `lore embed --full`
7. **Adaptive dedup**: Run `lore search "some query"` and verify the result count matches the requested limit (default 20). Check with `RUST_LOG=debug` that the computed `k` value scales with actual chunk distribution.
## Decision Record (for future reference)
**Date:** 2026-02-02
**Trigger:** Reduced CHUNK_MAX_BYTES from 32KB to 6KB to prevent Ollama nomic-embed-text context window overflow (8192 tokens).
**Downstream consequences identified:**
1. Chunk ID headroom reduced (1000 slots, now ~333 used for 2MB docs) -- acceptable, no action needed
2. Vector search dedup pressure increased 5x -- fixed with adaptive multiplier
3. Embedding DB grows ~5x -- acceptable at current scale (~7.5MB)
4. Mixed chunk sizes degrade search -- fixed with config drift detection
5. Ollama API call volume increases proportionally -- acceptable for local model
**Rejected alternatives:**
- Two-phase KNN fetch (fetch, check, re-fetch with higher k): adds code complexity for marginal improvement over adaptive. sqlite-vec doesn't support OFFSET in KNN queries, requiring full re-query.
- Generous static multiplier (15x): wastes KNN budget on datasets where documents are small. Over-allocates permanently instead of adapting.
- Manual `--full` as the only drift remedy: requires users to understand chunk config internals. Violates principle of least surprise.

# Phase B: Temporal Intelligence Foundation
> **Status:** Draft
> **Prerequisite:** CP3 Gates B+C complete (working search + sync pipeline)
> **Goal:** Transform gitlore from a search engine into a temporal code intelligence system by ingesting structured event data from GitLab and exposing temporal queries that answer "why" and "when" questions about project history.
---
## Motivation
gitlore currently stores **snapshots** — the latest state of each issue, MR, and discussion. But temporal queries need **change history**. When an issue's labels change from `priority::low` to `priority::critical`, the current schema overwrites the label junction. The transition is lost.
GitLab issues, MRs, and discussions contain the raw ingredients for temporal intelligence: state transitions, label mutations, assignee changes, cross-references between entities, and decision rationale in discussions. What's missing is a structured temporal index that makes these ingredients queryable.
### The Problem This Solves
Today, when an AI agent or developer asks "Why did the team switch from REST to GraphQL?" or "What happened with the auth migration?", the answer is scattered across paginated API responses with no temporal index, no cross-referencing, and no semantic layer. Reconstructing a decision timeline manually takes 20+ minutes of clicking through GitLab's UI. This phase makes it take 2 seconds.
### Forcing Function
This phase is designed around one concrete question: **"What happened with X?"** — where X is any keyword, feature name, or initiative. If `lore timeline "auth migration"` can produce a useful, chronologically-ordered narrative of all related events across issues, MRs, and discussions, the architecture is validated. If it can't, we learn what's missing before investing in deeper temporal features.
---
## Executive Summary (Gated Milestones)
Five gates, each independently verifiable and shippable:
**Gate 1 (Resource Events Ingestion):** Structured event data from GitLab APIs → local event tables
**Gate 2 (Cross-Reference Extraction):** Entity relationship graph from structured APIs + system note parsing
**Gate 3 (Decision Timeline):** `lore timeline` command — keyword-driven chronological narrative
**Gate 4 (File Decision History):** `lore file-history` command — MR-to-file linking + scoped timelines
**Gate 5 (Code Trace):** `lore trace` command — file:line → commit → MR → issue → rationale chain
### Key Design Decisions
- **Structured APIs over text parsing.** GitLab provides Resource Events APIs (`resource_state_events`, `resource_label_events`, `resource_milestone_events`) that return clean JSON. These are the primary data source for temporal events. System note parsing is a fallback for events without structured APIs (assignee changes, cross-references).
- **Dependent resource pattern.** Resource events are fetched per-entity, triggered by the existing dirty source tracking. Same architecture as discussion fetching — queue-based, resumable, incremental.
- **Opt-out event ingestion.** New config flag `sync.fetchResourceEvents` (default `true`) controls whether the sync pipeline fetches event data. Users who don't need temporal features can disable it to skip the additional API calls.
- **Application-level graph traversal.** Cross-reference expansion uses BFS in Rust, not recursive SQL CTEs. Capped at configurable depth (default 1) for predictable performance.
- **Evolutionary library extraction.** New commands are built with typed return structs from day one. Old commands are not retrofitted until a concrete consumer (MCP server, web UI) requires it.
- **Phase A fields cherry-picked as needed.** `merge_commit_sha` and `squash_commit_sha` are added in this phase's migration. Remaining Phase A fields are handled in their own migration later.
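The depth-capped BFS decision above can be sketched with plain std collections (the adjacency map stands in for queries against `entity_references`; the ids and helper name are illustrative):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Application-level cross-reference expansion: breadth-first, capped at
// max_depth hops for predictable cost, instead of recursive SQL CTEs.
fn expand(seeds: &[i64], refs: &HashMap<i64, Vec<i64>>, max_depth: u32) -> HashSet<i64> {
    let mut seen: HashSet<i64> = seeds.iter().copied().collect();
    let mut queue: VecDeque<(i64, u32)> = seeds.iter().map(|&id| (id, 0)).collect();
    while let Some((id, depth)) = queue.pop_front() {
        if depth >= max_depth {
            continue; // depth cap: do not follow references further
        }
        for &next in refs.get(&id).into_iter().flatten() {
            if seen.insert(next) {
                queue.push_back((next, depth + 1));
            }
        }
    }
    seen
}

fn main() {
    let refs = HashMap::from([(1, vec![2]), (2, vec![3])]);
    // depth 0: matched entities only; depth 1: direct references (the default).
    assert_eq!(expand(&[1], &refs, 0).len(), 1);
    assert_eq!(expand(&[1], &refs, 1).len(), 2);
    assert_eq!(expand(&[1], &refs, 2).len(), 3);
}
```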
### Scope Boundaries
**In scope:**
- Batch temporal queries over historical data
- Structured event ingestion from GitLab APIs
- Cross-reference graph construction
- CLI commands with robot mode JSON output
**Out of scope (future phases):**
- Real-time monitoring / notifications ("alert me when my code changes")
- MCP server (Phase C — consumes the library API this phase produces)
- Web UI (Phase D — consumes the same library API)
- Pattern evolution / cross-project trend detection (Phase C)
- Library extraction refactor (happens organically as new commands are added)
---
## Gate 1: Resource Events Ingestion
### 1.1 Rationale: Why Not Parse System Notes?
The original approach was to parse system note body text with regex to extract state changes and label mutations. Research revealed this is the wrong approach:
1. **Structured APIs exist.** GitLab's Resource Events APIs return clean JSON with explicit `action`, `state`, and `label` fields. Available on all tiers (Free, Premium, Ultimate).
2. **System notes are localized.** A French GitLab instance says `"ajouté l'étiquette ~bug"` — regex breaks for non-English instances.
3. **Label events aren't in the Notes API.** Per [GitLab Issue #24661](https://gitlab.com/gitlab-org/gitlab/-/issues/24661), label change system notes are not returned by the Notes API. The Resource Label Events API is the only reliable source.
4. **No versioned format spec.** System note text has changed across GitLab 14.x through 17.x with no documentation of format changes.
System note parsing is still used for events without structured APIs (see Gate 2), but with the explicit understanding that it's best-effort and fragile for non-English instances.
### 1.2 Schema (Migration 010)
**File:** `migrations/010_resource_events.sql`
```sql
-- State change events (opened, closed, reopened, merged, locked)
-- Source: GET /projects/:id/issues/:iid/resource_state_events
-- Source: GET /projects/:id/merge_requests/:iid/resource_state_events
CREATE TABLE resource_state_events (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
issue_id INTEGER REFERENCES issues(id) ON DELETE CASCADE,
merge_request_id INTEGER REFERENCES merge_requests(id) ON DELETE CASCADE,
state TEXT NOT NULL, -- 'opened' | 'closed' | 'reopened' | 'merged' | 'locked'
actor_gitlab_id INTEGER, -- GitLab user ID (stable; usernames can change)
actor_username TEXT, -- display/search convenience
created_at INTEGER NOT NULL, -- ms epoch UTC
-- "closed by MR" link: structured by GitLab, not parsed from text
source_merge_request_id INTEGER, -- GitLab's MR iid that caused this state change
source_commit TEXT, -- commit SHA that caused this state change
UNIQUE(gitlab_id, project_id),
CHECK (
(issue_id IS NOT NULL AND merge_request_id IS NULL)
OR (issue_id IS NULL AND merge_request_id IS NOT NULL)
)
);
CREATE INDEX idx_state_events_issue ON resource_state_events(issue_id)
WHERE issue_id IS NOT NULL;
CREATE INDEX idx_state_events_mr ON resource_state_events(merge_request_id)
WHERE merge_request_id IS NOT NULL;
CREATE INDEX idx_state_events_created ON resource_state_events(created_at);
-- Label change events (add, remove)
-- Source: GET /projects/:id/issues/:iid/resource_label_events
-- Source: GET /projects/:id/merge_requests/:iid/resource_label_events
CREATE TABLE resource_label_events (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
issue_id INTEGER REFERENCES issues(id) ON DELETE CASCADE,
merge_request_id INTEGER REFERENCES merge_requests(id) ON DELETE CASCADE,
label_name TEXT NOT NULL,
action TEXT NOT NULL CHECK (action IN ('add', 'remove')),
actor_gitlab_id INTEGER, -- GitLab user ID (stable; usernames can change)
actor_username TEXT, -- display/search convenience
created_at INTEGER NOT NULL, -- ms epoch UTC
UNIQUE(gitlab_id, project_id),
CHECK (
(issue_id IS NOT NULL AND merge_request_id IS NULL)
OR (issue_id IS NULL AND merge_request_id IS NOT NULL)
)
);
CREATE INDEX idx_label_events_issue ON resource_label_events(issue_id)
WHERE issue_id IS NOT NULL;
CREATE INDEX idx_label_events_mr ON resource_label_events(merge_request_id)
WHERE merge_request_id IS NOT NULL;
CREATE INDEX idx_label_events_created ON resource_label_events(created_at);
CREATE INDEX idx_label_events_label ON resource_label_events(label_name);
-- Milestone change events (add, remove)
-- Source: GET /projects/:id/issues/:iid/resource_milestone_events
-- Source: GET /projects/:id/merge_requests/:iid/resource_milestone_events
CREATE TABLE resource_milestone_events (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
issue_id INTEGER REFERENCES issues(id) ON DELETE CASCADE,
merge_request_id INTEGER REFERENCES merge_requests(id) ON DELETE CASCADE,
milestone_title TEXT NOT NULL,
milestone_id INTEGER,
action TEXT NOT NULL CHECK (action IN ('add', 'remove')),
actor_gitlab_id INTEGER, -- GitLab user ID (stable; usernames can change)
actor_username TEXT, -- display/search convenience
created_at INTEGER NOT NULL, -- ms epoch UTC
UNIQUE(gitlab_id, project_id),
CHECK (
(issue_id IS NOT NULL AND merge_request_id IS NULL)
OR (issue_id IS NULL AND merge_request_id IS NOT NULL)
)
);
CREATE INDEX idx_milestone_events_issue ON resource_milestone_events(issue_id)
WHERE issue_id IS NOT NULL;
CREATE INDEX idx_milestone_events_mr ON resource_milestone_events(merge_request_id)
WHERE merge_request_id IS NOT NULL;
CREATE INDEX idx_milestone_events_created ON resource_milestone_events(created_at);
```
### 1.3 Config Extension
**File:** `src/core/config.rs`
Add to `SyncConfig`:
```rust
/// Fetch resource events (state, label, milestone changes) during sync.
/// Increases API calls but enables temporal queries (lore timeline, etc.).
/// Default: true
#[serde(default = "default_true")]
pub fetch_resource_events: bool,
```
**Config file example:**
```json
{
"sync": {
"fetchResourceEvents": true
}
}
```
### 1.4 GitLab API Client
**New endpoints in `src/gitlab/client.rs`:**
```
GET /projects/:id/issues/:iid/resource_state_events?per_page=100
GET /projects/:id/issues/:iid/resource_label_events?per_page=100
GET /projects/:id/merge_requests/:iid/resource_state_events?per_page=100
GET /projects/:id/merge_requests/:iid/resource_label_events?per_page=100
GET /projects/:id/issues/:iid/resource_milestone_events?per_page=100
GET /projects/:id/merge_requests/:iid/resource_milestone_events?per_page=100
```
All endpoints use standard pagination. Fetch all pages per entity.
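The "fetch all pages" loop can be sketched generically; the `fetch_page` closure stands in for the real HTTP client hitting the endpoints above with `?per_page=100&page=N` (everything here is an illustrative assumption, not the actual client API):

```rust
// Drain standard offset pagination: request pages until a short page arrives.
fn fetch_all_pages<T>(mut fetch_page: impl FnMut(u32) -> Vec<T>) -> Vec<T> {
    let per_page = 100;
    let mut all = Vec::new();
    let mut page = 1;
    loop {
        let batch = fetch_page(page);
        let done = batch.len() < per_page; // a short page is the last page
        all.extend(batch);
        if done {
            break;
        }
        page += 1;
    }
    all
}

fn main() {
    // Fake source with 250 events: pages of 100, 100, 50.
    let events: Vec<i64> = (0..250).collect();
    let fetched = fetch_all_pages(|page| {
        let start = ((page - 1) * 100) as usize;
        events[start.min(events.len())..(start + 100).min(events.len())].to_vec()
    });
    assert_eq!(fetched.len(), 250);
}
```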
**New serde types in `src/gitlab/types.rs`:**
```rust
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct GitLabStateEvent {
pub id: i64,
pub user: Option<GitLabAuthor>,
pub created_at: String,
pub resource_type: String, // "Issue" | "MergeRequest"
pub resource_id: i64,
pub state: String, // "opened" | "closed" | "reopened" | "merged" | "locked"
pub source_commit: Option<String>,
pub source_merge_request: Option<GitLabMergeRequestRef>,
}
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct GitLabLabelEvent {
pub id: i64,
pub user: Option<GitLabAuthor>,
pub created_at: String,
pub resource_type: String,
pub resource_id: i64,
pub label: GitLabLabelRef,
pub action: String, // "add" | "remove"
}
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct GitLabMilestoneEvent {
pub id: i64,
pub user: Option<GitLabAuthor>,
pub created_at: String,
pub resource_type: String,
pub resource_id: i64,
pub milestone: GitLabMilestoneRef,
pub action: String, // "add" | "remove"
}
```
### 1.5 Ingestion Pipeline
**Architecture:** Generic dependent-fetch queue, generalizing the `pending_discussion_fetches` pattern. A single queue table serves all dependent resource types across Gates 1, 2, and 4, avoiding schema churn as new fetch types are added.
**New queue table (in migration 010):**
```sql
-- Generic queue for all dependent resource fetches (events, closes_issues, diffs)
-- Replaces per-type queue tables with a unified job model
CREATE TABLE pending_dependent_fetches (
id INTEGER PRIMARY KEY,
project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
entity_type TEXT NOT NULL CHECK (entity_type IN ('issue', 'merge_request')),
entity_iid INTEGER NOT NULL,
entity_local_id INTEGER NOT NULL,
job_type TEXT NOT NULL CHECK (job_type IN (
'resource_events', -- Gate 1: state + label + milestone events
'mr_closes_issues', -- Gate 2: closes_issues API
'mr_diffs' -- Gate 4: MR file changes
)),
payload_json TEXT, -- job-specific params, e.g. {"event_types":["state","label","milestone"]}
enqueued_at INTEGER NOT NULL,
attempts INTEGER NOT NULL DEFAULT 0,
last_error TEXT,
next_retry_at INTEGER,
locked_at INTEGER, -- crash recovery: NULL = available, non-NULL = in progress
UNIQUE(project_id, entity_type, entity_iid, job_type)
);
```
The `locked_at` column provides crash recovery: if a sync process crashes mid-drain, stale locks (older than 5 minutes) are automatically reclaimed on the next `lore sync` run. This is intentionally minimal — full job leasing with `locked_by` and lease expiration is unnecessary for a single-process CLI tool.
**Flow:**
1. During issue/MR ingestion, when an entity is upserted (new or updated), enqueue jobs in `pending_dependent_fetches`:
- For all entities: `job_type = 'resource_events'` (when `fetchResourceEvents` is true)
- For MRs: `job_type = 'mr_closes_issues'` (always, for Gate 2)
- For MRs: `job_type = 'mr_diffs'` (when `fetchMrFileChanges` is true, for Gate 4)
2. After primary ingestion completes, drain the dependent fetch queue:
- Claim jobs: `UPDATE ... SET locked_at = now WHERE locked_at IS NULL AND (next_retry_at IS NULL OR next_retry_at <= now)`
- For each job, dispatch by `job_type` to the appropriate fetcher
- On success: DELETE the job row
- On transient failure: increment `attempts`, set `next_retry_at` with exponential backoff, clear `locked_at`
3. `lore sync` drains dependent jobs after ingestion + discussion fetch steps.
**Incremental behavior:** Only entities that changed since last sync are enqueued. On `--full` sync, all entities are re-enqueued.
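The claim/retry bookkeeping from the flow above, modeled on a plain struct rather than the SQLite row (field names, the 5-minute constant, and the backoff curve are illustrative assumptions consistent with the text):

```rust
const STALE_LOCK_MS: i64 = 5 * 60 * 1000; // reclaim locks older than 5 min

struct Job {
    next_retry_at: Option<i64>,
    locked_at: Option<i64>, // NULL = available, non-NULL = in progress
}

// A job is claimable when unlocked (or its lock is stale, i.e. the previous
// sync crashed) and its retry backoff, if any, has elapsed.
fn claimable(job: &Job, now: i64) -> bool {
    let unlocked = match job.locked_at {
        None => true,
        Some(t) => now - t > STALE_LOCK_MS, // crash recovery
    };
    unlocked && job.next_retry_at.map_or(true, |t| t <= now)
}

// Exponential backoff after transient failures: 2 min, 4 min, 8 min, ...
fn backoff_ms(attempts: u32) -> i64 {
    60_000 * (1_i64 << attempts.min(10))
}

fn main() {
    let now = 10_000_000;
    assert!(claimable(&Job { next_retry_at: None, locked_at: None }, now));
    // Freshly locked job is skipped; a stale lock is reclaimed.
    assert!(!claimable(&Job { next_retry_at: None, locked_at: Some(now - 1_000) }, now));
    assert!(claimable(&Job { next_retry_at: None, locked_at: Some(now - STALE_LOCK_MS - 1) }, now));
    assert_eq!(backoff_ms(1), 120_000);
}
```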
### 1.6 API Call Budget
Per entity: 3 API calls (state, label, and milestone events), for issues and MRs alike.
| Scenario | Entities | API Calls | Time at 2k req/min |
|----------|----------|-----------|---------------------|
| Initial sync, 500 issues + 200 MRs | 700 | 2,100 | ~1 min |
| Initial sync, 2,000 issues + 1,000 MRs | 3,000 | 9,000 | ~4.5 min |
| Incremental sync, 20 changed entities | 20 | 60 | <2 sec |
Acceptable for initial sync. Incremental sync adds negligible overhead.
**Optimization (future):** If milestone events prove low-value, make them opt-in to reduce calls by 1/3.
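The arithmetic behind the budget table is straightforward (helper name is illustrative):

```rust
// Total calls and wall-clock minutes for a sync, given entities to refresh,
// dependent calls per entity, and the rate limit.
fn sync_budget(entities: u64, calls_per_entity: u64, rate_per_min: u64) -> (u64, f64) {
    let calls = entities * calls_per_entity;
    (calls, calls as f64 / rate_per_min as f64)
}

fn main() {
    // Initial sync: 500 issues + 200 MRs, 3 event endpoints each, 2k req/min.
    let (calls, minutes) = sync_budget(700, 3, 2_000);
    assert_eq!(calls, 2_100);
    assert!((minutes - 1.05).abs() < 1e-9);
    // Incremental: 20 changed entities => 60 calls, under 2 seconds.
    let (calls, minutes) = sync_budget(20, 3, 2_000);
    assert_eq!(calls, 60);
    assert!(minutes * 60.0 < 2.0);
}
```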
### 1.7 Acceptance Criteria
- [ ] Migration 010 creates all three event tables + generic dependent fetch queue
- [ ] `lore sync` fetches resource events for changed entities when `fetchResourceEvents` is true
- [ ] `lore sync --no-events` skips event fetching
- [ ] Event fetch failures are queued for retry with exponential backoff
- [ ] Stale locks (crashed sync) automatically reclaimed on next run
- [ ] `lore count events` shows event counts by type
- [ ] `lore stats --check` validates event table referential integrity
- [ ] `lore stats --check` validates dependent job queue health (no stuck locks, retryable jobs visible)
- [ ] Robot mode JSON for all new commands
---
## Gate 2: Cross-Reference Extraction
### 2.1 Rationale
Temporal queries need to follow links between entities: "MR !567 closed issue #234", "issue #234 mentioned in MR !567", "#299 was opened as a follow-up to !567". These relationships are captured in two places:
1. **Structured API:** `GET /projects/:id/merge_requests/:iid/closes_issues` returns issues that close when the MR merges. Also, `resource_state_events` includes `source_merge_request_id` for "closed by MR" events.
2. **System notes:** Cross-references like "mentioned in !456" and "closed by !789" appear in system note body text.
### 2.2 Schema (in Migration 010)
```sql
-- Cross-references between entities
-- Populated from: closes_issues API, state events, system note parsing
--
-- Directionality convention:
-- source = the entity where the reference was *observed* (contains the note, or is the MR in closes_issues)
-- target = the entity being *referenced* (the issue closed, the MR mentioned)
-- This is consistent across all source_methods and enables predictable BFS traversal.
--
-- Unresolved references: when a cross-reference points to an entity in a project
-- that isn't synced locally, target_entity_id is NULL but target_project_path and
-- target_entity_iid are populated. This preserves valuable edges rather than
-- silently dropping them. Timeline output marks these as "[external]".
CREATE TABLE entity_references (
id INTEGER PRIMARY KEY,
source_entity_type TEXT NOT NULL CHECK (source_entity_type IN ('issue', 'merge_request')),
source_entity_id INTEGER NOT NULL, -- local DB id
target_entity_type TEXT NOT NULL CHECK (target_entity_type IN ('issue', 'merge_request')),
target_entity_id INTEGER, -- local DB id (NULL when target is unresolved/external)
target_project_path TEXT, -- e.g. "group/other-repo" (populated for cross-project refs)
target_entity_iid INTEGER, -- GitLab iid (populated when target_entity_id is NULL)
reference_type TEXT NOT NULL, -- 'closes' | 'mentioned' | 'related'
source_method TEXT NOT NULL, -- 'api_closes_issues' | 'api_state_event' | 'system_note_parse'
created_at INTEGER -- when the reference was created (if known)
);
-- SQLite does not allow expressions in a table-level UNIQUE constraint, so
-- dedup across resolved and unresolved targets uses a unique expression index:
CREATE UNIQUE INDEX idx_refs_unique ON entity_references(
source_entity_type, source_entity_id, target_entity_type,
COALESCE(target_entity_id, -1), COALESCE(target_project_path, ''),
COALESCE(target_entity_iid, -1), reference_type
);
CREATE INDEX idx_refs_source ON entity_references(source_entity_type, source_entity_id);
CREATE INDEX idx_refs_target ON entity_references(target_entity_type, target_entity_id)
WHERE target_entity_id IS NOT NULL;
CREATE INDEX idx_refs_unresolved ON entity_references(target_project_path, target_entity_iid)
WHERE target_entity_id IS NULL;
```
### 2.3 Population Strategy
**Tier 1 — Structured APIs (reliable):**
1. **`closes_issues` endpoint:** After MR ingestion, fetch `GET /projects/:id/merge_requests/:iid/closes_issues`. Insert `reference_type = 'closes'`, `source_method = 'api_closes_issues'`. Source = MR, target = issue.
2. **State events:** When `resource_state_events` contains `source_merge_request_id`, insert `reference_type = 'closes'`, `source_method = 'api_state_event'`. Source = MR (referenced by iid), target = issue (that received the state change).
**Tier 2 — System note parsing (best-effort):**
Parse system notes where `is_system = 1` for cross-reference patterns.
**Directionality rule:** Source = entity containing the system note. Target = entity referenced by the note text. This is consistent with Tier 1's convention.
```
mentioned in !{iid}
mentioned in #{iid}
mentioned in {group}/{project}!{iid}
mentioned in {group}/{project}#{iid}
closed by !{iid}
closed by #{iid}
```
**Cross-project references:** When a system note references `{group}/{project}#{iid}` and the target project is not synced locally, store with `target_entity_id = NULL`, `target_project_path = '{group}/{project}'`, `target_entity_iid = {iid}`. These unresolved references are still valuable for timeline narratives — they indicate external dependencies and decision context even when we can't traverse further.
Insert with `source_method = 'system_note_parse'`. Accept that:
- This breaks on non-English GitLab instances
- Format may vary across GitLab versions
- Log parse failures at `debug` level for monitoring
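A minimal sketch of the Tier 2 parser over the patterns listed above. Names and types here are illustrative, not the actual gitlore API; a production version would also trim trailing punctuation and validate the project-path shape.

```rust
#[derive(Debug, PartialEq)]
struct ParsedRef {
    reference_type: &'static str,   // "mentioned" | "closes"
    target_is_mr: bool,             // '!' => merge request, '#' => issue
    target_project: Option<String>, // Some(..) for cross-project references
    target_iid: i64,
}

fn parse_system_note(body: &str) -> Option<ParsedRef> {
    let (reference_type, rest) = if let Some(r) = body.strip_prefix("mentioned in ") {
        ("mentioned", r)
    } else if let Some(r) = body.strip_prefix("closed by ") {
        ("closes", r)
    } else {
        return None; // not a cross-reference note; log at debug level
    };
    // Split "{group}/{project}!{iid}" (or "#{iid}") at the sigil.
    let sigil_pos = rest.find(|c: char| c == '!' || c == '#')?;
    let (project_part, sigil_and_iid) = rest.split_at(sigil_pos);
    let target_is_mr = sigil_and_iid.starts_with('!');
    let iid_str: String = sigil_and_iid[1..]
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    let target_iid: i64 = iid_str.parse().ok()?;
    // Empty project part means a same-project reference.
    let target_project =
        (!project_part.is_empty()).then(|| project_part.to_string());
    Some(ParsedRef { reference_type, target_is_mr, target_project, target_iid })
}
```

Cross-project notes parse to `target_project = Some(..)`; when the target project is not synced, the caller stores the row unresolved as described above.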
**Tier 3 — Description/body parsing (deferred):**
Issue and MR descriptions often contain `#123` or `!456` references. Parsing these is lower confidence (mentions != relationships) and is deferred to a future iteration.
### 2.4 Ingestion Flow
The `closes_issues` fetch uses the generic dependent fetch queue (`job_type = 'mr_closes_issues'`):
- After MR ingestion, a `mr_closes_issues` job is enqueued alongside `resource_events` jobs
- One additional API call per MR: `GET /projects/:id/merge_requests/:iid/closes_issues`
- Cross-reference parsing from system notes runs as a local post-processing step (no API calls) after all dependent fetches complete
### 2.5 Acceptance Criteria
- [ ] `entity_references` table populated from `closes_issues` API for all synced MRs
- [ ] `entity_references` table populated from `resource_state_events` where `source_merge_request_id` is present
- [ ] System notes parsed for cross-reference patterns (English instances)
- [ ] Cross-project references stored as unresolved when target project is not synced
- [ ] `source_method` column tracks provenance of each reference
- [ ] References are deduplicated (same relationship from multiple sources stored once)
- [ ] Timeline JSON includes expansion provenance (`via`) for all expanded entities
---
## Gate 3: Decision Timeline (`lore timeline`)
### 3.1 Command Design
```bash
# Basic: keyword-driven timeline
lore timeline "auth migration"
# Scoped to project
lore timeline "auth migration" -p group/repo
# Limit date range
lore timeline "auth migration" --since 6m
lore timeline "auth migration" --since 2024-01-01
# Control cross-reference expansion depth
lore timeline "auth migration" --depth 0 # No expansion (matched entities only)
lore timeline "auth migration" --depth 1 # Follow direct references (default)
lore timeline "auth migration" --depth 2 # Two hops
# Control which edge types are followed during expansion
lore timeline "auth migration" --expand-mentions # Also follow 'mentioned' edges (off by default)
# Default expansion follows 'closes' and 'related' edges only.
# 'mentioned' edges are excluded by default because they have high fan-out
# and often connect tangentially related entities.
# Limit results
lore timeline "auth migration" -n 50
# Robot mode
lore -J timeline "auth migration"
```
### 3.2 Query Flow
```
1. SEED: FTS5 keyword search → matched document IDs (issues, MRs, and notes/discussions)
2. HYDRATE:
- Map document IDs → source entities (issues, MRs)
- Collect top matched notes as evidence candidates (bounded, default top 10)
These are the actual decision-bearing comments that answer "why"
3. EXPAND: Follow entity_references (BFS, depth-limited)
→ Discover related entities not matched by keywords
→ Default: follow 'closes' + 'related' edges; skip 'mentioned' unless --expand-mentions
→ Unresolved (external) references included in output but not traversed further
4. COLLECT EVENTS: For all entities (seed + expanded):
- Entity creation (created_at from issues/merge_requests)
- State changes (resource_state_events)
- Label changes (resource_label_events)
- Milestone changes (resource_milestone_events)
- Evidence notes: top FTS5-matched notes as discrete events (snippet + author + url)
- Merge events (merged_at from merge_requests)
5. INTERLEAVE: Sort all events chronologically
6. RENDER: Format as timeline (human or JSON)
```
**Why evidence notes instead of "discussion activity summarized":** The forcing function is "What happened with X?" A timeline entry that says "3 new comments" doesn't answer *why* — it answers *how many*. By including the top FTS5-matched notes as first-class timeline events, the timeline surfaces the actual decision rationale, code review feedback, and architectural reasoning that motivated changes. This uses the existing search infrastructure (CP3) with no new indexing required.
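Step 3 (EXPAND) can be sketched as a depth-limited BFS. This uses an in-memory adjacency map for illustration; the real implementation would query `entity_references` per hop, and all names here are assumptions.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

type EntityId = (&'static str, i64); // (entity_type, entity_id)

struct Edge {
    target: EntityId,
    reference_type: &'static str, // "closes" | "related" | "mentioned"
}

fn expand(
    seeds: &[EntityId],
    edges: &HashMap<EntityId, Vec<Edge>>,
    max_depth: u32,       // --depth N (0 disables expansion)
    expand_mentions: bool, // --expand-mentions
) -> HashSet<EntityId> {
    let mut seen: HashSet<EntityId> = seeds.iter().copied().collect();
    let mut queue: VecDeque<(EntityId, u32)> =
        seeds.iter().map(|&e| (e, 0)).collect();
    while let Some((entity, depth)) = queue.pop_front() {
        if depth >= max_depth {
            continue; // depth cap reached for this branch
        }
        for edge in edges.get(&entity).into_iter().flatten() {
            // 'mentioned' edges are skipped unless --expand-mentions is set.
            if edge.reference_type == "mentioned" && !expand_mentions {
                continue;
            }
            if seen.insert(edge.target) {
                queue.push_back((edge.target, depth + 1));
            }
        }
    }
    seen
}
```

Unresolved (external) references have no local `EntityId`, so they are emitted into the output but never enqueued, matching the "not traversed further" rule.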
### 3.3 Event Model
The timeline doesn't store a separate unified event table. Instead, it queries across the existing tables at read time and produces a virtual event stream:
```rust
pub struct TimelineEvent {
pub timestamp: i64, // ms epoch
pub entity_type: String, // "issue" | "merge_request" | "discussion"
pub entity_iid: i64,
pub project_path: String,
pub event_type: TimelineEventType,
pub summary: String, // human-readable one-liner
pub actor: Option<String>, // username
pub url: Option<String>,
pub is_seed: bool, // matched by keyword (vs. expanded via reference)
}
pub enum TimelineEventType {
Created, // entity opened/created
StateChanged { state: String }, // closed, reopened, merged, locked
LabelAdded { label: String },
LabelRemoved { label: String },
MilestoneSet { milestone: String },
MilestoneRemoved { milestone: String },
Merged,
NoteEvidence { // FTS5-matched note surfacing decision rationale
note_id: i64,
snippet: String, // first ~200 chars of the matching note body
discussion_id: Option<i64>,
},
CrossReferenced { target: String },
}
```
### 3.4 Human Output Format
```
lore timeline "auth migration"
Timeline: "auth migration" (12 events across 4 entities)
───────────────────────────────────────────────────────
2024-03-15 CREATED #234 Migrate to OAuth2 @alice
Labels: ~auth, ~breaking-change
2024-03-18 CREATED !567 feat: add OAuth2 provider @bob
References: #234
2024-03-20 NOTE #234 "Should we support SAML too? I think @charlie
we should stick with OAuth2 for now..."
2024-03-22 LABEL !567 added ~security-review @alice
2024-03-24 NOTE !567 [src/auth/oauth.rs:45] @dave
"Consider refresh token rotation to
prevent session fixation attacks"
2024-03-25 MERGED !567 feat: add OAuth2 provider @alice
2024-03-26 CLOSED #234 closed by !567 @alice
2024-03-28 CREATED #299 OAuth2 login fails for SSO users @dave [expanded]
(via !567, closes)
───────────────────────────────────────────────────────
Seed entities: #234, !567 | Expanded: #299 (depth 1, via !567)
```
Entities discovered via cross-reference expansion are marked `[expanded]` with a compact provenance note showing which seed entity and edge type led to their discovery.
Evidence notes (`NOTE` events) show the first ~200 characters of FTS5-matched note bodies. These are the actual decision-bearing comments that answer "why" — not just activity counts.
### 3.5 Robot Mode JSON
```json
{
"ok": true,
"data": {
"query": "auth migration",
"event_count": 12,
"seed_entities": [
{ "type": "issue", "iid": 234, "project": "group/repo" },
{ "type": "merge_request", "iid": 567, "project": "group/repo" }
],
"expanded_entities": [
{
"type": "issue",
"iid": 299,
"project": "group/repo",
"depth": 1,
"via": {
"from": { "type": "merge_request", "iid": 567, "project": "group/repo" },
"reference_type": "closes",
"source_method": "api_closes_issues"
}
}
],
"unresolved_references": [
{
"source": { "type": "merge_request", "iid": 567, "project": "group/repo" },
"target_project": "group/other-repo",
"target_type": "issue",
"target_iid": 42,
"reference_type": "mentioned"
}
],
"events": [
{
"timestamp": "2024-03-15T10:00:00Z",
"entity_type": "issue",
"entity_iid": 234,
"project": "group/repo",
"event_type": "created",
"summary": "Migrate to OAuth2",
"actor": "alice",
"url": "https://gitlab.com/group/repo/-/issues/234",
"is_seed": true,
"details": {
"labels": ["auth", "breaking-change"]
}
},
{
"timestamp": "2024-03-20T14:30:00Z",
"entity_type": "issue",
"entity_iid": 234,
"project": "group/repo",
"event_type": "note_evidence",
"summary": "Should we support SAML too? I think we should stick with OAuth2 for now...",
"actor": "charlie",
"url": "https://gitlab.com/group/repo/-/issues/234#note_12345",
"is_seed": true,
"details": {
"note_id": 12345,
"snippet": "Should we support SAML too? I think we should stick with OAuth2 for now..."
}
}
]
},
"meta": {
"search_mode": "lexical",
"expansion_depth": 1,
"expand_mentions": false,
"total_entities": 3,
"total_events": 12,
"evidence_notes_included": 4,
"unresolved_references": 1
}
}
```
### 3.6 Acceptance Criteria
- [ ] `lore timeline <query>` returns chronologically ordered events
- [ ] Seed entities found via FTS5 keyword search (issues, MRs, and notes)
- [ ] State, label, and milestone events interleaved from resource event tables
- [ ] Entity creation and merge events included
- [ ] Evidence-bearing notes included as `note_evidence` events (top FTS5 matches, bounded default 10)
- [ ] Cross-reference expansion follows `entity_references` to configurable depth
- [ ] Default expansion follows `closes` + `related` edges; `--expand-mentions` adds `mentioned` edges
- [ ] `--depth 0` disables expansion
- [ ] `--since` filters by event timestamp
- [ ] `-p` scopes to project
- [ ] Human output is colored and readable
- [ ] Robot mode returns structured JSON with expansion provenance (`via`) for expanded entities
- [ ] Unresolved (external) references included in JSON output
---
## Gate 4: File Decision History (`lore file-history`)
### 4.1 Schema (Migration 011)
**File:** `migrations/011_file_changes.sql`
```sql
-- Files changed by each merge request
-- Source: GET /projects/:id/merge_requests/:iid/diffs
CREATE TABLE mr_file_changes (
id INTEGER PRIMARY KEY,
merge_request_id INTEGER NOT NULL REFERENCES merge_requests(id) ON DELETE CASCADE,
project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
old_path TEXT, -- NULL for new files
new_path TEXT NOT NULL,
change_type TEXT NOT NULL CHECK (change_type IN ('added', 'modified', 'deleted', 'renamed')),
UNIQUE(merge_request_id, new_path)
);
CREATE INDEX idx_mr_files_new_path ON mr_file_changes(new_path);
CREATE INDEX idx_mr_files_old_path ON mr_file_changes(old_path)
WHERE old_path IS NOT NULL;
CREATE INDEX idx_mr_files_mr ON mr_file_changes(merge_request_id);
-- Add commit SHAs to merge_requests (cherry-picked from Phase A)
-- These link MRs to actual git history
ALTER TABLE merge_requests ADD COLUMN merge_commit_sha TEXT;
ALTER TABLE merge_requests ADD COLUMN squash_commit_sha TEXT;
```
### 4.2 Config Extension
```json
{
"sync": {
"fetchMrFileChanges": true
}
}
```
Opt-in. When enabled, the sync pipeline fetches `GET /projects/:id/merge_requests/:iid/diffs` for each changed MR and extracts file metadata. Diff content is **not stored** — only file paths and change types.
### 4.3 Ingestion
**Uses the generic dependent fetch queue (`job_type = 'mr_diffs'`):**
1. After MR ingestion, if `fetchMrFileChanges` is true, enqueue a `mr_diffs` job in `pending_dependent_fetches`.
2. Parse response: `changes[].{old_path, new_path, new_file, renamed_file, deleted_file}`.
3. Derive `change_type`:
- `new_file == true``'added'`
- `renamed_file == true``'renamed'`
- `deleted_file == true``'deleted'`
- else → `'modified'`
4. Upsert into `mr_file_changes`. On re-sync, DELETE existing rows for the MR and re-insert (diffs can change if MR is rebased).
**API call cost:** 1 additional call per MR. Acceptable for incremental sync (10-50 MRs/day).
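Step 3's `change_type` derivation is a direct translation of the rules above, applied to the flags on each diff entry returned by the GitLab diffs API (function name illustrative):

```rust
// Precedence follows the rule order above: added > renamed > deleted,
// with 'modified' as the fallback when no flag is set.
fn change_type(new_file: bool, renamed_file: bool, deleted_file: bool) -> &'static str {
    if new_file {
        "added"
    } else if renamed_file {
        "renamed"
    } else if deleted_file {
        "deleted"
    } else {
        "modified"
    }
}
```

The returned value feeds the `CHECK (change_type IN (...))` constraint in migration 011 directly.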
### 4.4 Command Design
```bash
# Show decision history for a file
lore file-history src/auth/oauth.rs
# Scoped to project (required if file path exists in multiple projects)
lore file-history src/auth/oauth.rs -p group/repo
# Include discussions on the MRs
lore file-history src/auth/oauth.rs --discussions
# Follow rename chains (default: on)
lore file-history src/auth/oauth.rs # follows renames automatically
lore file-history src/auth/oauth.rs --no-follow-renames # disable rename chain resolution
# Limit results
lore file-history src/auth/oauth.rs -n 10
# Filter to merged MRs only
lore file-history src/auth/oauth.rs --merged
# Robot mode
lore -J file-history src/auth/oauth.rs
```
### 4.5 Query Logic
```sql
SELECT
mr.iid,
mr.title,
mr.state,
mr.author_username,
mr.merged_at,
mr.created_at,
mr.web_url,
mr.merge_commit_sha,
mfc.change_type,
mfc.old_path,
(SELECT COUNT(*) FROM discussions d
WHERE d.merge_request_id = mr.id) AS discussion_count,
(SELECT COUNT(*) FROM notes n
JOIN discussions d ON n.discussion_id = d.id
WHERE d.merge_request_id = mr.id
AND n.position_new_path = ?1) AS file_discussion_count
FROM mr_file_changes mfc
JOIN merge_requests mr ON mr.id = mfc.merge_request_id
WHERE mfc.new_path = ?1 OR mfc.old_path = ?1
-- With rename following enabled (the default), ?1 expands to the resolved
-- path set from 4.6:  WHERE mfc.new_path IN (...) OR mfc.old_path IN (...)
ORDER BY COALESCE(mr.merged_at, mr.created_at) DESC;
```
For each MR, optionally fetch related issues via `entity_references` (Gate 2 data).
### 4.6 Rename Handling
File renames are tracked via `old_path` and resolved as bounded chains:
1. Start with the query path in the path set: `{src/auth/oauth.rs}`
2. Search `mr_file_changes` for rows where `change_type = 'renamed'` and either `new_path` or `old_path` is in the path set
3. Add the other side of each rename to the path set
4. Repeat until no new paths are discovered, up to a maximum of 10 hops (configurable)
5. Use the full path set for the file history query
**Safeguards:**
- Hop cap (default 10) prevents runaway expansion
- Cycle detection: if a path is already in the set, skip it
- The unioned path set is used for matching MRs in the main query
**Output:**
- Human mode annotates the rename chain: `"src/auth/oauth.rs (renamed from src/auth/handler.rs ← src/auth.rs)"`
- Robot mode JSON includes `rename_chain`: `["src/auth.rs", "src/auth/handler.rs", "src/auth/oauth.rs"]`
- `--no-follow-renames` disables chain resolution (matches only the literal path provided)
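The chain-resolution steps above can be sketched as a bounded fixed-point search. This operates on an in-memory list of `(old_path, new_path)` rename rows for illustration; the real implementation would query `mr_file_changes WHERE change_type = 'renamed'`.

```rust
use std::collections::HashSet;

fn resolve_rename_chain(
    start: &str,
    renames: &[(String, String)], // (old_path, new_path) rows
    max_hops: usize,              // hop cap, default 10
) -> HashSet<String> {
    let mut paths = HashSet::new();
    paths.insert(start.to_string());
    for _hop in 0..max_hops {
        let mut added = false;
        for (old, new) in renames {
            // Follow renames in both directions; HashSet::insert returning
            // false doubles as cycle detection (path already known).
            if paths.contains(old) && paths.insert(new.clone()) {
                added = true;
            }
            if paths.contains(new) && paths.insert(old.clone()) {
                added = true;
            }
        }
        if !added {
            break; // fixed point reached before the hop cap
        }
    }
    paths
}
```

The resulting path set is what the 4.5 query matches against; with `--no-follow-renames` the set stays `{start}`.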
### 4.7 Acceptance Criteria
- [ ] `mr_file_changes` table populated from GitLab diffs API
- [ ] `merge_commit_sha` and `squash_commit_sha` captured in `merge_requests`
- [ ] `lore file-history <path>` returns MRs ordered by merge/creation date
- [ ] Output includes: MR title, state, author, change type, discussion count
- [ ] `--discussions` shows inline discussion snippets from DiffNotes on the file
- [ ] Rename chains resolved with bounded hop count (default 10) and cycle detection
- [ ] `--no-follow-renames` disables chain resolution
- [ ] Robot mode JSON includes `rename_chain` when renames are detected
- [ ] Robot mode JSON output
- [ ] `-p` required when path exists in multiple projects (exits with the ambiguous-match error)
---
## Gate 5: Code Trace (`lore trace`)
### 5.1 Overview
`lore trace` answers "Why was this code introduced?" by tracing from a file (and optionally a line number) back through the MR and issue that motivated the change.
### 5.2 Two-Tier Architecture
**Tier 1 — API-only (no local git required):**
Uses `merge_commit_sha` and `squash_commit_sha` from the `merge_requests` table to link MRs to commits. Combined with `mr_file_changes`, this can answer "which MRs touched this file" and link to their motivating issues via `entity_references`.
This is equivalent to `lore file-history` enriched with issue context — effectively a file-scoped decision timeline.
**Tier 2 — Git integration (requires local clone):**
Uses `git blame` to map a specific line to a commit SHA, then resolves the commit to an MR via `merge_commit_sha` lookup. This provides line-level precision.
**Gate 5 ships Tier 1 only.** Tier 2 (git integration via `git2-rs`) is a future enhancement.
### 5.3 Command Design
```bash
# Trace a file's history (Tier 1: API-only)
lore trace src/auth/oauth.rs
# Trace a specific line (Tier 2: requires local git)
lore trace src/auth/oauth.rs:45
# Robot mode
lore -J trace src/auth/oauth.rs
```
### 5.4 Query Flow (Tier 1)
```
1. Find MRs that touched this file (mr_file_changes)
2. For each MR, find related issues (entity_references WHERE reference_type = 'closes')
3. For each issue, fetch discussions with rationale
4. Build trace chain: file → MR → issue → discussions
5. Order by merge date (most recent first)
```
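The Tier 1 chain assembly is a join of Gate 4 data (MRs touching the file) with Gate 2 data (`closes` edges). A sketch with illustrative in-memory types; the real code would run the 4.5 query plus an `entity_references` lookup.

```rust
struct TraceEntry {
    mr_iid: i64,
    mr_title: String,
    closes_issues: Vec<i64>, // from entity_references, reference_type = 'closes'
}

fn build_trace(
    mrs_touching_file: &[(i64, String)], // (mr_iid, title) from mr_file_changes
    closes: &[(i64, i64)],               // (mr_iid, issue_iid) edges
) -> Vec<TraceEntry> {
    // Input is already ordered by merge date (step 5), so order is preserved.
    mrs_touching_file
        .iter()
        .map(|(iid, title)| TraceEntry {
            mr_iid: *iid,
            mr_title: title.clone(),
            closes_issues: closes
                .iter()
                .filter(|(m, _)| m == iid)
                .map(|(_, i)| *i)
                .collect(),
        })
        .collect()
}
```

Step 3 (fetching discussions with rationale) would then hydrate each `TraceEntry` from the `discussions`/`notes` tables.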
### 5.5 Output Format (Human)
```
lore trace src/auth/oauth.rs
Trace: src/auth/oauth.rs
────────────────────────
!567 feat: add OAuth2 provider MERGED 2024-03-25
→ Closes #234: Migrate to OAuth2
→ 12 discussion comments, 4 on this file
→ Decision: Use rust-oauth2 crate (discussed in #234, comment by @alice)
!612 fix: token refresh race condition MERGED 2024-04-10
→ Closes #299: OAuth2 login fails for SSO users
→ 5 discussion comments, 2 on this file
→ [src/auth/oauth.rs:45] "Add mutex around refresh to prevent double-refresh"
!701 refactor: extract TokenManager MERGED 2024-05-01
→ Related: #312: Reduce auth module complexity
→ 3 discussion comments
→ Note: file was renamed from src/auth/handler.rs
```
### 5.6 Tier 2 Design Notes (Future — Not in This Phase)
When git integration is added:
1. Add `git2-rs` dependency for native git operations
2. Implement `git blame -L <line>,<line> <file>` to get commit SHA for a specific line
3. Look up commit SHA in `merge_requests.merge_commit_sha` or `merge_requests.squash_commit_sha`
4. If no match (commit was squashed), search `merge_commit_sha` for commits in the blame range
5. Optional `blame_cache` table for performance (invalidated by content hash)
**Known limitation:** Squash commits break blame-to-MR mapping for individual commits within an MR. The squash commit SHA maps to the MR, but all lines show the same commit. This is a fundamental Git limitation documented in [GitLab Forum #77146](https://forum.gitlab.com/t/preserve-blame-in-squash-merge/77146).
### 5.7 Acceptance Criteria (Tier 1 Only)
- [ ] `lore trace <file>` shows MRs that touched the file with linked issues and discussion context
- [ ] Output includes the MR → issue → discussion chain
- [ ] Discussion snippets show DiffNote content on the traced file
- [ ] Cross-references from `entity_references` used for MR→issue linking
- [ ] Robot mode JSON output
- [ ] Graceful handling when no MR data found ("Run `lore sync` with `fetchMrFileChanges: true`")
---
## Migration Strategy
### Migration Numbering
Phase B uses migration numbers starting at 010:
| Migration | Content | Gate |
|-----------|---------|------|
| 010 | Resource event tables, generic dependent fetch queue, entity_references | Gates 1, 2 |
| 011 | mr_file_changes, merge_commit_sha, squash_commit_sha | Gate 4 |
Phase A's complete field capture migration should use 012+ when implemented, skipping fields already added by 011 (`merge_commit_sha`, `squash_commit_sha`).
### Backward Compatibility
- All new tables are additive (no ALTER on existing data-bearing columns)
- `lore sync` works without event data — temporal commands gracefully report "No event data. Run `lore sync` to populate."
- Existing search, issues, mrs commands are unaffected
---
## Risks and Mitigations
### Identified During Premortem
| Risk | Severity | Mitigation |
|------|----------|------------|
| API call volume explosion (3 event calls per entity) | Medium | Incremental sync limits to changed entities; opt-in config flag |
| System note parsing fragile for non-English instances | Medium | Used only for assignee changes and cross-refs; `source_method` tracks provenance |
| GitLab diffs API returns large payloads | Low | Extract file metadata only, discard diff content |
| Cross-reference graph traversal unbounded | Medium | BFS depth capped at configurable limit (default 1); `mentioned` edges excluded by default |
| Cross-project references lost when target not synced | Medium | Unresolved references stored with `target_entity_id = NULL`; still appear in timeline output |
| Phase A migration numbering conflict | Low | Phase B uses 010-011; Phase A uses 012+ |
| Timeline output lacks "why" evidence | Medium | Evidence-bearing notes from FTS5 included as first-class timeline events |
| Squash commits break blame-to-MR mapping | Medium | Tier 2 (git integration) deferred; Tier 1 uses file-level MR matching |
### Accepted Limitations
- **No real-time monitoring.** Phase B is batch queries over historical data. "Notify me when my code changes" requires a different architecture (webhooks, polling daemon) and is out of scope.
- **No pattern evolution.** Cross-project trend detection requires all of Phase B's infrastructure plus semantic clustering. Deferred to Phase C.
- **English-only system note parsing.** Cross-reference extraction from system notes works reliably only for English-language GitLab instances. Structured API data works for all languages.
- **Bounded rename chain resolution.** `lore file-history` resolves rename chains up to 10 hops with cycle detection. Pathological rename histories (>10 hops) are truncated.
- **Evidence notes are keyword-matched, not summarized.** Timeline evidence notes are the raw FTS5-matched note text, not AI-generated summaries. This keeps the system deterministic and avoids LLM dependencies.
---
## Success Metrics
| Metric | Target |
|--------|--------|
| `lore timeline` query latency | < 200ms for typical queries (< 50 seed entities) |
| Timeline event coverage | State + label + creation + merge + evidence note events for all synced entities |
| Timeline evidence quality | Top 10 FTS5-matched notes included per query; at least 1 evidence note for queries matching discussion-bearing entities |
| Cross-reference coverage | > 80% of "closed by MR" relationships captured via structured API |
| Unresolved reference capture | Cross-project references stored even when target project is not synced |
| Incremental sync overhead | < 5% increase in sync time for event fetching |
| `lore file-history` coverage | File changes captured for all synced MRs (when opt-in enabled) |
| Rename chain resolution | Multi-hop renames correctly resolved up to 10 hops |
---
## Future Phases (Out of Scope)
### Phase C: Advanced Temporal Features
- Pattern Evolution: cross-project trend detection via embedding clusters
- Git integration (Tier 2): `git blame` → commit → MR resolution
- MCP server: expose `timeline`, `file-history`, `trace` as typed MCP tools
### Phase D: Consumer Applications
- Web UI: separate frontend consuming lore's JSON API via `lore serve`
- Real-time monitoring: webhook listener or polling daemon for change notifications
- IDE integration: editor plugins surfacing temporal context inline