gitlore/docs/phase-b-temporal-intelligence.md
Taylor Eernisse 233eb546af feat: Add commit SHAs, closes_issues watermark, and PRD alignment
Migration 015 adds merge_commit_sha/squash_commit_sha to merge_requests
(Gate 4/5 prerequisites), closes_issues_synced_for_updated_at watermark
for incremental sync, and the missing idx_label_events_label index.

The MR transformer and ingestion pipeline now populate commit SHAs during
sync. The orchestrator uses watermark-based filtering for closes_issues
jobs instead of re-enqueuing all MRs every sync.

The Phase B PRD is updated to match the actual codebase: corrected
migration numbering (011-015), documented nullable label/milestone
fields (migration 012), watermark patterns (013), observability
infrastructure (014), simplified source_method values, and updated
entity_references schema to match implementation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-05 15:29:51 -05:00

# Phase B: Temporal Intelligence Foundation
> **Status:** Draft
> **Prerequisite:** CP3 Gates B+C complete (working search + sync pipeline)
> **Goal:** Transform gitlore from a search engine into a temporal code intelligence system by ingesting structured event data from GitLab and exposing temporal queries that answer "why" and "when" questions about project history.
---
## Motivation
gitlore currently stores **snapshots** — the latest state of each issue, MR, and discussion. But temporal queries need **change history**. When an issue's labels change from `priority::low` to `priority::critical`, the current schema overwrites the label junction. The transition is lost.
GitLab issues, MRs, and discussions contain the raw ingredients for temporal intelligence: state transitions, label mutations, assignee changes, cross-references between entities, and decision rationale in discussions. What's missing is a structured temporal index that makes these ingredients queryable.
### The Problem This Solves
Today, when an AI agent or developer asks "Why did the team switch from REST to GraphQL?" or "What happened with the auth migration?", the answer is scattered across paginated API responses with no temporal index, no cross-referencing, and no semantic layer. Reconstructing a decision timeline manually takes 20+ minutes of clicking through GitLab's UI. This phase makes it take 2 seconds.
### Forcing Function
This phase is designed around one concrete question: **"What happened with X?"** — where X is any keyword, feature name, or initiative. If `lore timeline "auth migration"` can produce a useful, chronologically-ordered narrative of all related events across issues, MRs, and discussions, the architecture is validated. If it can't, we learn what's missing before investing in deeper temporal features.
---
## Executive Summary (Gated Milestones)
Five gates, each independently verifiable and shippable:
**Gate 1 (Resource Events Ingestion):** Structured event data from GitLab APIs → local event tables
**Gate 2 (Cross-Reference Extraction):** Entity relationship graph from structured APIs + system note parsing
**Gate 3 (Decision Timeline):** `lore timeline` command — keyword-driven chronological narrative
**Gate 4 (File Decision History):** `lore file-history` command — MR-to-file linking + scoped timelines
**Gate 5 (Code Trace):** `lore trace` command — file:line → commit → MR → issue → rationale chain
### Key Design Decisions
- **Structured APIs over text parsing.** GitLab provides Resource Events APIs (`resource_state_events`, `resource_label_events`, `resource_milestone_events`) that return clean JSON. These are the primary data source for temporal events. System note parsing is a fallback for events without structured APIs (assignee changes, cross-references).
- **Dependent resource pattern.** Resource events are fetched per-entity, triggered by the existing dirty source tracking. Same architecture as discussion fetching — queue-based, resumable, incremental.
- **Event ingestion on by default, with an opt-out.** The new config flag `sync.fetchResourceEvents` (default `true`) controls whether the sync pipeline fetches event data. Users who don't need temporal features can set it to `false` (or pass `--no-events`) to skip the additional API calls.
- **Application-level graph traversal.** Cross-reference expansion uses BFS in Rust, not recursive SQL CTEs. Capped at configurable depth (default 1) for predictable performance.
- **Evolutionary library extraction.** New commands are built with typed return structs from day one. Old commands are not retrofitted until a concrete consumer (MCP server, web UI) requires it.
- **Phase A fields cherry-picked as needed.** `merge_commit_sha` and `squash_commit_sha` are added in migration 015 and populated during MR ingestion. Remaining Phase A fields are handled in their own migration later.
### Scope Boundaries
**In scope:**
- Batch temporal queries over historical data
- Structured event ingestion from GitLab APIs
- Cross-reference graph construction
- CLI commands with robot mode JSON output
**Out of scope (future phases):**
- Real-time monitoring / notifications ("alert me when my code changes")
- MCP server (Phase C — consumes the library API this phase produces)
- Web UI (Phase D — consumes the same library API)
- Pattern evolution / cross-project trend detection (Phase C)
- Library extraction refactor (happens organically as new commands are added)
---
## Gate 1: Resource Events Ingestion
### 1.1 Rationale: Why Not Parse System Notes?
The original approach was to parse system note body text with regex to extract state changes and label mutations. Research revealed this is the wrong approach:
1. **Structured APIs exist.** GitLab's Resource Events APIs return clean JSON with explicit `action`, `state`, and `label` fields. Available on all tiers (Free, Premium, Ultimate).
2. **System notes are localized.** A French GitLab instance says `"ajouté l'étiquette ~bug"` — regex breaks for non-English instances.
3. **Label events aren't in the Notes API.** Per [GitLab Issue #24661](https://gitlab.com/gitlab-org/gitlab/-/issues/24661), label change system notes are not returned by the Notes API. The Resource Label Events API is the only reliable source.
4. **No versioned format spec.** System note text has changed across GitLab 14.x to 17.x with no documentation of format changes.
System note parsing is still used for events without structured APIs (see Gate 2), but with the explicit understanding that it's best-effort and fragile for non-English instances.
### 1.2 Schema (Migration 011)
**File:** `migrations/011_resource_events.sql`
```sql
-- State change events (opened, closed, reopened, merged, locked)
-- Source: GET /projects/:id/issues/:iid/resource_state_events
-- Source: GET /projects/:id/merge_requests/:iid/resource_state_events
CREATE TABLE resource_state_events (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
issue_id INTEGER REFERENCES issues(id) ON DELETE CASCADE,
merge_request_id INTEGER REFERENCES merge_requests(id) ON DELETE CASCADE,
state TEXT NOT NULL, -- 'opened' | 'closed' | 'reopened' | 'merged' | 'locked'
actor_gitlab_id INTEGER, -- GitLab user ID (stable; usernames can change)
actor_username TEXT, -- display/search convenience
created_at INTEGER NOT NULL, -- ms epoch UTC
source_commit TEXT, -- commit SHA that caused this state change
source_merge_request_iid INTEGER, -- iid from source_merge_request ref
CHECK (
(issue_id IS NOT NULL AND merge_request_id IS NULL)
OR (issue_id IS NULL AND merge_request_id IS NOT NULL)
)
);
CREATE UNIQUE INDEX uq_state_events_gitlab ON resource_state_events(gitlab_id, project_id);
CREATE INDEX idx_state_events_issue ON resource_state_events(issue_id)
WHERE issue_id IS NOT NULL;
CREATE INDEX idx_state_events_mr ON resource_state_events(merge_request_id)
WHERE merge_request_id IS NOT NULL;
CREATE INDEX idx_state_events_created ON resource_state_events(created_at);
-- Label change events (add, remove)
-- Source: GET /projects/:id/issues/:iid/resource_label_events
-- Source: GET /projects/:id/merge_requests/:iid/resource_label_events
CREATE TABLE resource_label_events (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
issue_id INTEGER REFERENCES issues(id) ON DELETE CASCADE,
merge_request_id INTEGER REFERENCES merge_requests(id) ON DELETE CASCADE,
action TEXT NOT NULL CHECK (action IN ('add', 'remove')),
label_name TEXT, -- nullable: GitLab returns null for deleted labels (see §1.2.1)
actor_gitlab_id INTEGER,
actor_username TEXT,
created_at INTEGER NOT NULL, -- ms epoch UTC
CHECK (
(issue_id IS NOT NULL AND merge_request_id IS NULL)
OR (issue_id IS NULL AND merge_request_id IS NOT NULL)
)
);
CREATE UNIQUE INDEX uq_label_events_gitlab ON resource_label_events(gitlab_id, project_id);
CREATE INDEX idx_label_events_issue ON resource_label_events(issue_id)
WHERE issue_id IS NOT NULL;
CREATE INDEX idx_label_events_mr ON resource_label_events(merge_request_id)
WHERE merge_request_id IS NOT NULL;
CREATE INDEX idx_label_events_created ON resource_label_events(created_at);
-- Note: idx_label_events_label was added in migration 015 (not in the original 011)
-- Milestone change events (add, remove)
-- Source: GET /projects/:id/issues/:iid/resource_milestone_events
-- Source: GET /projects/:id/merge_requests/:iid/resource_milestone_events
CREATE TABLE resource_milestone_events (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
issue_id INTEGER REFERENCES issues(id) ON DELETE CASCADE,
merge_request_id INTEGER REFERENCES merge_requests(id) ON DELETE CASCADE,
action TEXT NOT NULL CHECK (action IN ('add', 'remove')),
milestone_title TEXT, -- nullable: GitLab returns null for deleted milestones (see §1.2.1)
milestone_id INTEGER,
actor_gitlab_id INTEGER,
actor_username TEXT,
created_at INTEGER NOT NULL, -- ms epoch UTC
CHECK (
(issue_id IS NOT NULL AND merge_request_id IS NULL)
OR (issue_id IS NULL AND merge_request_id IS NOT NULL)
)
);
CREATE UNIQUE INDEX uq_milestone_events_gitlab ON resource_milestone_events(gitlab_id, project_id);
CREATE INDEX idx_milestone_events_issue ON resource_milestone_events(issue_id)
WHERE issue_id IS NOT NULL;
CREATE INDEX idx_milestone_events_mr ON resource_milestone_events(merge_request_id)
WHERE merge_request_id IS NOT NULL;
CREATE INDEX idx_milestone_events_created ON resource_milestone_events(created_at);
```
#### 1.2.1 Nullable Label and Milestone Fields (Migration 012)
GitLab returns `null` for `label` and `milestone` in Resource Events when the referenced label or milestone has been deleted from the project. This was discovered in production after the initial schema deployed with `NOT NULL` constraints.
**Migration 012** recreates `resource_label_events` and `resource_milestone_events` with nullable `label_name` and `milestone_title` columns. The table-swap approach (create new → copy → drop old → rename) is required because SQLite doesn't support `ALTER COLUMN`.
Timeline queries that encounter null labels/milestones display `"[deleted label]"` or `"[deleted milestone]"` in human output and omit the name field in robot JSON.
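A minimal sketch of the human-output fallback described above, assuming the nullable column is read as an `Option<&str>`. The names `DeletedKind` and `display_name` are hypothetical and not taken from the codebase.

```rust
/// Which kind of deleted reference a NULL name stands for (hypothetical helper type).
pub enum DeletedKind {
    Label,
    Milestone,
}

/// Name to show in human timeline output.
/// `None` models a NULL `label_name` / `milestone_title` row.
pub fn display_name(name: Option<&str>, kind: DeletedKind) -> &str {
    match (name, kind) {
        (Some(n), _) => n,
        (None, DeletedKind::Label) => "[deleted label]",
        (None, DeletedKind::Milestone) => "[deleted milestone]",
    }
}
```

Robot JSON would skip the field instead of calling this helper, so consumers can distinguish "deleted" from a label that is literally named `[deleted label]`.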
#### 1.2.2 Resource Event Watermarks (Migration 013)
To avoid re-fetching resource events for every entity on every sync, a watermark column tracks the `updated_at` value at the time of last successful event fetch:
```sql
ALTER TABLE issues ADD COLUMN resource_events_synced_for_updated_at INTEGER;
ALTER TABLE merge_requests ADD COLUMN resource_events_synced_for_updated_at INTEGER;
```
**Incremental behavior:** During sync, only entities where `updated_at > COALESCE(resource_events_synced_for_updated_at, 0)` are enqueued for resource event fetching. On `--full` sync, these watermarks are reset to `NULL`, causing all entities to be re-enqueued.
This mirrors the existing `discussions_synced_for_updated_at` pattern and works in conjunction with the dependent fetch queue.
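The enqueue predicate can be expressed in a few lines. This is a sketch only: the function name `needs_event_fetch` is an assumption, and the real pipeline may apply the filter in SQL rather than in Rust.

```rust
/// Returns true when an entity should be enqueued for a resource event fetch.
/// Mirrors `updated_at > COALESCE(resource_events_synced_for_updated_at, 0)`.
/// A `None` watermark (never synced, or reset by `--full`) always enqueues
/// for any positive `updated_at`.
pub fn needs_event_fetch(updated_at: i64, synced_for_updated_at: Option<i64>) -> bool {
    updated_at > synced_for_updated_at.unwrap_or(0)
}
```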
### 1.3 Config Extension
**File:** `src/core/config.rs`
Add to `SyncConfig`:
```rust
/// Fetch resource events (state, label, milestone changes) during sync.
/// Increases API calls but enables temporal queries (lore timeline, etc.).
/// Default: true
#[serde(default = "default_true")]
pub fetch_resource_events: bool,
```
**Config file example:**
```json
{
"sync": {
"fetchResourceEvents": true
}
}
```
### 1.4 GitLab API Client
**New endpoints in `src/gitlab/client.rs`:**
```
GET /projects/:id/issues/:iid/resource_state_events?per_page=100
GET /projects/:id/issues/:iid/resource_label_events?per_page=100
GET /projects/:id/merge_requests/:iid/resource_state_events?per_page=100
GET /projects/:id/merge_requests/:iid/resource_label_events?per_page=100
GET /projects/:id/issues/:iid/resource_milestone_events?per_page=100
GET /projects/:id/merge_requests/:iid/resource_milestone_events?per_page=100
```
All endpoints use standard pagination. Fetch all pages per entity.
**New serde types in `src/gitlab/types.rs`:**
```rust
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct GitLabStateEvent {
pub id: i64,
pub user: Option<GitLabAuthor>,
pub created_at: String,
pub resource_type: String, // "Issue" | "MergeRequest"
pub resource_id: i64,
pub state: String, // "opened" | "closed" | "reopened" | "merged" | "locked"
pub source_commit: Option<String>,
pub source_merge_request: Option<GitLabMergeRequestRef>,
}
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct GitLabLabelEvent {
pub id: i64,
pub user: Option<GitLabAuthor>,
pub created_at: String,
pub resource_type: String,
pub resource_id: i64,
pub label: Option<GitLabLabelRef>, // nullable: deleted labels return null
pub action: String, // "add" | "remove"
}
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct GitLabMilestoneEvent {
pub id: i64,
pub user: Option<GitLabAuthor>,
pub created_at: String,
pub resource_type: String,
pub resource_id: i64,
pub milestone: Option<GitLabMilestoneRef>, // nullable: deleted milestones return null
pub action: String, // "add" | "remove"
}
```
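The API types carry `created_at` as a string, while the event tables store milliseconds since the epoch in UTC. The sketch below shows one dependency-free way to convert, assuming timestamps arrive in the `YYYY-MM-DDTHH:MM:SS[.fff]Z` form. Both function names are hypothetical, offsets other than `Z` are rejected, and the real code may well use a date-time crate instead.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date
/// (the well-known days-from-civil algorithm).
fn days_from_civil(year: i64, month: i64, day: i64) -> i64 {
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400); // year of era, 0..=399
    let mp = if month > 2 { month - 3 } else { month + 9 }; // March-based month
    let doy = (153 * mp + 2) / 5 + day - 1; // day of (March-based) year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

/// Parses `YYYY-MM-DDTHH:MM:SS[.fff]Z` into ms since the epoch (UTC).
/// Returns None for anything that does not match that shape.
pub fn parse_gitlab_timestamp_ms(s: &str) -> Option<i64> {
    let s = s.strip_suffix('Z')?;
    let (date, time) = s.split_once('T')?;

    let mut d = date.splitn(3, '-');
    let year: i64 = d.next()?.parse().ok()?;
    let month: i64 = d.next()?.parse().ok()?;
    let day: i64 = d.next()?.parse().ok()?;

    let (hms, frac) = time.split_once('.').unwrap_or((time, ""));
    let mut t = hms.splitn(3, ':');
    let hour: i64 = t.next()?.parse().ok()?;
    let minute: i64 = t.next()?.parse().ok()?;
    let second: i64 = t.next()?.parse().ok()?;

    let in_range = (1..=12).contains(&month)
        && (1..=31).contains(&day)
        && (0..=23).contains(&hour)
        && (0..=59).contains(&minute)
        && (0..=59).contains(&second);
    if !in_range || !frac.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }

    // Keep the first three fractional digits, right-padded with zeros.
    let digits: String = frac.chars().chain("000".chars()).take(3).collect();
    let millis: i64 = digits.parse().ok()?;

    let days = days_from_civil(year, month, day);
    Some(((days * 24 + hour) * 60 + minute) * 60_000 + second * 1_000 + millis)
}
```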
### 1.5 Ingestion Pipeline
**Architecture:** Generic dependent-fetch queue, generalizing the `pending_discussion_fetches` pattern. A single queue table serves all dependent resource types across Gates 1, 2, and 4, avoiding schema churn as new fetch types are added.
**New queue table (in migration 011):**
```sql
-- Generic queue for all dependent resource fetches (events, closes_issues, diffs)
-- Replaces per-type queue tables with a unified job model
CREATE TABLE pending_dependent_fetches (
id INTEGER PRIMARY KEY,
project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
entity_type TEXT NOT NULL CHECK (entity_type IN ('issue', 'merge_request')),
entity_iid INTEGER NOT NULL,
entity_local_id INTEGER NOT NULL,
job_type TEXT NOT NULL CHECK (job_type IN (
'resource_events', -- Gate 1: state + label + milestone events
'mr_closes_issues', -- Gate 2: closes_issues API
'mr_diffs' -- Gate 4: MR file changes
)),
payload_json TEXT, -- job-specific params, e.g. {"event_types":["state","label","milestone"]}
enqueued_at INTEGER NOT NULL,
attempts INTEGER NOT NULL DEFAULT 0,
last_error TEXT,
next_retry_at INTEGER,
locked_at INTEGER, -- crash recovery: NULL = available, non-NULL = in progress
UNIQUE(project_id, entity_type, entity_iid, job_type)
);
```
The `locked_at` column provides crash recovery: if a sync process crashes mid-drain, stale locks (older than 5 minutes) are automatically reclaimed on the next `lore sync` run. This is intentionally minimal — full job leasing with `locked_by` and lease expiration is unnecessary for a single-process CLI tool.
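As an illustration of the reclaim rule, the sketch below decides whether a queue row may be claimed. It assumes `locked_at` and `next_retry_at` are ms-epoch values like the other timestamps in this document, and it folds stale-lock reclamation into the claim check; the implementation might instead clear stale locks in a separate step. `JobLockState` and `is_claimable` are hypothetical names.

```rust
/// Locks older than this are treated as left behind by a crashed sync.
const STALE_LOCK_MS: i64 = 5 * 60 * 1000;

/// The two scheduling columns of a `pending_dependent_fetches` row.
pub struct JobLockState {
    pub locked_at: Option<i64>,     // ms epoch; None = available
    pub next_retry_at: Option<i64>, // ms epoch; None = no backoff pending
}

/// A job is claimable when it is unlocked (or its lock is stale)
/// and any retry backoff has elapsed.
pub fn is_claimable(job: &JobLockState, now_ms: i64) -> bool {
    let unlocked_or_stale = match job.locked_at {
        None => true,
        Some(locked) => now_ms - locked > STALE_LOCK_MS,
    };
    let retry_due = job.next_retry_at.map_or(true, |at| at <= now_ms);
    unlocked_or_stale && retry_due
}
```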
**Flow:**
1. During issue/MR ingestion, when an entity is upserted (new or updated), enqueue jobs in `pending_dependent_fetches`:
- For all entities: `job_type = 'resource_events'` (when `fetchResourceEvents` is true)
- For MRs: `job_type = 'mr_closes_issues'` (always, for Gate 2)
- For MRs: `job_type = 'mr_diffs'` (when `fetchMrFileChanges` is true, for Gate 4)
2. After primary ingestion completes, drain the dependent fetch queue:
- Claim jobs: `UPDATE ... SET locked_at = now WHERE locked_at IS NULL AND (next_retry_at IS NULL OR next_retry_at <= now)`
- For each job, dispatch by `job_type` to the appropriate fetcher
- On success: DELETE the job row
- On transient failure: increment `attempts`, set `next_retry_at` with exponential backoff, clear `locked_at`
3. `lore sync` drains dependent jobs after ingestion + discussion fetch steps.
**Incremental behavior:** Only entities that changed since last sync are enqueued. On `--full` sync, all entities are re-enqueued.
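The transient-failure branch of the flow sets `next_retry_at` with exponential backoff. A possible shape for that calculation is sketched here; the 30-second base and one-hour cap are illustrative assumptions, not values from this document, and `next_retry_at` as a function name is hypothetical.

```rust
/// Computes the next retry time after a transient failure.
/// `attempts` is the attempt count *after* incrementing (1 = first failure).
/// BASE_MS and MAX_MS are illustrative values, not taken from the PRD.
pub fn next_retry_at(now_ms: i64, attempts: u32) -> i64 {
    const BASE_MS: i64 = 30_000;
    const MAX_MS: i64 = 3_600_000;
    // Clamp the exponent so the shift cannot overflow.
    let exponent = attempts.saturating_sub(1).min(20);
    let delay = BASE_MS.saturating_mul(1i64 << exponent).min(MAX_MS);
    now_ms + delay
}
```

With these assumed constants the delay doubles from 30 seconds per failure and stops growing once it reaches one hour.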
### 1.6 API Call Budget
Per entity: 3 API calls (state + label + milestone) for issues, 3 for MRs.
| Scenario | Entities | API Calls | Time at 2k req/min |
|----------|----------|-----------|---------------------|
| Initial sync, 500 issues + 200 MRs | 700 | 2,100 | ~1 min |
| Initial sync, 2,000 issues + 1,000 MRs | 3,000 | 9,000 | ~4.5 min |
| Incremental sync, 20 changed entities | 20 | 60 | <2 sec |
Acceptable for initial sync. Incremental sync adds negligible overhead.
**Optimization (future):** If milestone events prove low-value, make them opt-in to reduce calls by 1/3.
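The table rows follow from simple arithmetic: three calls per entity, divided by the request rate. The sketch below reproduces that calculation so other project sizes can be estimated; `Budget` and `event_fetch_budget` are hypothetical names, and the estimate ignores pagination beyond the first page.

```rust
/// Estimated cost of fetching resource events for a set of entities.
pub struct Budget {
    pub api_calls: u64,
    pub seconds: f64,
}

/// Three calls per entity (state + label + milestone), spread over a
/// fixed request rate. Assumes each event list fits in one page.
pub fn event_fetch_budget(entities: u64, requests_per_minute: u64) -> Budget {
    const CALLS_PER_ENTITY: u64 = 3;
    let api_calls = entities * CALLS_PER_ENTITY;
    let seconds = api_calls as f64 * 60.0 / requests_per_minute as f64;
    Budget { api_calls, seconds }
}
```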
### 1.7 Acceptance Criteria
- [x] Migration 011 creates all three event tables + generic dependent fetch queue
- [x] `lore sync` fetches resource events for changed entities when `fetchResourceEvents` is true
- [x] `lore sync --no-events` skips event fetching
- [x] Event fetch failures are queued for retry with exponential backoff
- [x] Stale locks (crashed sync) automatically reclaimed on next run
- [x] `lore count events` shows event counts by type
- [ ] `lore stats --check` validates event table referential integrity
- [ ] `lore stats --check` validates dependent job queue health (no stuck locks, retryable jobs visible)
- [ ] Robot mode JSON for all new commands
### 1.8 Observability Infrastructure (Migration 014)
The sync pipeline includes lightweight observability via `sync_runs` enrichment. Migration 014 adds:
```sql
ALTER TABLE sync_runs ADD COLUMN run_id TEXT; -- correlation ID for log tracing
ALTER TABLE sync_runs ADD COLUMN total_items_processed INTEGER DEFAULT 0;
ALTER TABLE sync_runs ADD COLUMN total_errors INTEGER DEFAULT 0;
CREATE INDEX IF NOT EXISTS idx_sync_runs_run_id ON sync_runs(run_id);
```
**Purpose:** The `run_id` column correlates log entries (via `tracing`) with sync run records. `total_items_processed` and `total_errors` provide aggregate counts for `lore sync-status` and robot mode health checks without requiring log parsing.
This is separate from the event tables but supports the same operational workflow — answering "did the last sync succeed?" and "how many entities were processed?" programmatically.
---
## Gate 2: Cross-Reference Extraction
### 2.1 Rationale
Temporal queries need to follow links between entities: "MR !567 closed issue #234", "issue #234 mentioned in MR !567", "#299 was opened as a follow-up to !567". These relationships are captured in two places:
1. **Structured API:** `GET /projects/:id/merge_requests/:iid/closes_issues` returns issues that close when the MR merges. Also, `resource_state_events` includes `source_merge_request_iid` for "closed by MR" events.
2. **System notes:** Cross-references like "mentioned in !456" and "closed by !789" appear in system note body text.
### 2.2 Schema (in Migration 011)
```sql
-- Cross-references between entities
-- Populated from: closes_issues API, state events, system note parsing
--
-- Directionality convention:
-- source = the entity where the reference was *observed* (contains the note, or is the MR in closes_issues)
-- target = the entity being *referenced* (the issue closed, the MR mentioned)
-- This is consistent across all source_methods and enables predictable BFS traversal.
--
-- Unresolved references: when a cross-reference points to an entity in a project
-- that isn't synced locally, target_entity_id is NULL but target_project_path and
-- target_entity_iid are populated. This preserves valuable edges rather than
-- silently dropping them. Timeline output marks these as "[external]".
CREATE TABLE entity_references (
id INTEGER PRIMARY KEY,
project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
source_entity_type TEXT NOT NULL CHECK (source_entity_type IN ('issue', 'merge_request')),
source_entity_id INTEGER NOT NULL, -- local DB id
target_entity_type TEXT NOT NULL CHECK (target_entity_type IN ('issue', 'merge_request')),
target_entity_id INTEGER, -- local DB id (NULL when target is unresolved/external)
target_project_path TEXT, -- e.g. "group/other-repo" (populated for cross-project refs)
target_entity_iid INTEGER, -- GitLab iid (populated when target_entity_id is NULL)
reference_type TEXT NOT NULL CHECK (reference_type IN ('closes', 'mentioned', 'related')),
source_method TEXT NOT NULL CHECK (source_method IN ('api', 'note_parse', 'description_parse')),
created_at INTEGER NOT NULL -- ms epoch UTC
);
-- Unique constraint includes source_method: the same relationship can be discovered by
-- multiple methods (e.g., the closes_issues API and system note parsing), and we store one row per method for provenance.
CREATE UNIQUE INDEX uq_entity_refs ON entity_references(
project_id, source_entity_type, source_entity_id, target_entity_type,
COALESCE(target_entity_id, -1), COALESCE(target_project_path, ''),
COALESCE(target_entity_iid, -1), reference_type, source_method
);
CREATE INDEX idx_entity_refs_source ON entity_references(source_entity_type, source_entity_id);
CREATE INDEX idx_entity_refs_target ON entity_references(target_entity_id)
WHERE target_entity_id IS NOT NULL;
CREATE INDEX idx_entity_refs_unresolved ON entity_references(target_project_path, target_entity_iid)
WHERE target_entity_id IS NULL;
```
**`source_method` values:**
| Value | Meaning |
|-------|---------|
| `'api'` | Populated from structured GitLab APIs (`closes_issues`, `resource_state_events`) |
| `'note_parse'` | Extracted from system note body text (best-effort, English only) |
| `'description_parse'` | Extracted from issue/MR description body text (future) |
The original design used more granular values (`'api_closes_issues'`, `'api_state_event'`, `'system_note_parse'`). In practice, the API-sourced references don't need sub-method distinction — the `reference_type` already captures the semantic relationship — so the implementation simplified to three values.
### 2.3 Population Strategy
**Tier 1 — Structured APIs (reliable):**
1. **`closes_issues` endpoint:** After MR ingestion, fetch `GET /projects/:id/merge_requests/:iid/closes_issues`. Insert `reference_type = 'closes'`, `source_method = 'api'`. Source = MR, target = issue.
2. **State events:** When `resource_state_events` contains `source_merge_request_iid`, insert `reference_type = 'closes'`, `source_method = 'api'`. Source = MR (referenced by iid), target = issue (that received the state change).
**Tier 2 — System note parsing (best-effort):**
Parse system notes where `is_system = 1` for cross-reference patterns.
**Directionality rule:** Source = entity containing the system note. Target = entity referenced by the note text. This is consistent with Tier 1's convention.
```
mentioned in !{iid}
mentioned in #{iid}
mentioned in {group}/{project}!{iid}
mentioned in {group}/{project}#{iid}
closed by !{iid}
closed by #{iid}
```
**Cross-project references:** When a system note references `{group}/{project}#{iid}` and the target project is not synced locally, store with `target_entity_id = NULL`, `target_project_path = '{group}/{project}'`, `target_entity_iid = {iid}`. These unresolved references are still valuable for timeline narratives — they indicate external dependencies and decision context even when we can't traverse further.
Insert with `source_method = 'note_parse'`. Accept that:
- This breaks on non-English GitLab instances
- Format may vary across GitLab versions
- Log parse failures at `debug` level for monitoring
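A dependency-free sketch of a parser for the six patterns listed above. It handles only those exact English prefixes, and the mapping of "closed by" to `reference_type = 'closes'` is an assumption. `ParsedReference` and `parse_system_note` are hypothetical names; the actual implementation may use regular expressions.

```rust
/// One cross-reference extracted from a system note body.
#[derive(Debug, PartialEq)]
pub struct ParsedReference {
    pub reference_type: &'static str,        // "mentioned" | "closes"
    pub target_entity_type: &'static str,    // "issue" | "merge_request"
    pub target_project_path: Option<String>, // Some(..) for cross-project refs
    pub target_iid: i64,
}

/// Parses bodies such as `mentioned in !456` or `closed by group/project#12`.
/// Returns None when the body matches none of the known patterns.
pub fn parse_system_note(body: &str) -> Option<ParsedReference> {
    let body = body.trim();
    let (reference_type, rest) = if let Some(r) = body.strip_prefix("mentioned in ") {
        ("mentioned", r)
    } else if let Some(r) = body.strip_prefix("closed by ") {
        ("closes", r)
    } else {
        return None;
    };

    // The last sigil separates the optional project path from the iid.
    let sigil_pos = rest.rfind(|c: char| c == '!' || c == '#')?;
    let (path, tail) = rest.split_at(sigil_pos);
    let target_entity_type = if tail.starts_with('!') { "merge_request" } else { "issue" };

    let iid_text = &tail[1..]; // safe: both sigils are single-byte ASCII
    if iid_text.is_empty() || !iid_text.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let target_iid: i64 = iid_text.parse().ok()?;

    let target_project_path = match path {
        "" => None,
        p if p.contains('/') && !p.contains(char::is_whitespace) => Some(p.to_string()),
        _ => return None,
    };

    Some(ParsedReference { reference_type, target_entity_type, target_project_path, target_iid })
}
```

A `None` result is the case that would be logged at `debug` level.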
**Tier 3 — Description/body parsing (`source_method = 'description_parse'`, deferred):**
Issue and MR descriptions often contain `#123` or `!456` references. Parsing these is lower confidence (mentions != relationships) and is deferred to a future iteration. The `source_method` value `'description_parse'` is reserved in the CHECK constraint for this future work.
### 2.4 Ingestion Flow
The `closes_issues` fetch uses the generic dependent fetch queue (`job_type = 'mr_closes_issues'`):
- After MR ingestion, a `mr_closes_issues` job is enqueued alongside `resource_events` jobs
- One additional API call per MR: `GET /projects/:id/merge_requests/:iid/closes_issues`
- Cross-reference parsing from system notes runs as a local post-processing step (no API calls) after all dependent fetches complete
**Watermark pattern (migration 015):** A `closes_issues_synced_for_updated_at` column on `merge_requests` tracks the last `updated_at` value at which closes_issues data was fetched. Only MRs where `updated_at > COALESCE(closes_issues_synced_for_updated_at, 0)` are enqueued for re-fetching. The watermark is updated after successful fetch or after a permanent API error (e.g., 404 for external MRs). On `--full` sync, the watermark is reset to `NULL`.
### 2.5 Acceptance Criteria
- [ ] `entity_references` table populated from `closes_issues` API for all synced MRs
- [ ] `entity_references` table populated from `resource_state_events` where `source_merge_request_iid` is present
- [ ] System notes parsed for cross-reference patterns (English instances)
- [ ] Cross-project references stored as unresolved when target project is not synced
- [ ] `source_method` column tracks provenance of each reference
- [ ] References are deduplicated per `source_method` (a relationship rediscovered by the same method is stored once; different methods keep separate rows for provenance)
- [ ] Timeline JSON includes expansion provenance (`via`) for all expanded entities
---
## Gate 3: Decision Timeline (`lore timeline`)
### 3.1 Command Design
```bash
# Basic: keyword-driven timeline
lore timeline "auth migration"
# Scoped to project
lore timeline "auth migration" -p group/repo
# Limit date range
lore timeline "auth migration" --since 6m
lore timeline "auth migration" --since 2024-01-01
# Control cross-reference expansion depth
lore timeline "auth migration" --depth 0 # No expansion (matched entities only)
lore timeline "auth migration" --depth 1 # Follow direct references (default)
lore timeline "auth migration" --depth 2 # Two hops
# Control which edge types are followed during expansion
lore timeline "auth migration" --expand-mentions # Also follow 'mentioned' edges (off by default)
# Default expansion follows 'closes' and 'related' edges only.
# 'mentioned' edges are excluded by default because they have high fan-out
# and often connect tangentially related entities.
# Limit results
lore timeline "auth migration" -n 50
# Robot mode
lore -J timeline "auth migration"
```
### 3.2 Query Flow
```
1. SEED: FTS5 keyword search → matched document IDs (issues, MRs, and notes/discussions)
2. HYDRATE:
- Map document IDs → source entities (issues, MRs)
- Collect top matched notes as evidence candidates (bounded, default top 10)
These are the actual decision-bearing comments that answer "why"
3. EXPAND: Follow entity_references (BFS, depth-limited)
→ Discover related entities not matched by keywords
→ Default: follow 'closes' + 'related' edges; skip 'mentioned' unless --expand-mentions
→ Unresolved (external) references included in output but not traversed further
4. COLLECT EVENTS: For all entities (seed + expanded):
- Entity creation (created_at from issues/merge_requests)
- State changes (resource_state_events)
- Label changes (resource_label_events)
- Milestone changes (resource_milestone_events)
- Evidence notes: top FTS5-matched notes as discrete events (snippet + author + url)
- Merge events (merged_at from merge_requests)
5. INTERLEAVE: Sort all events chronologically
6. RENDER: Format as timeline (human or JSON)
```
**Why evidence notes instead of "discussion activity summarized":** The forcing function is "What happened with X?" A timeline entry that says "3 new comments" doesn't answer *why* — it answers *how many*. By including the top FTS5-matched notes as first-class timeline events, the timeline surfaces the actual decision rationale, code review feedback, and architectural reasoning that motivated changes. This uses the existing search infrastructure (CP3) with no new indexing required.
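The EXPAND step can be sketched as a depth-limited BFS over resolved `entity_references` rows. Several things here are assumptions: all type and function names are hypothetical, edges are traversed in both directions (the PRD does not state the traversal direction), and unresolved references are simply absent from the edge list because they are never traversed.

```rust
use std::collections::{HashMap, VecDeque};

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum EntityType { Issue, MergeRequest }

/// (entity type, local DB id)
pub type EntityKey = (EntityType, i64);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RefType { Closes, Mentioned, Related }

/// A resolved row from `entity_references`.
pub struct Edge {
    pub source: EntityKey,
    pub target: EntityKey,
    pub reference_type: RefType,
}

/// Returns every reachable entity with the depth at which it was found.
/// Seeds have depth 0; `max_depth = 0` disables expansion.
pub fn expand(
    seeds: &[EntityKey],
    edges: &[Edge],
    max_depth: u32,
    expand_mentions: bool,
) -> HashMap<EntityKey, u32> {
    // Build an undirected adjacency list, skipping 'mentioned' edges by default.
    let mut adjacency: HashMap<EntityKey, Vec<EntityKey>> = HashMap::new();
    for e in edges {
        if e.reference_type == RefType::Mentioned && !expand_mentions {
            continue;
        }
        adjacency.entry(e.source).or_default().push(e.target);
        adjacency.entry(e.target).or_default().push(e.source);
    }

    let mut depth_of: HashMap<EntityKey, u32> = HashMap::new();
    let mut queue = VecDeque::new();
    for &seed in seeds {
        if depth_of.insert(seed, 0).is_none() {
            queue.push_back(seed);
        }
    }

    while let Some(current) = queue.pop_front() {
        let depth = depth_of[&current];
        if depth >= max_depth {
            continue; // do not expand past the configured depth
        }
        if let Some(neighbors) = adjacency.get(&current) {
            for &next in neighbors {
                if !depth_of.contains_key(&next) {
                    depth_of.insert(next, depth + 1);
                    queue.push_back(next);
                }
            }
        }
    }
    depth_of
}
```

Because the visited map doubles as the depth record, each entity is reported at its shortest distance from any seed.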
### 3.3 Event Model
The timeline doesn't store a separate unified event table. Instead, it queries across the existing tables at read time and produces a virtual event stream:
```rust
pub struct TimelineEvent {
pub timestamp: i64, // ms epoch
pub entity_type: String, // "issue" | "merge_request" | "discussion"
pub entity_iid: i64,
pub project_path: String,
pub event_type: TimelineEventType,
pub summary: String, // human-readable one-liner
pub actor: Option<String>, // username
pub url: Option<String>,
pub is_seed: bool, // matched by keyword (vs. expanded via reference)
}
pub enum TimelineEventType {
Created, // entity opened/created
StateChanged { state: String }, // closed, reopened, merged, locked
LabelAdded { label: String },
LabelRemoved { label: String },
MilestoneSet { milestone: String },
MilestoneRemoved { milestone: String },
Merged,
NoteEvidence { // FTS5-matched note surfacing decision rationale
note_id: i64,
snippet: String, // first ~200 chars of the matching note body
discussion_id: Option<i64>,
},
CrossReferenced { target: String },
}
```
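Since the event stream is assembled at read time, the INTERLEAVE step reduces to merging per-table results and sorting them. The sketch below uses a trimmed-down `Event` struct in place of `TimelineEvent` to stay self-contained; `interleave` and the `since_ms` parameter are hypothetical names for what `--since` might map to.

```rust
/// Reduced stand-in for `TimelineEvent`, keeping only what sorting needs.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub timestamp: i64, // ms epoch
    pub entity_iid: i64,
    pub summary: String,
}

/// Merges per-source event lists (state, label, milestone, notes, ...)
/// into one chronological stream, optionally dropping events before `since_ms`.
pub fn interleave(streams: Vec<Vec<Event>>, since_ms: Option<i64>) -> Vec<Event> {
    let mut all: Vec<Event> = streams
        .into_iter()
        .flatten()
        .filter(|e| since_ms.map_or(true, |since| e.timestamp >= since))
        .collect();
    // Stable sort: events with equal timestamps keep their source order.
    all.sort_by_key(|e| e.timestamp);
    all
}
```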
### 3.4 Human Output Format
```
lore timeline "auth migration"
Timeline: "auth migration" (12 events across 3 entities)
───────────────────────────────────────────────────────
2024-03-15 CREATED #234 Migrate to OAuth2 @alice
Labels: ~auth, ~breaking-change
2024-03-18 CREATED !567 feat: add OAuth2 provider @bob
References: #234
2024-03-20 NOTE #234 "Should we support SAML too? I think @charlie
we should stick with OAuth2 for now..."
2024-03-22 LABEL !567 added ~security-review @alice
2024-03-24 NOTE !567 [src/auth/oauth.rs:45] @dave
"Consider refresh token rotation to
prevent session fixation attacks"
2024-03-25 MERGED !567 feat: add OAuth2 provider @alice
2024-03-26 CLOSED #234 closed by !567 @alice
2024-03-28 CREATED #299 OAuth2 login fails for SSO users @dave [expanded]
(via !567, closes)
───────────────────────────────────────────────────────
Seed entities: #234, !567 | Expanded: #299 (depth 1, via !567)
```
Entities discovered via cross-reference expansion are marked `[expanded]` with a compact provenance note showing which seed entity and edge type led to their discovery.
Evidence notes (`NOTE` events) show the first ~200 characters of FTS5-matched note bodies. These are the actual decision-bearing comments that answer "why" — not just activity counts.
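Truncating a note body to roughly 200 characters has one Rust-specific pitfall: slicing by byte index can panic in the middle of a multi-byte character. A sketch that counts `char`s instead is shown below; the function name `snippet` and the `...` suffix (matching the sample output) are assumptions.

```rust
/// Returns at most `max_chars` characters of `body`, appending "..."
/// when the body was cut. Counts Unicode scalar values, not bytes, so it
/// never splits a multi-byte character.
pub fn snippet(body: &str, max_chars: usize) -> String {
    let mut chars = body.chars();
    let head: String = chars.by_ref().take(max_chars).collect();
    if chars.next().is_some() {
        format!("{}...", head.trim_end())
    } else {
        head
    }
}
```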
### 3.5 Robot Mode JSON
```json
{
"ok": true,
"data": {
"query": "auth migration",
"event_count": 12,
"seed_entities": [
{ "type": "issue", "iid": 234, "project": "group/repo" },
{ "type": "merge_request", "iid": 567, "project": "group/repo" }
],
"expanded_entities": [
{
"type": "issue",
"iid": 299,
"project": "group/repo",
"depth": 1,
"via": {
"from": { "type": "merge_request", "iid": 567, "project": "group/repo" },
"reference_type": "closes",
"source_method": "api"
}
}
],
"unresolved_references": [
{
"source": { "type": "merge_request", "iid": 567, "project": "group/repo" },
"target_project": "group/other-repo",
"target_type": "issue",
"target_iid": 42,
"reference_type": "mentioned"
}
],
"events": [
{
"timestamp": "2024-03-15T10:00:00Z",
"entity_type": "issue",
"entity_iid": 234,
"project": "group/repo",
"event_type": "created",
"summary": "Migrate to OAuth2",
"actor": "alice",
"url": "https://gitlab.com/group/repo/-/issues/234",
"is_seed": true,
"details": {
"labels": ["auth", "breaking-change"]
}
},
{
"timestamp": "2024-03-20T14:30:00Z",
"entity_type": "issue",
"entity_iid": 234,
"project": "group/repo",
"event_type": "note_evidence",
"summary": "Should we support SAML too? I think we should stick with OAuth2 for now...",
"actor": "charlie",
"url": "https://gitlab.com/group/repo/-/issues/234#note_12345",
"is_seed": true,
"details": {
"note_id": 12345,
"snippet": "Should we support SAML too? I think we should stick with OAuth2 for now..."
}
}
]
},
"meta": {
"search_mode": "lexical",
"expansion_depth": 1,
"expand_mentions": false,
"total_entities": 3,
"total_events": 12,
"evidence_notes_included": 4,
"unresolved_references": 1
}
}
```
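A consumer of the robot-mode payload will usually want to know how each expanded entity was reached and what kinds of events came back. The following is a minimal Python sketch, assuming the payload shape shown in the example above; `summarize_timeline` is a hypothetical helper name and is not part of lore.

```python
import json


def summarize_timeline(payload: str) -> dict:
    """Summarize a robot-mode timeline response (shape assumed from the example)."""
    doc = json.loads(payload)
    if not doc.get("ok"):
        raise ValueError("timeline query did not succeed")
    data = doc["data"]

    # Map each expanded entity to the seed entity and edge type that led to it.
    provenance = {}
    for entity in data.get("expanded_entities", []):
        via = entity["via"]
        provenance[(entity["type"], entity["iid"])] = (
            via["from"]["type"],
            via["from"]["iid"],
            via["reference_type"],
        )

    # Count events per event_type.
    counts = {}
    for event in data.get("events", []):
        counts[event["event_type"]] = counts.get(event["event_type"], 0) + 1

    return {"provenance": provenance, "event_type_counts": counts}
```

Keying provenance by `(type, iid)` is enough for a single-project query; a multi-project consumer would probably add `project` to the key.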
### 3.6 Acceptance Criteria
- [ ] `lore timeline <query>` returns chronologically ordered events
- [ ] Seed entities found via FTS5 keyword search (issues, MRs, and notes)
- [ ] State, label, and milestone events interleaved from resource event tables
- [ ] Entity creation and merge events included
- [ ] Evidence-bearing notes included as `note_evidence` events (top FTS5 matches, bounded default 10)
- [ ] Cross-reference expansion follows `entity_references` to configurable depth
- [ ] Default expansion follows `closes` + `related` edges; `--expand-mentions` adds `mentioned` edges
- [ ] `--depth 0` disables expansion
- [ ] `--since` filters by event timestamp
- [ ] `-p` scopes to project
- [ ] Human output is colored and readable
- [ ] Robot mode returns structured JSON with expansion provenance (`via`) for expanded entities
- [ ] Unresolved (external) references included in JSON output
---
## Gate 4: File Decision History (`lore file-history`)
### 4.1 Schema
**Commit SHAs (Migration 015 — already applied):**
`merge_commit_sha` and `squash_commit_sha` were added to `merge_requests` in migration 015. These are now populated during MR ingestion and available for Gate 4/5 queries.
**File changes table (future migration — not yet created):**
```sql
-- Files changed by each merge request
-- Source: GET /projects/:id/merge_requests/:iid/diffs
CREATE TABLE mr_file_changes (
id INTEGER PRIMARY KEY,
merge_request_id INTEGER NOT NULL REFERENCES merge_requests(id) ON DELETE CASCADE,
project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
old_path TEXT, -- NULL for new files
new_path TEXT NOT NULL,
change_type TEXT NOT NULL CHECK (change_type IN ('added', 'modified', 'deleted', 'renamed')),
UNIQUE(merge_request_id, new_path)
);
CREATE INDEX idx_mr_files_new_path ON mr_file_changes(new_path);
CREATE INDEX idx_mr_files_old_path ON mr_file_changes(old_path)
WHERE old_path IS NOT NULL;
CREATE INDEX idx_mr_files_mr ON mr_file_changes(merge_request_id);
```
### 4.2 Config Extension
```json
{
"sync": {
"fetchMrFileChanges": true
}
}
```
Opt-in. When enabled, the sync pipeline fetches `GET /projects/:id/merge_requests/:iid/diffs` for each changed MR and extracts file metadata. Diff content is **not stored** — only file paths and change types.
### 4.3 Ingestion
**Uses the generic dependent fetch queue (`job_type = 'mr_diffs'`):**
1. After MR ingestion, if `fetchMrFileChanges` is true, enqueue a `mr_diffs` job in `pending_dependent_fetches`.
2. Parse response: the endpoint returns an array of diff entries, each carrying `{old_path, new_path, new_file, renamed_file, deleted_file}`.
3. Derive `change_type`:
- `new_file == true` → `'added'`
- `renamed_file == true` → `'renamed'`
- `deleted_file == true` → `'deleted'`
- else → `'modified'`
4. Write the rows to `mr_file_changes`. On re-sync, DELETE existing rows for the MR and re-insert, because diffs can change if the MR is rebased.
**API call cost:** 1 additional call per MR. Acceptable for incremental sync (10–50 MRs/day).
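The derivation and the delete-then-reinsert step can be sketched as follows. This is an illustrative Python sketch, not the Rust implementation. It assumes the `mr_file_changes` table from section 4.1 already exists, and the function names are hypothetical.

```python
import sqlite3


def derive_change_type(diff: dict) -> str:
    """Apply the flag precedence from step 3: added, renamed, deleted, else modified."""
    if diff.get("new_file"):
        return "added"
    if diff.get("renamed_file"):
        return "renamed"
    if diff.get("deleted_file"):
        return "deleted"
    return "modified"


def replace_mr_file_changes(conn: sqlite3.Connection, merge_request_id: int,
                            project_id: int, diffs: list) -> None:
    """Replace all file-change rows for one MR inside a single transaction."""
    rows = [
        (
            merge_request_id,
            project_id,
            # The schema stores NULL in old_path for newly added files.
            None if d.get("new_file") else d.get("old_path"),
            d["new_path"],
            derive_change_type(d),
        )
        for d in diffs
    ]
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute(
            "DELETE FROM mr_file_changes WHERE merge_request_id = ?",
            (merge_request_id,),
        )
        conn.executemany(
            "INSERT INTO mr_file_changes "
            "(merge_request_id, project_id, old_path, new_path, change_type) "
            "VALUES (?, ?, ?, ?, ?)",
            rows,
        )
```

Running the delete and the inserts in one transaction means a failed re-sync leaves the previous rows in place rather than an empty file list for the MR.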
### 4.4 Command Design
```bash
# Show decision history for a file
lore file-history src/auth/oauth.rs
# Scoped to project (required if file path exists in multiple projects)
lore file-history src/auth/oauth.rs -p group/repo
# Include discussions on the MRs
lore file-history src/auth/oauth.rs --discussions
# Follow rename chains (default: on)
lore file-history src/auth/oauth.rs # follows renames automatically
lore file-history src/auth/oauth.rs --no-follow-renames # disable rename chain resolution
# Limit results
lore file-history src/auth/oauth.rs -n 10
# Filter to merged MRs only
lore file-history src/auth/oauth.rs --merged
# Robot mode
lore -J file-history src/auth/oauth.rs
```
### 4.5 Query Logic
```sql
SELECT
mr.iid,
mr.title,
mr.state,
mr.author_username,
mr.merged_at,
mr.created_at,
mr.web_url,
mr.merge_commit_sha,
mfc.change_type,
mfc.old_path,
(SELECT COUNT(*) FROM discussions d
WHERE d.merge_request_id = mr.id) AS discussion_count,
(SELECT COUNT(*) FROM notes n
JOIN discussions d ON n.discussion_id = d.id
WHERE d.merge_request_id = mr.id
AND n.position_new_path = ?1) AS file_discussion_count
FROM mr_file_changes mfc
JOIN merge_requests mr ON mr.id = mfc.merge_request_id
WHERE mfc.new_path = ?1 OR mfc.old_path = ?1
ORDER BY COALESCE(mr.merged_at, mr.created_at) DESC;
```
For each MR, optionally fetch related issues via `entity_references` (Gate 2 data).
### 4.6 Rename Handling
File renames are tracked via `old_path` and resolved as bounded chains:
1. Start with the query path in the path set: `{src/auth/oauth.rs}`
2. Search `mr_file_changes` for rows where `change_type = 'renamed'` and either `new_path` or `old_path` is in the path set
3. Add the other side of each rename to the path set
4. Repeat until no new paths are discovered, up to a maximum of 10 hops (configurable)
5. Use the full path set for the file history query
**Safeguards:**
- Hop cap (default 10) prevents runaway expansion
- Cycle detection: if a path is already in the set, skip it
- The unioned path set is used for matching MRs in the main query
**Output:**
- Human mode annotates the rename chain: `"src/auth/oauth.rs (renamed from src/auth/handler.rs ← src/auth.rs)"`
- Robot mode JSON includes `rename_chain`: `["src/auth.rs", "src/auth/handler.rs", "src/auth/oauth.rs"]`
- `--no-follow-renames` disables chain resolution (matches only the literal path provided)
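The bounded resolution described in this section can be sketched as a breadth-first expansion over rename pairs. This Python sketch is illustrative only. It assumes the `renamed` rows have already been loaded as `(old_path, new_path)` tuples, and `resolve_rename_paths` is a hypothetical name.

```python
def resolve_rename_paths(start: str, renames: list, max_hops: int = 10) -> set:
    """Return every path linked to `start` through rename rows, up to max_hops.

    `renames` is a list of (old_path, new_path) pairs taken from rows where
    change_type = 'renamed'.
    """
    paths = {start}
    frontier = {start}
    for _ in range(max_hops):
        discovered = set()
        for old_path, new_path in renames:
            # Follow the rename in either direction; skipping paths that are
            # already in the set is what provides cycle detection.
            if old_path in frontier and new_path not in paths:
                discovered.add(new_path)
            if new_path in frontier and old_path not in paths:
                discovered.add(old_path)
        if not discovered:
            break
        paths |= discovered
        frontier = discovered
    return paths
```

The sketch returns an unordered set, which is enough for matching MRs. Producing the ordered `rename_chain` array would additionally need the merge dates of the renaming MRs.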
### 4.7 Acceptance Criteria
- [ ] `mr_file_changes` table populated from GitLab diffs API
- [ ] `merge_commit_sha` and `squash_commit_sha` captured in `merge_requests`
- [ ] `lore file-history <path>` returns MRs ordered by merge/creation date
- [ ] Output includes: MR title, state, author, change type, discussion count
- [ ] `--discussions` shows inline discussion snippets from DiffNotes on the file
- [ ] Rename chains resolved with bounded hop count (default 10) and cycle detection
- [ ] `--no-follow-renames` disables chain resolution
- [ ] Robot mode JSON includes `rename_chain` when renames are detected
- [ ] Robot mode JSON output
- [ ] `-p` required when path exists in multiple projects (Ambiguous error)
---
## Gate 5: Code Trace (`lore trace`)
### 5.1 Overview
`lore trace` answers "Why was this code introduced?" by tracing from a file (and optionally a line number) back through the MR and issue that motivated the change.
### 5.2 Two-Tier Architecture
**Tier 1 — API-only (no local git required):**
Uses `merge_commit_sha` and `squash_commit_sha` from the `merge_requests` table to link MRs to commits. Combined with `mr_file_changes`, this can answer "which MRs touched this file" and link to their motivating issues via `entity_references`.
This is equivalent to `lore file-history` enriched with issue context — effectively a file-scoped decision timeline.
**Tier 2 — Git integration (requires local clone):**
Uses `git blame` to map a specific line to a commit SHA, then resolves the commit to an MR via `merge_commit_sha` lookup. This provides line-level precision.
**Gate 5 ships Tier 1 only.** Tier 2 (git integration via `git2-rs`) is a future enhancement.
### 5.3 Command Design
```bash
# Trace a file's history (Tier 1: API-only)
lore trace src/auth/oauth.rs
# Trace a specific line (Tier 2: requires local git)
lore trace src/auth/oauth.rs:45
# Robot mode
lore -J trace src/auth/oauth.rs
```
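Because the same argument can be either a bare path or `path:line`, the command has to split it before deciding which tier applies. A minimal Python sketch of that split follows; the helper name and the choice to accept only positive ASCII line numbers are assumptions, not documented lore behavior.

```python
from typing import Optional, Tuple


def parse_trace_target(arg: str) -> Tuple[str, Optional[int]]:
    """Split 'path:line' into (path, line); return (arg, None) for a bare path."""
    path, sep, suffix = arg.rpartition(":")
    if sep and path and suffix.isascii() and suffix.isdigit() and int(suffix) > 0:
        return path, int(suffix)
    # No trailing ':<number>' component, so treat the whole argument as a path.
    return arg, None
```

Splitting on the last colon keeps paths that themselves contain a colon intact whenever the trailing component is not a line number.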
### 5.4 Query Flow (Tier 1)
```
1. Find MRs that touched this file (mr_file_changes)
2. For each MR, find related issues (entity_references WHERE reference_type = 'closes')
3. For each issue, fetch discussions with rationale
4. Build trace chain: file → MR → issue → discussions
5. Order by merge date (most recent first)
```
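Steps 1, 2, 4, and 5 of this flow can be sketched over in-memory rows as shown below. This is a Python illustration under simplifying assumptions: the tuple layouts and the `TraceEntry` type are hypothetical, discussion fetching (step 3) is omitted, and `merged_at` values are assumed to be ISO 8601 strings in one timezone so that they sort lexically.

```python
from dataclasses import dataclass, field


@dataclass
class TraceEntry:
    mr_iid: int
    merged_at: str
    closes_issue_iids: list = field(default_factory=list)


def build_trace(path: str, file_changes: list, merged_at_by_mr: dict,
                references: list) -> list:
    """Build file -> MR -> issue links, most recently merged MR first.

    file_changes: (mr_iid, new_path, old_path) tuples
    merged_at_by_mr: mr_iid -> ISO 8601 merge timestamp
    references: (mr_iid, issue_iid, reference_type) tuples
    """
    # Step 1: MRs that touched the file under its new or old path.
    touched = {mr for mr, new_path, old_path in file_changes
               if path in (new_path, old_path)}
    entries = []
    for mr_iid in touched:
        # Step 2: issues linked to the MR through 'closes' edges.
        closes = sorted(issue for mr, issue, ref_type in references
                        if mr == mr_iid and ref_type == "closes")
        entries.append(TraceEntry(mr_iid, merged_at_by_mr[mr_iid], closes))
    # Step 5: order by merge date, newest first.
    entries.sort(key=lambda entry: entry.merged_at, reverse=True)
    return entries
```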
### 5.5 Output Format (Human)
```
lore trace src/auth/oauth.rs
Trace: src/auth/oauth.rs
────────────────────────
!567 feat: add OAuth2 provider MERGED 2024-03-25
→ Closes #234: Migrate to OAuth2
→ 12 discussion comments, 4 on this file
→ Decision: Use rust-oauth2 crate (discussed in #234, comment by @alice)
!612 fix: token refresh race condition MERGED 2024-04-10
→ Closes #299: OAuth2 login fails for SSO users
→ 5 discussion comments, 2 on this file
→ [src/auth/oauth.rs:45] "Add mutex around refresh to prevent double-refresh"
!701 refactor: extract TokenManager MERGED 2024-05-01
→ Related: #312: Reduce auth module complexity
→ 3 discussion comments
→ Note: file was renamed from src/auth/handler.rs
```
### 5.6 Tier 2 Design Notes (Future — Not in This Phase)
When git integration is added:
1. Add `git2-rs` dependency for native git operations
2. Implement `git blame -L <line>,<line> <file>` to get commit SHA for a specific line
3. Look up commit SHA in `merge_requests.merge_commit_sha` or `merge_requests.squash_commit_sha`
4. If no match (commit was squashed), search `merge_commit_sha` for commits in the blame range
5. Optional `blame_cache` table for performance (invalidated by content hash)
**Known limitation:** Squash commits break blame-to-MR mapping for individual commits within an MR. The squash commit SHA maps to the MR, but all lines show the same commit. This is a fundamental Git limitation documented in [GitLab Forum #77146](https://forum.gitlab.com/t/preserve-blame-in-squash-merge/77146).
### 5.7 Acceptance Criteria (Tier 1 Only)
- [ ] `lore trace <file>` shows MRs that touched the file with linked issues and discussion context
- [ ] Output includes the MR → issue → discussion chain
- [ ] Discussion snippets show DiffNote content on the traced file
- [ ] Cross-references from `entity_references` used for MR→issue linking
- [ ] Robot mode JSON output
- [ ] Graceful handling when no MR data found ("Run `lore sync` with `fetchMrFileChanges: true`")
---
## Migration Strategy
### Migration Numbering
Phase B uses migration numbers 011–015. The original plan assumed migration 010 was available, but chunk config (`010_chunk_config.sql`) was implemented first, shifting everything by +1.
| Migration | File | Content | Gate |
|-----------|------|---------|------|
| 011 | `011_resource_events.sql` | Resource event tables (state, label, milestone), entity_references, generic dependent fetch queue | Gates 1, 2 |
| 012 | `012_nullable_label_milestone.sql` | Make `label_name` and `milestone_title` nullable for deleted labels/milestones | Gate 1 (fix) |
| 013 | `013_resource_event_watermarks.sql` | Add `resource_events_synced_for_updated_at` to issues and merge_requests | Gate 1 (optimization) |
| 014 | `014_sync_runs_enrichment.sql` | Observability: `run_id`, `total_items_processed`, `total_errors` on sync_runs | Observability |
| 015 | `015_commit_shas_and_closes_watermark.sql` | `merge_commit_sha`, `squash_commit_sha`, `closes_issues_synced_for_updated_at` on merge_requests; `idx_label_events_label` index | Gates 2, 4 |
| TBD | — | `mr_file_changes` table for MR diff data | Gate 4 |
### Backward Compatibility
- All new tables are additive (no ALTER on existing data-bearing columns)
- `lore sync` works without event data — temporal commands gracefully report "No event data. Run `lore sync` to populate."
- Existing search, issues, mrs commands are unaffected
---
## Risks and Mitigations
### Identified During Premortem
| Risk | Severity | Mitigation |
|------|----------|------------|
| API call volume explosion (3 event calls per entity) | Medium | Incremental sync limits to changed entities; opt-in config flag |
| System note parsing fragile for non-English instances | Medium | Used only for assignee changes and cross-refs; `source_method` tracks provenance |
| GitLab diffs API returns large payloads | Low | Extract file metadata only, discard diff content |
| Cross-reference graph traversal unbounded | Medium | BFS depth capped at configurable limit (default 1); `mentioned` edges excluded by default |
| Cross-project references lost when target not synced | Medium | Unresolved references stored with `target_entity_id = NULL`; still appear in timeline output |
| Phase A migration numbering conflict | Low | Resolved: chunk config took 010; Phase B shifted to 011-015 |
| Timeline output lacks "why" evidence | Medium | Evidence-bearing notes from FTS5 included as first-class timeline events |
| Squash commits break blame-to-MR mapping | Medium | Tier 2 (git integration) deferred; Tier 1 uses file-level MR matching |
### Accepted Limitations
- **No real-time monitoring.** Phase B is batch queries over historical data. "Notify me when my code changes" requires a different architecture (webhooks, polling daemon) and is out of scope.
- **No pattern evolution.** Cross-project trend detection requires all of Phase B's infrastructure plus semantic clustering. Deferred to Phase C.
- **English-only system note parsing.** Cross-reference extraction from system notes works reliably only for English-language GitLab instances. Structured API data works for all languages.
- **Bounded rename chain resolution.** `lore file-history` resolves rename chains up to 10 hops with cycle detection. Pathological rename histories (>10 hops) are truncated.
- **Evidence notes are keyword-matched, not summarized.** Timeline evidence notes are the raw FTS5-matched note text, not AI-generated summaries. This keeps the system deterministic and avoids LLM dependencies.
---
## Success Metrics
| Metric | Target |
|--------|--------|
| `lore timeline` query latency | < 200ms for typical queries (< 50 seed entities) |
| Timeline event coverage | State + label + creation + merge + evidence note events for all synced entities |
| Timeline evidence quality | Top 10 FTS5-matched notes included per query; at least 1 evidence note for queries matching discussion-bearing entities |
| Cross-reference coverage | > 80% of "closed by MR" relationships captured via structured API |
| Unresolved reference capture | Cross-project references stored even when target project is not synced |
| Incremental sync overhead | < 5% increase in sync time for event fetching |
| `lore file-history` coverage | File changes captured for all synced MRs (when opt-in enabled) |
| Rename chain resolution | Multi-hop renames correctly resolved up to 10 hops |
---
## Future Phases (Out of Scope)
### Phase C: Advanced Temporal Features
- Pattern Evolution: cross-project trend detection via embedding clusters
- Git integration (Tier 2): `git blame` → commit → MR resolution
- MCP server: expose `timeline`, `file-history`, `trace` as typed MCP tools
### Phase D: Consumer Applications
- Web UI: separate frontend consuming lore's JSON API via `lore serve`
- Real-time monitoring: webhook listener or polling daemon for change notifications
- IDE integration: editor plugins surfacing temporal context inline