

Phase B: Temporal Intelligence Foundation

Status: Draft
Prerequisite: CP3 Gates B+C complete (working search + sync pipeline)
Goal: Transform gitlore from a search engine into a temporal code intelligence system by ingesting structured event data from GitLab and exposing temporal queries that answer "why" and "when" questions about project history.


Motivation

gitlore currently stores snapshots — the latest state of each issue, MR, and discussion. But temporal queries need change history. When an issue's labels change from priority::low to priority::critical, the current schema overwrites the label junction. The transition is lost.

GitLab issues, MRs, and discussions contain the raw ingredients for temporal intelligence: state transitions, label mutations, assignee changes, cross-references between entities, and decision rationale in discussions. What's missing is a structured temporal index that makes these ingredients queryable.

The Problem This Solves

Today, when an AI agent or developer asks "Why did the team switch from REST to GraphQL?" or "What happened with the auth migration?", the answer is scattered across paginated API responses with no temporal index, no cross-referencing, and no semantic layer. Reconstructing a decision timeline manually takes 20+ minutes of clicking through GitLab's UI. This phase makes it take 2 seconds.

Forcing Function

This phase is designed around one concrete question: "What happened with X?" — where X is any keyword, feature name, or initiative. If lore timeline "auth migration" can produce a useful, chronologically-ordered narrative of all related events across issues, MRs, and discussions, the architecture is validated. If it can't, we learn what's missing before investing in deeper temporal features.


Executive Summary (Gated Milestones)

Five gates, each independently verifiable and shippable:

Gate 1 (Resource Events Ingestion): Structured event data from GitLab APIs → local event tables
Gate 2 (Cross-Reference Extraction): Entity relationship graph from structured APIs + system note parsing
Gate 3 (Decision Timeline): lore timeline command — keyword-driven chronological narrative
Gate 4 (File Decision History): lore file-history command — MR-to-file linking + scoped timelines
Gate 5 (Code Trace): lore trace command — file:line → commit → MR → issue → rationale chain

Key Design Decisions

  • Structured APIs over text parsing. GitLab provides Resource Events APIs (resource_state_events, resource_label_events, resource_milestone_events) that return clean JSON. These are the primary data source for temporal events. System note parsing is a fallback for events without structured APIs (assignee changes, cross-references).
  • Dependent resource pattern. Resource events are fetched per-entity, triggered by the existing dirty source tracking. Same architecture as discussion fetching — queue-based, resumable, incremental.
  • Configurable event ingestion. New config flag sync.fetchResourceEvents (default true) controls whether the sync pipeline fetches event data. Users who don't need temporal features can disable it to skip the additional API calls.
  • Application-level graph traversal. Cross-reference expansion uses BFS in Rust, not recursive SQL CTEs. Capped at configurable depth (default 1) for predictable performance.
  • Evolutionary library extraction. New commands are built with typed return structs from day one. Old commands are not retrofitted until a concrete consumer (MCP server, web UI) requires it.
  • Phase A fields cherry-picked as needed. merge_commit_sha and squash_commit_sha are added in this phase's migration. Remaining Phase A fields are handled in their own migration later.

Scope Boundaries

In scope:

  • Batch temporal queries over historical data
  • Structured event ingestion from GitLab APIs
  • Cross-reference graph construction
  • CLI commands with robot mode JSON output

Out of scope (future phases):

  • Real-time monitoring / notifications ("alert me when my code changes")
  • MCP server (Phase C — consumes the library API this phase produces)
  • Web UI (Phase D — consumes the same library API)
  • Pattern evolution / cross-project trend detection (Phase C)
  • Library extraction refactor (happens organically as new commands are added)

Gate 1: Resource Events Ingestion

1.1 Rationale: Why Not Parse System Notes?

The original approach was to parse system note body text with regex to extract state changes and label mutations. Research revealed this is the wrong approach:

  1. Structured APIs exist. GitLab's Resource Events APIs return clean JSON with explicit action, state, and label fields. Available on all tiers (Free, Premium, Ultimate).
  2. System notes are localized. A French GitLab instance says "ajouté l'étiquette ~bug" — regex breaks for non-English instances.
  3. Label events aren't in the Notes API. Per GitLab Issue #24661, label change system notes are not returned by the Notes API. The Resource Label Events API is the only reliable source.
  4. No versioned format spec. System note text has changed across GitLab 14.x through 17.x with no documentation of format changes.

System note parsing is still used for events without structured APIs (see Gate 2), but with the explicit understanding that it's best-effort and fragile for non-English instances.

1.2 Schema (Migration 010)

File: migrations/010_resource_events.sql

-- State change events (opened, closed, reopened, merged, locked)
-- Source: GET /projects/:id/issues/:iid/resource_state_events
-- Source: GET /projects/:id/merge_requests/:iid/resource_state_events
CREATE TABLE resource_state_events (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
  issue_id INTEGER REFERENCES issues(id) ON DELETE CASCADE,
  merge_request_id INTEGER REFERENCES merge_requests(id) ON DELETE CASCADE,
  state TEXT NOT NULL,                -- 'opened' | 'closed' | 'reopened' | 'merged' | 'locked'
  actor_gitlab_id INTEGER,            -- GitLab user ID (stable; usernames can change)
  actor_username TEXT,                -- display/search convenience
  created_at INTEGER NOT NULL,        -- ms epoch UTC
  -- "closed by MR" link: structured by GitLab, not parsed from text
  source_merge_request_id INTEGER,    -- GitLab's MR iid that caused this state change
  source_commit TEXT,                 -- commit SHA that caused this state change
  UNIQUE(gitlab_id, project_id),
  CHECK (
    (issue_id IS NOT NULL AND merge_request_id IS NULL)
    OR (issue_id IS NULL AND merge_request_id IS NOT NULL)
  )
);

CREATE INDEX idx_state_events_issue ON resource_state_events(issue_id)
  WHERE issue_id IS NOT NULL;
CREATE INDEX idx_state_events_mr ON resource_state_events(merge_request_id)
  WHERE merge_request_id IS NOT NULL;
CREATE INDEX idx_state_events_created ON resource_state_events(created_at);

-- Label change events (add, remove)
-- Source: GET /projects/:id/issues/:iid/resource_label_events
-- Source: GET /projects/:id/merge_requests/:iid/resource_label_events
CREATE TABLE resource_label_events (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
  issue_id INTEGER REFERENCES issues(id) ON DELETE CASCADE,
  merge_request_id INTEGER REFERENCES merge_requests(id) ON DELETE CASCADE,
  label_name TEXT NOT NULL,
  action TEXT NOT NULL CHECK (action IN ('add', 'remove')),
  actor_gitlab_id INTEGER,            -- GitLab user ID (stable; usernames can change)
  actor_username TEXT,                -- display/search convenience
  created_at INTEGER NOT NULL,        -- ms epoch UTC
  UNIQUE(gitlab_id, project_id),
  CHECK (
    (issue_id IS NOT NULL AND merge_request_id IS NULL)
    OR (issue_id IS NULL AND merge_request_id IS NOT NULL)
  )
);

CREATE INDEX idx_label_events_issue ON resource_label_events(issue_id)
  WHERE issue_id IS NOT NULL;
CREATE INDEX idx_label_events_mr ON resource_label_events(merge_request_id)
  WHERE merge_request_id IS NOT NULL;
CREATE INDEX idx_label_events_created ON resource_label_events(created_at);
CREATE INDEX idx_label_events_label ON resource_label_events(label_name);

-- Milestone change events (add, remove)
-- Source: GET /projects/:id/issues/:iid/resource_milestone_events
-- Source: GET /projects/:id/merge_requests/:iid/resource_milestone_events
CREATE TABLE resource_milestone_events (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
  issue_id INTEGER REFERENCES issues(id) ON DELETE CASCADE,
  merge_request_id INTEGER REFERENCES merge_requests(id) ON DELETE CASCADE,
  milestone_title TEXT NOT NULL,
  milestone_id INTEGER,
  action TEXT NOT NULL CHECK (action IN ('add', 'remove')),
  actor_gitlab_id INTEGER,            -- GitLab user ID (stable; usernames can change)
  actor_username TEXT,                -- display/search convenience
  created_at INTEGER NOT NULL,        -- ms epoch UTC
  UNIQUE(gitlab_id, project_id),
  CHECK (
    (issue_id IS NOT NULL AND merge_request_id IS NULL)
    OR (issue_id IS NULL AND merge_request_id IS NOT NULL)
  )
);

CREATE INDEX idx_milestone_events_issue ON resource_milestone_events(issue_id)
  WHERE issue_id IS NOT NULL;
CREATE INDEX idx_milestone_events_mr ON resource_milestone_events(merge_request_id)
  WHERE merge_request_id IS NOT NULL;
CREATE INDEX idx_milestone_events_created ON resource_milestone_events(created_at);

1.3 Config Extension

File: src/core/config.rs

Add to SyncConfig:

/// Fetch resource events (state, label, milestone changes) during sync.
/// Increases API calls but enables temporal queries (lore timeline, etc.).
/// Default: true
#[serde(default = "default_true")]
pub fetch_resource_events: bool,

The serde default attribute references a module-level helper, which must exist alongside SyncConfig (if not already present):

fn default_true() -> bool {
    true
}

Config file example:

{
  "sync": {
    "fetchResourceEvents": true
  }
}

1.4 GitLab API Client

New endpoints in src/gitlab/client.rs:

GET /projects/:id/issues/:iid/resource_state_events?per_page=100
GET /projects/:id/issues/:iid/resource_label_events?per_page=100
GET /projects/:id/merge_requests/:iid/resource_state_events?per_page=100
GET /projects/:id/merge_requests/:iid/resource_label_events?per_page=100
GET /projects/:id/issues/:iid/resource_milestone_events?per_page=100
GET /projects/:id/merge_requests/:iid/resource_milestone_events?per_page=100

All endpoints use standard pagination. Fetch all pages per entity.
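The "fetch all pages" drain can be sketched as below. `fetch_page` is a stand-in for the real client call, and the short-page termination rule is one common convention; the real client may instead follow GitLab's `x-next-page` response header.

```rust
// Sketch: drain every page of a paginated endpoint. `fetch_page` is a
// hypothetical stand-in for the HTTP client call; a page shorter than
// `per_page` is taken to mean we've reached the last one.
fn fetch_all_pages<T>(
    mut fetch_page: impl FnMut(u32) -> Vec<T>,
    per_page: usize,
) -> Vec<T> {
    let mut all = Vec::new();
    let mut page = 1;
    loop {
        let batch = fetch_page(page);
        let batch_len = batch.len();
        all.extend(batch);
        if batch_len < per_page {
            break; // short (or empty) page: no more results
        }
        page += 1;
    }
    all
}
```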

New serde types in src/gitlab/types.rs:

#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct GitLabStateEvent {
    pub id: i64,
    pub user: Option<GitLabAuthor>,
    pub created_at: String,
    pub resource_type: String,       // "Issue" | "MergeRequest"
    pub resource_id: i64,
    pub state: String,               // "opened" | "closed" | "reopened" | "merged" | "locked"
    pub source_commit: Option<String>,
    pub source_merge_request: Option<GitLabMergeRequestRef>,
}

#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct GitLabLabelEvent {
    pub id: i64,
    pub user: Option<GitLabAuthor>,
    pub created_at: String,
    pub resource_type: String,
    pub resource_id: i64,
    pub label: GitLabLabelRef,
    pub action: String,              // "add" | "remove"
}

#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct GitLabMilestoneEvent {
    pub id: i64,
    pub user: Option<GitLabAuthor>,
    pub created_at: String,
    pub resource_type: String,
    pub resource_id: i64,
    pub milestone: GitLabMilestoneRef,
    pub action: String,              // "add" | "remove"
}

1.5 Ingestion Pipeline

Architecture: Generic dependent-fetch queue, generalizing the pending_discussion_fetches pattern. A single queue table serves all dependent resource types across Gates 1, 2, and 4, avoiding schema churn as new fetch types are added.

New queue table (in migration 010):

-- Generic queue for all dependent resource fetches (events, closes_issues, diffs)
-- Replaces per-type queue tables with a unified job model
CREATE TABLE pending_dependent_fetches (
  id INTEGER PRIMARY KEY,
  project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
  entity_type TEXT NOT NULL CHECK (entity_type IN ('issue', 'merge_request')),
  entity_iid INTEGER NOT NULL,
  entity_local_id INTEGER NOT NULL,
  job_type TEXT NOT NULL CHECK (job_type IN (
    'resource_events',      -- Gate 1: state + label + milestone events
    'mr_closes_issues',     -- Gate 2: closes_issues API
    'mr_diffs'              -- Gate 4: MR file changes
  )),
  payload_json TEXT,          -- job-specific params, e.g. {"event_types":["state","label","milestone"]}
  enqueued_at INTEGER NOT NULL,
  attempts INTEGER NOT NULL DEFAULT 0,
  last_error TEXT,
  next_retry_at INTEGER,
  locked_at INTEGER,          -- crash recovery: NULL = available, non-NULL = in progress
  UNIQUE(project_id, entity_type, entity_iid, job_type)
);

The locked_at column provides crash recovery: if a sync process crashes mid-drain, stale locks (older than 5 minutes) are automatically reclaimed on the next lore sync run. This is intentionally minimal — full job leasing with locked_by and lease expiration is unnecessary for a single-process CLI tool.

Flow:

  1. During issue/MR ingestion, when an entity is upserted (new or updated), enqueue jobs in pending_dependent_fetches:
    • For all entities: job_type = 'resource_events' (when fetchResourceEvents is true)
    • For MRs: job_type = 'mr_closes_issues' (always, for Gate 2)
    • For MRs: job_type = 'mr_diffs' (when fetchMrFileChanges is true, for Gate 4)
  2. After primary ingestion completes, drain the dependent fetch queue:
    • Claim jobs: UPDATE ... SET locked_at = now WHERE locked_at IS NULL AND (next_retry_at IS NULL OR next_retry_at <= now)
    • For each job, dispatch by job_type to the appropriate fetcher
    • On success: DELETE the job row
    • On transient failure: increment attempts, set next_retry_at with exponential backoff, clear locked_at
  3. lore sync drains dependent jobs after ingestion + discussion fetch steps.

Incremental behavior: Only entities that changed since last sync are enqueued. On --full sync, all entities are re-enqueued.
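The claim/retry bookkeeping in step 2 might look like the following sketch. The delay constants are assumptions; only the 5-minute stale-lock threshold comes from the design above.

```rust
// Retry scheduling for failed dependent fetches. Base delay and cap are
// illustrative assumptions, not settled values.
const BASE_DELAY_MS: i64 = 30_000;   // 30s after the first failure
const MAX_DELAY_MS: i64 = 3_600_000; // cap at 1 hour
const STALE_LOCK_MS: i64 = 300_000;  // 5 min, per the crash-recovery rule

/// Exponential backoff: 30s, 60s, 120s, ... capped at 1h.
fn next_retry_at(now_ms: i64, attempts: u32) -> i64 {
    let delay = BASE_DELAY_MS.saturating_mul(1_i64 << attempts.min(20));
    now_ms + delay.min(MAX_DELAY_MS)
}

/// A job row is claimable if it is unlocked (or its lock is stale from a
/// crashed sync) and it is not waiting on a future retry.
fn is_claimable(locked_at: Option<i64>, next_retry_at: Option<i64>, now_ms: i64) -> bool {
    let lock_ok = match locked_at {
        None => true,
        Some(t) => now_ms - t > STALE_LOCK_MS, // reclaim stale lock
    };
    lock_ok && next_retry_at.map_or(true, |t| t <= now_ms)
}
```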

1.6 API Call Budget

Per entity: 3 API calls (state + label + milestone) for issues, 3 for MRs.

| Scenario | Entities | API Calls | Time at 2k req/min |
|---|---|---|---|
| Initial sync, 500 issues + 200 MRs | 700 | 2,100 | ~1 min |
| Initial sync, 2,000 issues + 1,000 MRs | 3,000 | 9,000 | ~4.5 min |
| Incremental sync, 20 changed entities | 20 | 60 | <2 sec |

Acceptable for initial sync. Incremental sync adds negligible overhead.

Optimization (future): If milestone events prove low-value, make them opt-in to reduce calls by 1/3.
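As a sanity check on the budget figures above, the arithmetic reduces to two small functions (the 2,000 req/min rate limit is the assumption used throughout this section):

```rust
// 3 resource-event calls (state + label + milestone) per entity.
fn event_api_calls(issues: u64, mrs: u64) -> u64 {
    (issues + mrs) * 3
}

// Wall-clock minutes if the rate limit is the bottleneck.
fn minutes_at_rate(calls: u64, reqs_per_min: u64) -> f64 {
    calls as f64 / reqs_per_min as f64
}
```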

1.7 Acceptance Criteria

  • Migration 010 creates all three event tables + generic dependent fetch queue
  • lore sync fetches resource events for changed entities when fetchResourceEvents is true
  • lore sync --no-events skips event fetching
  • Event fetch failures are queued for retry with exponential backoff
  • Stale locks (crashed sync) automatically reclaimed on next run
  • lore count events shows event counts by type
  • lore stats --check validates event table referential integrity
  • lore stats --check validates dependent job queue health (no stuck locks, retryable jobs visible)
  • Robot mode JSON for all new commands

Gate 2: Cross-Reference Extraction

2.1 Rationale

Temporal queries need to follow links between entities: "MR !567 closed issue #234", "issue #234 mentioned in MR !567", "#299 was opened as a follow-up to !567". These relationships are captured in two places:

  1. Structured API: GET /projects/:id/merge_requests/:iid/closes_issues returns issues that close when the MR merges. Also, resource_state_events includes source_merge_request_id for "closed by MR" events.
  2. System notes: Cross-references like "mentioned in !456" and "closed by !789" appear in system note body text.

2.2 Schema (in Migration 010)

-- Cross-references between entities
-- Populated from: closes_issues API, state events, system note parsing
--
-- Directionality convention:
--   source = the entity where the reference was *observed* (contains the note, or is the MR in closes_issues)
--   target = the entity being *referenced* (the issue closed, the MR mentioned)
--   This is consistent across all source_methods and enables predictable BFS traversal.
--
-- Unresolved references: when a cross-reference points to an entity in a project
-- that isn't synced locally, target_entity_id is NULL but target_project_path and
-- target_entity_iid are populated. This preserves valuable edges rather than
-- silently dropping them. Timeline output marks these as "[external]".
CREATE TABLE entity_references (
  id INTEGER PRIMARY KEY,
  source_entity_type TEXT NOT NULL CHECK (source_entity_type IN ('issue', 'merge_request')),
  source_entity_id INTEGER NOT NULL,   -- local DB id
  target_entity_type TEXT NOT NULL CHECK (target_entity_type IN ('issue', 'merge_request')),
  target_entity_id INTEGER,            -- local DB id (NULL when target is unresolved/external)
  target_project_path TEXT,            -- e.g. "group/other-repo" (populated for cross-project refs)
  target_entity_iid INTEGER,           -- GitLab iid (populated when target_entity_id is NULL)
  reference_type TEXT NOT NULL,        -- 'closes' | 'mentioned' | 'related'
  source_method TEXT NOT NULL,         -- 'api_closes_issues' | 'api_state_event' | 'system_note_parse'
  created_at INTEGER                   -- when the reference was created (if known)
);

-- SQLite only allows expressions in CREATE UNIQUE INDEX, not in a table-level
-- UNIQUE constraint, so deduplication lives in an expression index:
CREATE UNIQUE INDEX idx_refs_unique ON entity_references(
  source_entity_type, source_entity_id, target_entity_type,
  COALESCE(target_entity_id, -1), COALESCE(target_project_path, ''),
  COALESCE(target_entity_iid, -1), reference_type
);

CREATE INDEX idx_refs_source ON entity_references(source_entity_type, source_entity_id);
CREATE INDEX idx_refs_target ON entity_references(target_entity_type, target_entity_id)
  WHERE target_entity_id IS NOT NULL;
CREATE INDEX idx_refs_unresolved ON entity_references(target_project_path, target_entity_iid)
  WHERE target_entity_id IS NULL;

2.3 Population Strategy

Tier 1 — Structured APIs (reliable):

  1. closes_issues endpoint: After MR ingestion, fetch GET /projects/:id/merge_requests/:iid/closes_issues. Insert reference_type = 'closes', source_method = 'api_closes_issues'. Source = MR, target = issue.
  2. State events: When resource_state_events contains source_merge_request_id, insert reference_type = 'closes', source_method = 'api_state_event'. Source = MR (referenced by iid), target = issue (that received the state change).

Tier 2 — System note parsing (best-effort):

Parse system notes where is_system = 1 for cross-reference patterns.

Directionality rule: Source = entity containing the system note. Target = entity referenced by the note text. This is consistent with Tier 1's convention.

mentioned in !{iid}
mentioned in #{iid}
mentioned in {group}/{project}!{iid}
mentioned in {group}/{project}#{iid}
closed by !{iid}
closed by #{iid}

Cross-project references: When a system note references {group}/{project}#{iid} and the target project is not synced locally, store with target_entity_id = NULL, target_project_path = '{group}/{project}', target_entity_iid = {iid}. These unresolved references are still valuable for timeline narratives — they indicate external dependencies and decision context even when we can't traverse further.

Insert with source_method = 'system_note_parse'. Accept that:

  • This breaks on non-English GitLab instances
  • Format may vary across GitLab versions
  • Log parse failures at debug level for monitoring
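A minimal std-only sketch of the Tier 2 parser, covering the patterns listed above. The `ParsedRef` shape and function name are illustrative, not the actual implementation.

```rust
// Best-effort system note parsing (Tier 2). English-only, by design.
#[derive(Debug, PartialEq)]
struct ParsedRef {
    reference_type: &'static str,   // "closes" | "mentioned"
    target_type: &'static str,      // "issue" | "merge_request"
    target_project: Option<String>, // Some(..) for cross-project refs
    target_iid: i64,
}

fn parse_system_note(body: &str) -> Option<ParsedRef> {
    let (reference_type, rest) = if let Some(r) = body.strip_prefix("mentioned in ") {
        ("mentioned", r)
    } else if let Some(r) = body.strip_prefix("closed by ") {
        ("closes", r)
    } else {
        return None; // not a cross-reference note
    };
    // Split at the sigil: '!' = merge request, '#' = issue. Anything before
    // the sigil is a cross-project path like "group/other-repo".
    let sigil_pos = rest.find(|c: char| c == '!' || c == '#')?;
    let (path, sigil_and_iid) = rest.split_at(sigil_pos);
    let target_type = if sigil_and_iid.starts_with('!') { "merge_request" } else { "issue" };
    let iid: i64 = sigil_and_iid[1..]
        .trim_end_matches(|c: char| !c.is_ascii_digit())
        .parse()
        .ok()?;
    Some(ParsedRef {
        reference_type,
        target_type,
        target_project: if path.is_empty() { None } else { Some(path.to_string()) },
        target_iid: iid,
    })
}
```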

Tier 3 — Description/body parsing (deferred):

Issue and MR descriptions often contain #123 or !456 references. Parsing these is lower confidence (mentions != relationships) and is deferred to a future iteration.

2.4 Ingestion Flow

The closes_issues fetch uses the generic dependent fetch queue (job_type = 'mr_closes_issues'):

  • After MR ingestion, a mr_closes_issues job is enqueued alongside resource_events jobs
  • One additional API call per MR: GET /projects/:id/merge_requests/:iid/closes_issues
  • Cross-reference parsing from system notes runs as a local post-processing step (no API calls) after all dependent fetches complete

2.5 Acceptance Criteria

  • entity_references table populated from closes_issues API for all synced MRs
  • entity_references table populated from resource_state_events where source_merge_request_id is present
  • System notes parsed for cross-reference patterns (English instances)
  • Cross-project references stored as unresolved when target project is not synced
  • source_method column tracks provenance of each reference
  • References are deduplicated (same relationship from multiple sources stored once)
  • Timeline JSON includes expansion provenance (via) for all expanded entities

Gate 3: Decision Timeline (lore timeline)

3.1 Command Design

# Basic: keyword-driven timeline
lore timeline "auth migration"

# Scoped to project
lore timeline "auth migration" -p group/repo

# Limit date range
lore timeline "auth migration" --since 6m
lore timeline "auth migration" --since 2024-01-01

# Control cross-reference expansion depth
lore timeline "auth migration" --depth 0    # No expansion (matched entities only)
lore timeline "auth migration" --depth 1    # Follow direct references (default)
lore timeline "auth migration" --depth 2    # Two hops

# Control which edge types are followed during expansion
lore timeline "auth migration" --expand-mentions   # Also follow 'mentioned' edges (off by default)
# Default expansion follows 'closes' and 'related' edges only.
# 'mentioned' edges are excluded by default because they have high fan-out
# and often connect tangentially related entities.

# Limit results
lore timeline "auth migration" -n 50

# Robot mode
lore -J timeline "auth migration"
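One possible shape for --since duration parsing is sketched below. The unit table, the 30-day month approximation, and the function name are assumptions; absolute dates like 2024-01-01 would take a separate parsing branch not shown here.

```rust
// Hypothetical sketch: turn a relative duration like "6m" into a cutoff
// timestamp (ms epoch). Months are approximated as 30 days.
fn since_to_cutoff_ms(now_ms: i64, since: &str) -> Option<i64> {
    let (num, unit) = since.split_at(since.len().checked_sub(1)?);
    let n: i64 = num.parse().ok()?;
    let unit_ms: i64 = match unit {
        "d" => 86_400_000,
        "w" => 7 * 86_400_000,
        "m" => 30 * 86_400_000, // approximate month
        "y" => 365 * 86_400_000,
        _ => return None, // unknown unit (or an absolute date: parsed elsewhere)
    };
    Some(now_ms - n * unit_ms)
}
```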

3.2 Query Flow

1. SEED: FTS5 keyword search → matched document IDs (issues, MRs, and notes/discussions)
   ↓
2. HYDRATE:
   - Map document IDs → source entities (issues, MRs)
   - Collect top matched notes as evidence candidates (bounded, default top 10)
     These are the actual decision-bearing comments that answer "why"
   ↓
3. EXPAND: Follow entity_references (BFS, depth-limited)
   → Discover related entities not matched by keywords
   → Default: follow 'closes' + 'related' edges; skip 'mentioned' unless --expand-mentions
   → Unresolved (external) references included in output but not traversed further
   ↓
4. COLLECT EVENTS: For all entities (seed + expanded):
   - Entity creation (created_at from issues/merge_requests)
   - State changes (resource_state_events)
   - Label changes (resource_label_events)
   - Milestone changes (resource_milestone_events)
   - Evidence notes: top FTS5-matched notes as discrete events (snippet + author + url)
   - Merge events (merged_at from merge_requests)
   ↓
5. INTERLEAVE: Sort all events chronologically
   ↓
6. RENDER: Format as timeline (human or JSON)

Why evidence notes instead of "discussion activity summarized": The forcing function is "What happened with X?" A timeline entry that says "3 new comments" doesn't answer why — it answers how many. By including the top FTS5-matched notes as first-class timeline events, the timeline surfaces the actual decision rationale, code review feedback, and architectural reasoning that motivated changes. This uses the existing search infrastructure (CP3) with no new indexing required.
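Step 3's application-level BFS might look like the following sketch. The in-memory edge representation is illustrative; real edges come from entity_references rows loaded out of SQLite.

```rust
use std::collections::{HashSet, VecDeque};

// (0 = issue, 1 = merge_request; local DB id) -- illustrative encoding.
type EntityId = (u8, i64);

/// Depth-capped BFS over cross-reference edges. 'mentioned' edges are
/// skipped unless --expand-mentions; edges are followed in both
/// directions, since a reference connects its endpoints regardless of
/// where it was observed.
fn expand(
    seeds: &[EntityId],
    edges: &[(EntityId, EntityId, &str)],
    max_depth: u32,
    expand_mentions: bool,
) -> HashSet<EntityId> {
    let mut seen: HashSet<EntityId> = seeds.iter().copied().collect();
    let mut queue: VecDeque<(EntityId, u32)> = seeds.iter().map(|&e| (e, 0)).collect();
    while let Some((node, depth)) = queue.pop_front() {
        if depth >= max_depth {
            continue; // cap reached: keep the node, don't traverse further
        }
        for &(src, dst, ref_type) in edges {
            if ref_type == "mentioned" && !expand_mentions {
                continue;
            }
            let next = if src == node { dst } else if dst == node { src } else { continue };
            if seen.insert(next) {
                queue.push_back((next, depth + 1));
            }
        }
    }
    seen
}
```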

3.3 Event Model

The timeline doesn't store a separate unified event table. Instead, it queries across the existing tables at read time and produces a virtual event stream:

pub struct TimelineEvent {
    pub timestamp: i64,              // ms epoch
    pub entity_type: String,         // "issue" | "merge_request" | "discussion"
    pub entity_iid: i64,
    pub project_path: String,
    pub event_type: TimelineEventType,
    pub summary: String,             // human-readable one-liner
    pub actor: Option<String>,       // username
    pub url: Option<String>,
    pub is_seed: bool,               // matched by keyword (vs. expanded via reference)
}

pub enum TimelineEventType {
    Created,                         // entity opened/created
    StateChanged { state: String },  // closed, reopened, merged, locked
    LabelAdded { label: String },
    LabelRemoved { label: String },
    MilestoneSet { milestone: String },
    MilestoneRemoved { milestone: String },
    Merged,
    NoteEvidence {                    // FTS5-matched note surfacing decision rationale
        note_id: i64,
        snippet: String,             // first ~200 chars of the matching note body
        discussion_id: Option<i64>,
    },
    CrossReferenced { target: String },
}
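The ~200-char snippet carried by NoteEvidence could be derived as below; truncation happens on a char boundary so multi-byte text is never split (the helper name is hypothetical).

```rust
/// First `max_chars` characters of a note body, with a marker when the
/// body was truncated. Operates on chars, not bytes, so UTF-8 is safe.
fn snippet(body: &str, max_chars: usize) -> String {
    let mut s: String = body.chars().take(max_chars).collect();
    if body.chars().count() > max_chars {
        s.push_str("...");
    }
    s
}
```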

3.4 Human Output Format

lore timeline "auth migration"

Timeline: "auth migration" (12 events across 4 entities)
───────────────────────────────────────────────────────

2024-03-15  CREATED   #234  Migrate to OAuth2                    @alice
            Labels: ~auth, ~breaking-change
2024-03-18  CREATED   !567  feat: add OAuth2 provider            @bob
            References: #234
2024-03-20  NOTE      #234  "Should we support SAML too? I think  @charlie
                             we should stick with OAuth2 for now..."
2024-03-22  LABEL     !567  added ~security-review               @alice
2024-03-24  NOTE      !567  [src/auth/oauth.rs:45]               @dave
                             "Consider refresh token rotation to
                              prevent session fixation attacks"
2024-03-25  MERGED    !567  feat: add OAuth2 provider            @alice
2024-03-26  CLOSED    #234  closed by !567                       @alice
2024-03-28  CREATED   #299  OAuth2 login fails for SSO users     @dave   [expanded]
            (via !567, closes)

───────────────────────────────────────────────────────
Seed entities: #234, !567 | Expanded: #299 (depth 1, via !567)

Entities discovered via cross-reference expansion are marked [expanded] with a compact provenance note showing which seed entity and edge type led to their discovery.

Evidence notes (NOTE events) show the first ~200 characters of FTS5-matched note bodies. These are the actual decision-bearing comments that answer "why" — not just activity counts.

3.5 Robot Mode JSON

{
  "ok": true,
  "data": {
    "query": "auth migration",
    "event_count": 12,
    "seed_entities": [
      { "type": "issue", "iid": 234, "project": "group/repo" },
      { "type": "merge_request", "iid": 567, "project": "group/repo" }
    ],
    "expanded_entities": [
      {
        "type": "issue",
        "iid": 299,
        "project": "group/repo",
        "depth": 1,
        "via": {
          "from": { "type": "merge_request", "iid": 567, "project": "group/repo" },
          "reference_type": "closes",
          "source_method": "api_closes_issues"
        }
      }
    ],
    "unresolved_references": [
      {
        "source": { "type": "merge_request", "iid": 567, "project": "group/repo" },
        "target_project": "group/other-repo",
        "target_type": "issue",
        "target_iid": 42,
        "reference_type": "mentioned"
      }
    ],
    "events": [
      {
        "timestamp": "2024-03-15T10:00:00Z",
        "entity_type": "issue",
        "entity_iid": 234,
        "project": "group/repo",
        "event_type": "created",
        "summary": "Migrate to OAuth2",
        "actor": "alice",
        "url": "https://gitlab.com/group/repo/-/issues/234",
        "is_seed": true,
        "details": {
          "labels": ["auth", "breaking-change"]
        }
      },
      {
        "timestamp": "2024-03-20T14:30:00Z",
        "entity_type": "issue",
        "entity_iid": 234,
        "project": "group/repo",
        "event_type": "note_evidence",
        "summary": "Should we support SAML too? I think we should stick with OAuth2 for now...",
        "actor": "charlie",
        "url": "https://gitlab.com/group/repo/-/issues/234#note_12345",
        "is_seed": true,
        "details": {
          "note_id": 12345,
          "snippet": "Should we support SAML too? I think we should stick with OAuth2 for now..."
        }
      }
    ]
  },
  "meta": {
    "search_mode": "lexical",
    "expansion_depth": 1,
    "expand_mentions": false,
    "total_entities": 3,
    "total_events": 12,
    "evidence_notes_included": 4,
    "unresolved_references": 1
  }
}

3.6 Acceptance Criteria

  • lore timeline <query> returns chronologically ordered events
  • Seed entities found via FTS5 keyword search (issues, MRs, and notes)
  • State, label, and milestone events interleaved from resource event tables
  • Entity creation and merge events included
  • Evidence-bearing notes included as note_evidence events (top FTS5 matches, bounded default 10)
  • Cross-reference expansion follows entity_references to configurable depth
  • Default expansion follows closes + related edges; --expand-mentions adds mentioned edges
  • --depth 0 disables expansion
  • --since filters by event timestamp
  • -p scopes to project
  • Human output is colored and readable
  • Robot mode returns structured JSON with expansion provenance (via) for expanded entities
  • Unresolved (external) references included in JSON output

Gate 4: File Decision History (lore file-history)

4.1 Schema (Migration 011)

File: migrations/011_file_changes.sql

-- Files changed by each merge request
-- Source: GET /projects/:id/merge_requests/:iid/diffs
CREATE TABLE mr_file_changes (
  id INTEGER PRIMARY KEY,
  merge_request_id INTEGER NOT NULL REFERENCES merge_requests(id) ON DELETE CASCADE,
  project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
  old_path TEXT,                      -- NULL for new files
  new_path TEXT NOT NULL,
  change_type TEXT NOT NULL CHECK (change_type IN ('added', 'modified', 'deleted', 'renamed')),
  UNIQUE(merge_request_id, new_path)
);

CREATE INDEX idx_mr_files_new_path ON mr_file_changes(new_path);
CREATE INDEX idx_mr_files_old_path ON mr_file_changes(old_path)
  WHERE old_path IS NOT NULL;
CREATE INDEX idx_mr_files_mr ON mr_file_changes(merge_request_id);

-- Add commit SHAs to merge_requests (cherry-picked from Phase A)
-- These link MRs to actual git history
ALTER TABLE merge_requests ADD COLUMN merge_commit_sha TEXT;
ALTER TABLE merge_requests ADD COLUMN squash_commit_sha TEXT;

4.2 Config Extension

{
  "sync": {
    "fetchMrFileChanges": true
  }
}

Opt-in. When enabled, the sync pipeline fetches GET /projects/:id/merge_requests/:iid/diffs for each changed MR and extracts file metadata. Diff content is not stored — only file paths and change types.

4.3 Ingestion

Uses the generic dependent fetch queue (job_type = 'mr_diffs'):

  1. After MR ingestion, if fetchMrFileChanges is true, enqueue a mr_diffs job in pending_dependent_fetches.
  2. Parse response: changes[].{old_path, new_path, new_file, renamed_file, deleted_file}.
  3. Derive change_type:
    • new_file == true → 'added'
    • renamed_file == true → 'renamed'
    • deleted_file == true → 'deleted'
    • else → 'modified'
  4. Upsert into mr_file_changes. On re-sync, DELETE existing rows for the MR and re-insert (diffs can change if MR is rebased).

API call cost: 1 additional call per MR. Acceptable for incremental sync (10-50 MRs/day).
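
The step-3 derivation can be sketched as a small mapping function. This is a minimal sketch, not the real implementation: the DiffEntry struct and change_type function names are illustrative, though the flag fields mirror the diffs API payload.

```rust
// Sketch only: map GitLab diff flags to the change_type stored in
// mr_file_changes. Struct and function names are illustrative.
struct DiffEntry {
    new_file: bool,
    renamed_file: bool,
    deleted_file: bool,
}

fn change_type(d: &DiffEntry) -> &'static str {
    // Precedence mirrors the derivation list: added, renamed, deleted, else modified.
    if d.new_file {
        "added"
    } else if d.renamed_file {
        "renamed"
    } else if d.deleted_file {
        "deleted"
    } else {
        "modified"
    }
}

fn main() {
    let entry = DiffEntry { new_file: false, renamed_file: true, deleted_file: false };
    println!("{}", change_type(&entry)); // prints "renamed"
}
```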

4.4 Command Design

# Show decision history for a file
lore file-history src/auth/oauth.rs

# Scoped to project (required if file path exists in multiple projects)
lore file-history src/auth/oauth.rs -p group/repo

# Include discussions on the MRs
lore file-history src/auth/oauth.rs --discussions

# Follow rename chains (default: on)
lore file-history src/auth/oauth.rs                     # follows renames automatically
lore file-history src/auth/oauth.rs --no-follow-renames # disable rename chain resolution

# Limit results
lore file-history src/auth/oauth.rs -n 10

# Filter to merged MRs only
lore file-history src/auth/oauth.rs --merged

# Robot mode
lore -J file-history src/auth/oauth.rs

4.5 Query Logic

SELECT
  mr.iid,
  mr.title,
  mr.state,
  mr.author_username,
  mr.merged_at,
  mr.created_at,
  mr.web_url,
  mr.merge_commit_sha,
  mfc.change_type,
  mfc.old_path,
  (SELECT COUNT(*) FROM discussions d
   WHERE d.merge_request_id = mr.id) AS discussion_count,
  (SELECT COUNT(*) FROM notes n
   JOIN discussions d ON n.discussion_id = d.id
   WHERE d.merge_request_id = mr.id
     AND n.position_new_path = ?1) AS file_discussion_count
FROM mr_file_changes mfc
JOIN merge_requests mr ON mr.id = mfc.merge_request_id
WHERE mfc.new_path = ?1 OR mfc.old_path = ?1
ORDER BY COALESCE(mr.merged_at, mr.created_at) DESC;

For each MR, optionally fetch related issues via entity_references (Gate 2 data).

4.6 Rename Handling

File renames are tracked via old_path and resolved as bounded chains:

  1. Start with the query path in the path set: {src/auth/oauth.rs}
  2. Search mr_file_changes for rows where change_type = 'renamed' and either new_path or old_path is in the path set
  3. Add the other side of each rename to the path set
  4. Repeat until no new paths are discovered, up to a maximum of 10 hops (configurable)
  5. Use the full path set for the file history query

Safeguards:

  • Hop cap (default 10) prevents runaway expansion
  • Cycle detection: if a path is already in the set, skip it
  • The unioned path set is used for matching MRs in the main query

Output:

  • Human mode annotates the rename chain: "src/auth/oauth.rs (renamed from src/auth/handler.rs ← src/auth.rs)"
  • Robot mode JSON includes rename_chain: ["src/auth.rs", "src/auth/handler.rs", "src/auth/oauth.rs"]
  • --no-follow-renames disables chain resolution (matches only the literal path provided)
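
The expansion steps above amount to a fixed-point loop with a hop cap. A minimal sketch, assuming rename pairs pulled from mr_file_changes rows with change_type = 'renamed'; the expand_renames helper is hypothetical:

```rust
use std::collections::HashSet;

// Sketch only: bounded rename-chain expansion. `renames` holds
// (old_path, new_path) pairs; each outer iteration is one pass over the
// pairs. The HashSet doubles as cycle detection, since re-inserting a
// path already in the set is a no-op and does not mark growth.
fn expand_renames(start: &str, renames: &[(String, String)], max_hops: usize) -> HashSet<String> {
    let mut paths: HashSet<String> = HashSet::new();
    paths.insert(start.to_string());
    for _ in 0..max_hops {
        let mut grew = false;
        for (old, new) in renames {
            // Follow renames in both directions: forward and backward.
            if paths.contains(old) && paths.insert(new.clone()) {
                grew = true;
            }
            if paths.contains(new) && paths.insert(old.clone()) {
                grew = true;
            }
        }
        if !grew {
            break; // fixed point: no new paths discovered
        }
    }
    paths
}

fn main() {
    let renames = vec![
        ("src/auth.rs".to_string(), "src/auth/handler.rs".to_string()),
        ("src/auth/handler.rs".to_string(), "src/auth/oauth.rs".to_string()),
    ];
    let set = expand_renames("src/auth/oauth.rs", &renames, 10);
    println!("{} paths in the union", set.len()); // all names the file has had
}
```

The unioned set then feeds the WHERE new_path IN (...) OR old_path IN (...) match in the main query.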

4.7 Acceptance Criteria

  • mr_file_changes table populated from GitLab diffs API
  • merge_commit_sha and squash_commit_sha captured in merge_requests
  • lore file-history <path> returns MRs ordered by merge/creation date
  • Output includes: MR title, state, author, change type, discussion count
  • --discussions shows inline discussion snippets from DiffNotes on the file
  • Rename chains resolved with bounded hop count (default 10) and cycle detection
  • --no-follow-renames disables chain resolution
  • Robot mode JSON includes rename_chain when renames are detected
  • Robot mode JSON output
  • -p required when path exists in multiple projects (Ambiguous error)

Gate 5: Code Trace (lore trace)

5.1 Overview

lore trace answers "Why was this code introduced?" by tracing from a file (and optionally a line number) back through the MR and issue that motivated the change.

5.2 Two-Tier Architecture

Tier 1 — API-only (no local git required):

Uses merge_commit_sha and squash_commit_sha from the merge_requests table to link MRs to commits. Combined with mr_file_changes, this can answer "which MRs touched this file" and link to their motivating issues via entity_references.

This is equivalent to lore file-history enriched with issue context — effectively a file-scoped decision timeline.

Tier 2 — Git integration (requires local clone):

Uses git blame to map a specific line to a commit SHA, then resolves the commit to an MR via merge_commit_sha lookup. This provides line-level precision.

Gate 5 ships Tier 1 only. Tier 2 (git integration via git2-rs) is a future enhancement.

5.3 Command Design

# Trace a file's history (Tier 1: API-only)
lore trace src/auth/oauth.rs

# Trace a specific line (Tier 2: requires local git)
lore trace src/auth/oauth.rs:45

# Robot mode
lore -J trace src/auth/oauth.rs

5.4 Query Flow (Tier 1)

1. Find MRs that touched this file (mr_file_changes)
   ↓
2. For each MR, find related issues (entity_references WHERE reference_type = 'closes')
   ↓
3. For each issue, fetch discussions with rationale
   ↓
4. Build trace chain: file → MR → issue → discussions
   ↓
5. Order by merge date (most recent first)
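
Step 5's ordering matches the COALESCE(merged_at, created_at) fallback used by the Gate 4 query: unmerged MRs sort by creation date. A minimal in-memory sketch — the TraceEntry type and sort_trace helper are hypothetical, relying on ISO 8601 timestamps comparing correctly as strings:

```rust
// Sketch only: order trace entries most-recent-first, falling back to
// created_at for MRs that never merged. ISO 8601 timestamps sort
// lexicographically, so plain string comparison suffices here.
#[derive(Debug)]
struct TraceEntry {
    mr_iid: u64,
    merged_at: Option<String>,
    created_at: String,
}

fn sort_trace(mut entries: Vec<TraceEntry>) -> Vec<TraceEntry> {
    entries.sort_by(|a, b| {
        let ka = a.merged_at.as_deref().unwrap_or(&a.created_at);
        let kb = b.merged_at.as_deref().unwrap_or(&b.created_at);
        kb.cmp(ka) // descending: most recent first
    });
    entries
}

fn main() {
    let entries = vec![
        TraceEntry { mr_iid: 567, merged_at: Some("2024-03-25".into()), created_at: "2024-03-20".into() },
        TraceEntry { mr_iid: 701, merged_at: None, created_at: "2024-05-01".into() },
        TraceEntry { mr_iid: 612, merged_at: Some("2024-04-10".into()), created_at: "2024-04-01".into() },
    ];
    for e in sort_trace(entries) {
        println!("!{}", e.mr_iid);
    }
}
```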

5.5 Output Format (Human)

lore trace src/auth/oauth.rs

Trace: src/auth/oauth.rs
────────────────────────

!567  feat: add OAuth2 provider                    MERGED 2024-03-25
  → Closes #234: Migrate to OAuth2
  → 12 discussion comments, 4 on this file
  → Decision: Use rust-oauth2 crate (discussed in #234, comment by @alice)

!612  fix: token refresh race condition             MERGED 2024-04-10
  → Closes #299: OAuth2 login fails for SSO users
  → 5 discussion comments, 2 on this file
  → [src/auth/oauth.rs:45] "Add mutex around refresh to prevent double-refresh"

!701  refactor: extract TokenManager                MERGED 2024-05-01
  → Related: #312: Reduce auth module complexity
  → 3 discussion comments
  → Note: file was renamed from src/auth/handler.rs

5.6 Tier 2 Design Notes (Future — Not in This Phase)

When git integration is added:

  1. Add git2-rs dependency for native git operations
  2. Implement git blame -L <line>,<line> <file> to get commit SHA for a specific line
  3. Look up commit SHA in merge_requests.merge_commit_sha or merge_requests.squash_commit_sha
  4. If no match (commit was squashed), search merge_commit_sha for commits in the blame range
  5. Optional blame_cache table for performance (invalidated by content hash)

Known limitation: Squash commits break blame-to-MR mapping for individual commits within an MR. The squash commit SHA maps to the MR, but all lines show the same commit. This is a fundamental Git limitation documented in GitLab Forum #77146.

5.7 Acceptance Criteria (Tier 1 Only)

  • lore trace <file> shows MRs that touched the file with linked issues and discussion context
  • Output includes the MR → issue → discussion chain
  • Discussion snippets show DiffNote content on the traced file
  • Cross-references from entity_references used for MR→issue linking
  • Robot mode JSON output
  • Graceful handling when no MR data found ("Run lore sync with fetchMrFileChanges: true")

Migration Strategy

Migration Numbering

Phase B uses migration numbers starting at 010:

Migration | Content | Gate
--------- | ------- | ----
010 | Resource event tables, generic dependent fetch queue, entity_references | Gates 1, 2
011 | mr_file_changes, merge_commit_sha, squash_commit_sha | Gate 4

Phase A's complete field capture migration should use 012+ when implemented, skipping fields already added by 011 (merge_commit_sha, squash_commit_sha).

Backward Compatibility

  • All new tables are additive (no ALTER on existing data-bearing columns)
  • lore sync works without event data — temporal commands gracefully report "No event data. Run lore sync to populate."
  • Existing search, issues, mrs commands are unaffected

Risks and Mitigations

Identified During Premortem

Risk | Severity | Mitigation
---- | -------- | ----------
API call volume explosion (3 event calls per entity) | Medium | Incremental sync limits to changed entities; opt-in config flag
System note parsing fragile for non-English instances | Medium | Used only for assignee changes and cross-refs; source_method tracks provenance
GitLab diffs API returns large payloads | Low | Extract file metadata only, discard diff content
Cross-reference graph traversal unbounded | Medium | BFS depth capped at configurable limit (default 1); mentioned edges excluded by default
Cross-project references lost when target not synced | Medium | Unresolved references stored with target_entity_id = NULL; still appear in timeline output
Phase A migration numbering conflict | Low | Phase B uses 010-011; Phase A uses 012+
Timeline output lacks "why" evidence | Medium | Evidence-bearing notes from FTS5 included as first-class timeline events
Squash commits break blame-to-MR mapping | Medium | Tier 2 (git integration) deferred; Tier 1 uses file-level MR matching

Accepted Limitations

  • No real-time monitoring. Phase B is batch queries over historical data. "Notify me when my code changes" requires a different architecture (webhooks, polling daemon) and is out of scope.
  • No pattern evolution. Cross-project trend detection requires all of Phase B's infrastructure plus semantic clustering. Deferred to Phase C.
  • English-only system note parsing. Cross-reference extraction from system notes works reliably only for English-language GitLab instances. Structured API data works for all languages.
  • Bounded rename chain resolution. lore file-history resolves rename chains up to 10 hops with cycle detection. Pathological rename histories (>10 hops) are truncated.
  • Evidence notes are keyword-matched, not summarized. Timeline evidence notes are the raw FTS5-matched note text, not AI-generated summaries. This keeps the system deterministic and avoids LLM dependencies.

Success Metrics

Metric | Target
------ | ------
lore timeline query latency | < 200ms for typical queries (< 50 seed entities)
Timeline event coverage | State + label + creation + merge + evidence note events for all synced entities
Timeline evidence quality | Top 10 FTS5-matched notes included per query; at least 1 evidence note for queries matching discussion-bearing entities
Cross-reference coverage | > 80% of "closed by MR" relationships captured via structured API
Unresolved reference capture | Cross-project references stored even when target project is not synced
Incremental sync overhead | < 5% increase in sync time for event fetching
lore file-history coverage | File changes captured for all synced MRs (when opt-in enabled)
Rename chain resolution | Multi-hop renames correctly resolved up to 10 hops

Future Phases (Out of Scope)

Phase C: Advanced Temporal Features

  • Pattern Evolution: cross-project trend detection via embedding clusters
  • Git integration (Tier 2): git blame → commit → MR resolution
  • MCP server: expose timeline, file-history, trace as typed MCP tools

Phase D: Consumer Applications

  • Web UI: separate frontend consuming lore's JSON API via lore serve
  • Real-time monitoring: webhook listener or polling daemon for change notifications
  • IDE integration: editor plugins surfacing temporal context inline