feat: add JSONL source layer with directory scanner and byte-level parser

Implement the bottom layer of the data pipeline, covering discovery and
parsing of Claude Code session files:

- source/types.go: Raw JSON deserialization types (RawEntry,
  RawProgressData, RawMessage, RawUsage, CacheCreation) matching the
  Claude Code JSONL schema. DiscoveredFile carries file metadata
  including the decoded project name, session ID, and subagent
  relationship info.

- source/scanner.go: ScanDir walks ~/.claude/projects/ to discover
  all .jsonl session files. Detects subagent files by the
  <project>/<session>/subagents/agent-<id>.jsonl path pattern and
  links them to parent sessions. decodeProjectName reverses Claude
  Code's path-encoding convention (path segments joined with hyphens
  in place of "/") by scanning for known parent markers (projects,
  repos, src, code, workspace, dev) and taking everything after the
  last marker as the project name.

- source/parser.go: ParseFile processes a single JSONL session file.
  Uses a hybrid parsing strategy for performance:

  * "user" and "system" entries: byte-level field extraction for
    timestamps, cwd, and turn_duration (avoids the allocations of a
    full JSON unmarshal). extractTopLevelType tracks brace depth and
    string boundaries so that it matches only the top-level "type"
    field, and exits early about 400 bytes in, so the per-line cost is
    effectively O(1) regardless of line length.

  * "assistant" entries: full JSON unmarshal to extract token usage,
    model name, and cost data.

  Deduplicates API calls by message.id (keeping the last entry per
  ID, which holds the final billed usage). Computes per-model cost
  breakdown using config.CalculateCost and aggregates cache hit rate.
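
The byte-level scan described for extractTopLevelType can be sketched
roughly as follows. This is a minimal illustration written from the
description above, not the code in source/parser.go: the helper names
skipSpace and stringEnd are hypothetical, and the sketch assumes each
line is a single well-formed JSON object.

```go
package main

import (
	"bytes"
	"fmt"
)

var typeKey = []byte("type")

// skipSpace returns the index of the first non-whitespace byte at or after i.
func skipSpace(b []byte, i int) int {
	for i < len(b) && (b[i] == ' ' || b[i] == '\t' || b[i] == '\n' || b[i] == '\r') {
		i++
	}
	return i
}

// stringEnd returns the index of the closing quote of the JSON string whose
// first content byte is at start, or -1 if the string is unterminated.
func stringEnd(b []byte, start int) int {
	for j := start; j < len(b); j++ {
		switch b[j] {
		case '\\':
			j++ // skip the escaped byte so that \" does not end the string
		case '"':
			return j
		}
	}
	return -1
}

// extractTopLevelType returns the string value of the top-level "type" key,
// or "" if there is none. It returns at the first match, so bytes after the
// field are never examined.
func extractTopLevelType(line []byte) string {
	depth := 0
	for i := 0; i < len(line); i++ {
		switch line[i] {
		case '{', '[':
			depth++
		case '}', ']':
			depth--
		case '"':
			end := stringEnd(line, i+1)
			if end < 0 {
				return ""
			}
			token := line[i+1 : end]
			i = end // resume scanning after the closing quote
			if depth != 1 || !bytes.Equal(token, typeKey) {
				continue // nested key, or some other string
			}
			k := skipSpace(line, end+1)
			if k >= len(line) || line[k] != ':' {
				continue // a string value that happens to read "type"
			}
			k = skipSpace(line, k+1)
			if k >= len(line) || line[k] != '"' {
				return "" // non-string value
			}
			vend := stringEnd(line, k+1)
			if vend < 0 {
				return ""
			}
			return string(line[k+1 : vend])
		}
	}
	return ""
}

func main() {
	// The nested "type" and the quoted text inside "content" must both be ignored.
	line := []byte(`{"message":{"type":"message","content":"a \"type\": \"x\" }"},"type":"assistant"}`)
	fmt.Println(extractTopLevelType(line)) // prints "assistant"
}
```

Tracking string boundaries matters because message content can contain
braces and quoted text that would otherwise corrupt the depth count.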

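The keep-last deduplication by message.id might look like the sketch
below. It is an illustration under assumptions, not the ParseFile
implementation: dedupeAssistant and the trimmed lowercase types are
hypothetical, the sample lines and model name are invented, and entries
without a message ID or usage block are simply skipped.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Trimmed copies of the raw types, reduced to the fields this sketch reads.
type rawUsage struct {
	InputTokens  int64 `json:"input_tokens"`
	OutputTokens int64 `json:"output_tokens"`
}

type rawMessage struct {
	ID    string    `json:"id"`
	Model string    `json:"model"`
	Usage *rawUsage `json:"usage,omitempty"`
}

type rawEntry struct {
	Type    string      `json:"type"`
	Message *rawMessage `json:"message,omitempty"`
}

// dedupeAssistant (hypothetical) unmarshals each line, keeps only assistant
// entries, and retains the last entry seen for each message ID. Results stay
// in the order in which each ID first appeared.
func dedupeAssistant(lines []string) ([]rawMessage, error) {
	index := make(map[string]int) // message ID -> position in out
	var out []rawMessage
	for _, line := range lines {
		var e rawEntry
		if err := json.Unmarshal([]byte(line), &e); err != nil {
			return nil, fmt.Errorf("decode line: %w", err)
		}
		if e.Type != "assistant" || e.Message == nil || e.Message.ID == "" || e.Message.Usage == nil {
			continue
		}
		if i, seen := index[e.Message.ID]; seen {
			out[i] = *e.Message // a later entry replaces the earlier one
			continue
		}
		index[e.Message.ID] = len(out)
		out = append(out, *e.Message)
	}
	return out, nil
}

func main() {
	lines := []string{
		`{"type":"assistant","message":{"id":"msg_a","model":"example-model","usage":{"input_tokens":10,"output_tokens":1}}}`,
		`{"type":"user"}`,
		`{"type":"assistant","message":{"id":"msg_b","model":"example-model","usage":{"input_tokens":7,"output_tokens":5}}}`,
		`{"type":"assistant","message":{"id":"msg_a","model":"example-model","usage":{"input_tokens":10,"output_tokens":42}}}`,
	}
	calls, err := dedupeAssistant(lines)
	if err != nil {
		panic(err)
	}
	for _, c := range calls {
		fmt.Println(c.ID, c.Usage.OutputTokens)
	}
}
```

Overwriting in place rather than appending means a repeated ID cannot
be counted twice when token totals are summed afterwards.
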
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
teernisse
2026-02-19 13:01:11 -05:00
parent 8984d5062d
commit ad484a2a6f
3 changed files with 546 additions and 0 deletions

internal/source/types.go Normal file
@@ -0,0 +1,58 @@
package source

// RawEntry represents a single line in a Claude Code JSONL session file.
type RawEntry struct {
	Type      string      `json:"type"`
	Subtype   string      `json:"subtype,omitempty"`
	Timestamp string      `json:"timestamp,omitempty"`
	SessionID string      `json:"sessionId,omitempty"`
	Cwd       string      `json:"cwd,omitempty"`
	Version   string      `json:"version,omitempty"`
	Message   *RawMessage `json:"message,omitempty"`

	// For system entries with subtype "turn_duration"
	DurationMs int64 `json:"durationMs,omitempty"`

	// For progress entries with turn_duration data
	Data *RawProgressData `json:"data,omitempty"`
}

// RawProgressData holds typed progress data from system/progress entries.
type RawProgressData struct {
	Type       string `json:"type"`
	DurationMs int64  `json:"durationMs,omitempty"`
}

// RawMessage represents the assistant's message envelope.
type RawMessage struct {
	ID    string    `json:"id"`
	Role  string    `json:"role"`
	Model string    `json:"model"`
	Usage *RawUsage `json:"usage,omitempty"`
}

// RawUsage holds token counts from the API response.
type RawUsage struct {
	InputTokens              int64          `json:"input_tokens"`
	OutputTokens             int64          `json:"output_tokens"`
	CacheCreationInputTokens int64          `json:"cache_creation_input_tokens"`
	CacheReadInputTokens     int64          `json:"cache_read_input_tokens"`
	CacheCreation            *CacheCreation `json:"cache_creation,omitempty"`
	ServiceTier              string         `json:"service_tier"`
}

// CacheCreation holds the breakdown of cache write tokens by TTL bucket.
type CacheCreation struct {
	Ephemeral5mInputTokens int64 `json:"ephemeral_5m_input_tokens"`
	Ephemeral1hInputTokens int64 `json:"ephemeral_1h_input_tokens"`
}

// DiscoveredFile represents a JSONL file found during directory scanning.
type DiscoveredFile struct {
	Path          string
	Project       string // decoded display name (e.g., "gitlore")
	ProjectDir    string // raw directory name
	SessionID     string // extracted from filename
	IsSubagent    bool
	ParentSession string // for subagents: parent session UUID
}
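
To show how the DiscoveredFile fields could be populated, here is a rough sketch of the path classification and project-name decoding that the commit message describes for source/scanner.go. It is not the code from that file. The classify helper, the fallback to the last path segment when no marker matches, and the choice to keep the "agent-<id>" file stem as SessionID are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// DiscoveredFile mirrors the struct in source/types.go.
type DiscoveredFile struct {
	Path          string
	Project       string
	ProjectDir    string
	SessionID     string
	IsSubagent    bool
	ParentSession string
}

// Parent markers listed in the commit message.
var parentMarkers = map[string]bool{
	"projects": true, "repos": true, "src": true,
	"code": true, "workspace": true, "dev": true,
}

// decodeProjectName returns the segments after the last parent marker,
// rejoined with hyphens. Assumption: with no usable marker, fall back to
// the final segment.
func decodeProjectName(dir string) string {
	parts := strings.Split(strings.Trim(dir, "-"), "-")
	last := -1
	for i, p := range parts {
		if parentMarkers[p] {
			last = i
		}
	}
	if last >= 0 && last < len(parts)-1 {
		return strings.Join(parts[last+1:], "-")
	}
	return parts[len(parts)-1]
}

// classify (hypothetical) maps a path under root to a DiscoveredFile.
// It accepts <project>/<session>.jsonl and
// <project>/<session>/subagents/agent-<id>.jsonl, and rejects anything else.
func classify(root, path string) (DiscoveredFile, bool) {
	rel, err := filepath.Rel(root, path)
	if err != nil || filepath.Ext(rel) != ".jsonl" {
		return DiscoveredFile{}, false
	}
	parts := strings.Split(filepath.ToSlash(rel), "/")
	if parts[0] == ".." {
		return DiscoveredFile{}, false // path is outside root
	}
	name := strings.TrimSuffix(parts[len(parts)-1], ".jsonl")
	f := DiscoveredFile{
		Path:       path,
		Project:    decodeProjectName(parts[0]),
		ProjectDir: parts[0],
		SessionID:  name,
	}
	switch {
	case len(parts) == 2:
		return f, true
	case len(parts) == 4 && parts[2] == "subagents" && strings.HasPrefix(name, "agent-"):
		f.IsSubagent = true
		f.ParentSession = parts[1]
		return f, true
	}
	return DiscoveredFile{}, false
}

func main() {
	root := "/home/u/.claude/projects"
	f, ok := classify(root, root+"/-Users-alice-projects-gitlore/s1/subagents/agent-7.jsonl")
	fmt.Println(ok, f.Project, f.IsSubagent, f.ParentSession)
}
```

One limitation of marker-based decoding is visible here: a project whose own name ends in a marker word (for example "my-code") cannot be distinguished from a parent directory, so this sketch would return only the last segment.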