Taylor Eernisse 11fe02fac9 docs: add proposed code file reorganization plan

Planning document for the ongoing test extraction and code organization
effort. Covers module-by-module analysis, proposed file splits, and
phased execution plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-13 10:54:56 -05:00

Proposed Code File Reorganization Plan

Executive Summary

The codebase is 79 Rust source files / 46K lines across 7 top-level modules. Most modules (gitlab/, embedding/, search/, documents/, ingestion/) are well-organized. The pain points are:

core/ is a grab-bag — 22 files mixing infrastructure, domain logic, DB operations, and an entire timeline pipeline
main.rs is 2713 lines — ~30 handler functions that bridge CLI args to commands
cli/mod.rs is 949 lines — every clap argument struct is packed into one file
Giant command files are unwieldy (who.rs at 6067 lines, list.rs at 2931 lines)

This plan is organized into three tiers based on impact-to-risk ratio. Tier 1 changes are "no-brainers" — they reduce confusion with minimal import churn. Tier 2 changes are valuable but involve more cross-cutting import updates. Tier 3 changes are "maybe later" — they'd be nice but the juice might not be worth the squeeze right now.

Current Structure (Annotated)

src/
├── main.rs              (2713 lines) ← dispatch + ~30 handler functions + error helpers
├── lib.rs               (9 lines)
├── cli/
│   ├── mod.rs           (949 lines)  ← ALL clap arg structs crammed here
│   ├── autocorrect.rs   (945 lines)
│   ├── progress.rs      (92 lines)
│   ├── robot.rs         (111 lines)
│   └── commands/
│       ├── mod.rs       (50 lines) — re-exports
│       ├── auth_test.rs
│       ├── count.rs     (406 lines)
│       ├── doctor.rs    (576 lines)
│       ├── drift.rs     (642 lines)
│       ├── embed.rs
│       ├── generate_docs.rs (320 lines)
│       ├── ingest.rs    (1064 lines)
│       ├── init.rs      (174 lines)
│       ├── list.rs      (2931 lines) ← handles issues, MRs, AND notes listing
│       ├── search.rs    (418 lines)
│       ├── show.rs      (1377 lines)
│       ├── stats.rs     (505 lines)
│       ├── sync_status.rs (454 lines)
│       ├── sync.rs      (576 lines)
│       ├── timeline.rs  (488 lines)
│       └── who.rs       (6067 lines) ← 5 sub-modes: expert, workload, active, overlap, reviews
├── core/
│   ├── mod.rs           (25 lines)
│   ├── backoff.rs       ← retry logic (used by ingestion)
│   ├── config.rs        (789 lines) ← configuration types
│   ├── db.rs            (970 lines) ← connection + 22 migrations
│   ├── dependent_queue.rs (330 lines) ← job queue (used by ingestion orchestrator)
│   ├── error.rs         (295 lines) ← error enum + exit codes
│   ├── events_db.rs     (199 lines) ← resource event upserts (used by ingestion)
│   ├── lock.rs          (228 lines) ← filesystem sync lock
│   ├── logging.rs       (179 lines) ← tracing filter builders
│   ├── metrics.rs       (566 lines) ← tracing-based stage timing
│   ├── note_parser.rs   (563 lines) ← cross-ref extraction from note bodies
│   ├── paths.rs         ← config/db/log file path resolution
│   ├── payloads.rs      (204 lines) ← raw JSON payload storage
│   ├── project.rs       (274 lines) ← fuzzy project resolution from DB
│   ├── references.rs    (551 lines) ← entity cross-reference extraction
│   ├── shutdown.rs      ← graceful shutdown via tokio signal
│   ├── sync_run.rs      (218 lines) ← sync run recording to DB
│   ├── time.rs          ← time conversion utilities
│   ├── timeline.rs      (284 lines) ← timeline types + EntityRef
│   ├── timeline_collect.rs (695 lines) ← Stage 4: collect events from DB
│   ├── timeline_expand.rs (557 lines) ← Stage 3: expand via cross-refs
│   └── timeline_seed.rs (552 lines) ← Stage 1: FTS search seeding
├── documents/           ← well-organized, 3 focused files
├── embedding/           ← well-organized, 6 focused files
├── gitlab/              ← well-organized, with transformers/ subdir
├── ingestion/           ← well-organized, 8 focused files
└── search/              ← well-organized, 5 focused files

Tier 1: No-Brainers (Do First)

1.1 Extract `timeline/` from `core/`

What: Move the 4 timeline files into their own top-level module src/timeline/.

Current location:

core/timeline.rs (284 lines) — types: EntityRef, ExpandedEntityRef, TimelineEvent, TimelineEventType, etc.
core/timeline_seed.rs (552 lines) — Stage 1: FTS-based seeding
core/timeline_expand.rs (557 lines) — Stage 3: cross-reference expansion
core/timeline_collect.rs (695 lines) — Stage 4: event collection from DB

New structure:

src/timeline/
├── mod.rs       ← types (from timeline.rs) + re-exports
├── seed.rs      ← from timeline_seed.rs
├── expand.rs    ← from timeline_expand.rs
└── collect.rs   ← from timeline_collect.rs

Rationale: These 4 files hold the shared types and three stages (seed, expand, collect) of a cohesive 5-stage pipeline (SEED→HYDRATE→EXPAND→COLLECT→RENDER). They have nothing to do with "core" infrastructure like db.rs, config.rs, or error.rs. They only import from core::error, core::time, and search::fts, all of which remain accessible via crate::core::* and crate::search::* after the move.

Import changes needed:

cli/commands/timeline.rs: use crate::core::timeline::* → use crate::timeline::*, same for timeline_seed, timeline_expand, timeline_collect
core/mod.rs: remove the 4 pub mod timeline* lines
lib.rs: add pub mod timeline;

Risk: LOW — Only 1 consumer (cli/commands/timeline.rs) + internal cross-references between the 4 files.
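To make the target layout concrete, here is a minimal single-file sketch in which inline modules stand in for src/timeline/mod.rs, seed.rs, expand.rs, and collect.rs. The type and function names come from this plan; the struct fields, signatures, and function bodies are placeholders I made up for illustration, not the real code.

```rust
// Inline modules stand in for src/timeline/{mod,seed,expand,collect}.rs.
pub mod timeline {
    // Types that would move from core/timeline.rs into timeline/mod.rs.
    // The fields are placeholders, not the real definitions.
    #[derive(Debug, Clone, PartialEq)]
    pub struct EntityRef {
        pub kind: String,
        pub iid: i64,
    }

    #[derive(Debug, Clone, PartialEq)]
    pub struct TimelineEvent {
        pub entity: EntityRef,
        pub summary: String,
    }

    pub mod seed {
        // Was `super::timeline::EntityRef` while the file lived in core/.
        use super::EntityRef;

        // Placeholder body: the real stage runs an FTS query.
        pub fn seed_timeline(query: &str) -> Vec<EntityRef> {
            if query.is_empty() {
                Vec::new()
            } else {
                vec![EntityRef { kind: "issue".to_string(), iid: 1 }]
            }
        }
    }

    pub mod expand {
        use super::EntityRef;

        // Placeholder body: no cross-references are followed here.
        pub fn expand_timeline(seeds: Vec<EntityRef>) -> Vec<EntityRef> {
            seeds
        }
    }

    pub mod collect {
        use super::{EntityRef, TimelineEvent};

        // Placeholder body: one synthetic event per entity.
        pub fn collect_events(entities: &[EntityRef]) -> Vec<TimelineEvent> {
            entities
                .iter()
                .map(|e| TimelineEvent {
                    entity: e.clone(),
                    summary: format!("{} #{}", e.kind, e.iid),
                })
                .collect()
        }
    }

    // Re-exports so callers can write `crate::timeline::seed_timeline`.
    pub use self::collect::collect_events;
    pub use self::expand::expand_timeline;
    pub use self::seed::seed_timeline;
}

// Roughly what cli/commands/timeline.rs would import after the move.
use self::timeline::{collect_events, expand_timeline, seed_timeline};

pub fn run_pipeline(query: &str) -> Vec<String> {
    let seeds = seed_timeline(query);
    let expanded = expand_timeline(seeds);
    collect_events(&expanded).into_iter().map(|e| e.summary).collect()
}
```

With the re-exports in mod.rs, the consumer needs a single use line instead of four, which is most of the payoff of this move.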

1.2 Extract `xref/` (cross-reference extraction) from `core/`

What: Move note_parser.rs and references.rs into src/xref/.

Current location:

core/note_parser.rs (563 lines) — parses note bodies for "mentioned in group/repo#123" patterns, persists to note_cross_references table
core/references.rs (551 lines) — extracts entity references from state events and closing MRs, writes to entity_references table

New structure:

src/xref/
├── mod.rs           ← re-exports
├── note_parser.rs   ← from core/note_parser.rs
└── references.rs    ← from core/references.rs

Rationale: These files implement a specific domain concept — extracting and persisting cross-references between issues and MRs. They are not "core infrastructure." They're consumed by ingestion/orchestrator.rs for the cross-reference extraction phase, and the data they produce is consumed by the timeline pipeline. Putting them in their own module makes the data flow clearer: ingestion → xref → timeline.

Import changes needed:

ingestion/orchestrator.rs: use crate::core::references::* → use crate::xref::references::*
ingestion/orchestrator.rs: use crate::core::note_parser::* (if used directly — needs verification) → use crate::xref::*
core/mod.rs: remove pub mod note_parser; pub mod references;
lib.rs: add pub mod xref;
Internal: the files use super::error::Result and super::time::now_ms which become crate::core::error::Result and crate::core::time::now_ms

Risk: LOW — 2-3 consumers at most. The files already use super:: internally which just needs updating to crate::core::.
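The super:: to crate::core:: rewrite is the only non-obvious part of this move, so here is a small sketch of it. Inline modules stand in for src/core/ and src/xref/. Result and now_ms are named in this plan; LoreError, CrossRef, parse_mention, and all bodies are hypothetical stand-ins.

```rust
// Inline modules stand in for src/core/ and src/xref/ after the move.
pub mod core {
    pub mod error {
        use std::result;

        // Hypothetical error type; the real enum lives in core/error.rs.
        pub struct LoreError(pub String);

        pub type Result<T> = result::Result<T, LoreError>;
    }

    pub mod time {
        use std::time::{SystemTime, UNIX_EPOCH};

        // Assumed behavior: milliseconds since the Unix epoch.
        pub fn now_ms() -> i64 {
            SystemTime::now()
                .duration_since(UNIX_EPOCH)
                .map(|d| d.as_millis() as i64)
                .unwrap_or(0)
        }
    }
}

pub mod xref {
    pub mod note_parser {
        // Before the move these read `super::error::Result` and
        // `super::time::now_ms`; `super` no longer points at core/.
        use crate::core::error::{LoreError, Result};
        use crate::core::time::now_ms;

        // Hypothetical shape of one extracted cross-reference.
        pub struct CrossRef {
            pub project: String,
            pub iid: i64,
            pub seen_at_ms: i64,
        }

        // Toy parser for the "mentioned in group/repo#123" pattern.
        pub fn parse_mention(body: &str) -> Result<CrossRef> {
            let rest = body
                .strip_prefix("mentioned in ")
                .ok_or_else(|| LoreError("not a mention".to_string()))?;
            let (project, iid) = rest
                .split_once('#')
                .ok_or_else(|| LoreError("missing '#'".to_string()))?;
            let iid = iid
                .trim()
                .parse::<i64>()
                .map_err(|e| LoreError(e.to_string()))?;
            Ok(CrossRef {
                project: project.to_string(),
                iid,
                seen_at_ms: now_ms(),
            })
        }
    }
}
```

Absolute crate::core:: paths keep working wherever the file ends up, so a later move would not need another round of import edits.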

Tier 2: Good Improvements (Do After Tier 1)

2.1 Group ingestion-adjacent DB operations

What: Move events_db.rs, dependent_queue.rs, payloads.rs, and sync_run.rs from core/ into ingestion/, since all four exist to support the ingestion pipeline (even where their only direct callers are CLI commands).

Current consumers:

events_db.rs → only used by cli/commands/count.rs (for event counts)
dependent_queue.rs → only used by ingestion/orchestrator.rs and main.rs (to release locked jobs)
payloads.rs → only used by ingestion/discussions.rs, ingestion/issues.rs, ingestion/merge_requests.rs, ingestion/mr_discussions.rs
sync_run.rs → only used by cli/commands/sync.rs and cli/commands/sync_status.rs

New structure:

src/ingestion/
├── (existing files...)
├── events_db.rs       ← from core/events_db.rs
├── dependent_queue.rs ← from core/dependent_queue.rs
├── payloads.rs        ← from core/payloads.rs
└── sync_run.rs        ← from core/sync_run.rs

Rationale: All 4 files exist to support the ingestion pipeline:

events_db.rs upserts resource state/label/milestone events fetched during ingestion
dependent_queue.rs manages the job queue that drives incremental discussion fetching
payloads.rs stores the raw JSON payloads fetched from GitLab
sync_run.rs records when syncs start/finish and their metrics

When you're looking for "how does ingestion work?", you'd naturally look in ingestion/. Having these scattered in core/ requires knowing the hidden dependency.

Import changes needed:

events_db.rs: 1 consumer in cli/commands/count.rs changes from crate::core::events_db → crate::ingestion::events_db
dependent_queue.rs: 2 consumers — ingestion/orchestrator.rs (becomes super::dependent_queue) and main.rs
payloads.rs: 4 consumers in ingestion/*.rs (become super::payloads)
sync_run.rs: 2 consumers in cli/commands/sync.rs and sync_status.rs
Internal references change from super::error / super::time to crate::core::error / crate::core::time

Risk: MEDIUM — More import changes, but all straightforward. The internal super:: references need the most attention.

Alternatively: If moving feels like too much churn, a lighter option is to create core/ingestion_db.rs that re-exports from these 4 files, making the grouping visible without moving files. But I think the move is cleaner.
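For comparison, the lighter re-export option could look roughly like the sketch below. Only two of the four files are shown. release_all_locked_jobs is named in this plan, but its signature here, the payload_size helper, and both bodies are placeholders.

```rust
// Inline modules stand in for files under src/core/.
pub mod core {
    pub mod dependent_queue {
        // Name from the plan; signature and body are placeholders.
        pub fn release_all_locked_jobs(locked: &mut Vec<u32>) -> usize {
            let released = locked.len();
            locked.clear();
            released
        }
    }

    pub mod payloads {
        // Hypothetical helper standing in for raw payload storage.
        pub fn payload_size(raw_json: &str) -> usize {
            raw_json.len()
        }
    }

    // Facade: makes the ingestion grouping visible without moving files.
    pub mod ingestion_db {
        pub use super::dependent_queue;
        pub use super::payloads;
    }
}
```

Callers could then reach the helpers through crate::core::ingestion_db::payloads while the old crate::core::payloads path keeps compiling, which is why this option has almost no import churn.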

2.2 Split `cli/mod.rs` — move arg structs to their command files

What: Move each *Args struct from cli/mod.rs into the corresponding cli/commands/*.rs file. Keep Cli struct, Commands enum, and detect_robot_mode_from_env() in cli/mod.rs.

Currently cli/mod.rs (949 lines) contains:

Cli struct (81 lines) — the root clap parser
Commands enum (193 lines) — all subcommand variants
IssuesArgs (86 lines) → move to commands/list.rs or stay near issues handling
MrsArgs (93 lines) → move to commands/list.rs or stay near MRs handling
NotesArgs (99 lines) → move to commands/list.rs
IngestArgs (33 lines) → move to commands/ingest.rs
StatsArgs (19 lines) → move to commands/stats.rs
SearchArgs (58 lines) → move to commands/search.rs
GenerateDocsArgs (9 lines) → move to commands/generate_docs.rs
SyncArgs (39 lines) → move to commands/sync.rs
EmbedArgs (15 lines) → move to commands/embed.rs
TimelineArgs (53 lines) → move to commands/timeline.rs
WhoArgs (76 lines) → move to commands/who.rs
CountArgs (9 lines) → move to commands/count.rs

After refactoring, cli/mod.rs shrinks to ~300 lines (just Cli + Commands + the inlined variants like Init, Drift, Backup, Reset).

Rationale: When adding a new flag to the who command, you currently have to edit cli/mod.rs (the args struct), cli/commands/who.rs (the implementation), and main.rs (the dispatch). If the args struct lives in commands/who.rs, you only need two files. This is the standard pattern in mature clap-based Rust CLIs.

Import changes needed:

main.rs currently does use lore::cli::{..., WhoArgs, ...} — these would become use lore::cli::commands::{..., WhoArgs, ...} or the commands/mod.rs re-exports them
Each commands/*.rs gets its own #[derive(Parser)] struct
Commands enum in cli/mod.rs keeps using the types but imports from commands::*

Risk: MEDIUM — Lots of use path changes in main.rs, but purely mechanical. No logic changes.
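A rough sketch of the resulting module shape follows. To keep it self-contained it leaves out the clap derive attributes the real structs would carry, and the struct fields, run_who, run_count, and dispatch are all hypothetical. Only the names WhoArgs, CountArgs, and Commands come from this plan.

```rust
pub mod cli {
    pub mod commands {
        pub mod who {
            // The args struct now lives next to the code that consumes it.
            // Fields are invented for this sketch.
            #[derive(Debug, Clone, PartialEq)]
            pub struct WhoArgs {
                pub target: String,
                pub limit: usize,
            }

            pub fn run_who(args: &WhoArgs) -> String {
                format!("who {} (limit {})", args.target, args.limit)
            }
        }

        pub mod count {
            #[derive(Debug, Clone, PartialEq)]
            pub struct CountArgs {
                pub entity: String,
            }

            pub fn run_count(args: &CountArgs) -> String {
                format!("count {}", args.entity)
            }
        }

        // Re-exports so `cli::commands::WhoArgs` resolves for main.rs.
        pub use self::count::CountArgs;
        pub use self::who::WhoArgs;
    }

    use self::commands::{CountArgs, WhoArgs};

    // cli/mod.rs keeps only the enum and imports the arg types.
    pub enum Commands {
        Who(WhoArgs),
        Count(CountArgs),
    }
}

use self::cli::commands::{count, who};
use self::cli::Commands;

// Stand-in for the dispatch that currently lives in main.rs.
pub fn dispatch(cmd: &Commands) -> String {
    match cmd {
        Commands::Who(args) => who::run_who(args),
        Commands::Count(args) => count::run_count(args),
    }
}
```

Adding a flag to who would then touch commands/who.rs for both the field and its use, plus the dispatch site only if the handler signature changes.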

Tier 3: Consider Later

3.1 Split `main.rs` (2713 lines)

The problem: main.rs contains main(), ~30 handle_* functions, error handling, clap error formatting, fuzzy command matching, and the robot-docs JSON manifest (a 400+ line inline JSON literal).

Possible approach:

Extract handle_* functions into cli/dispatch.rs (the routing layer)
Extract error handling into cli/errors.rs
Extract handle_robot_docs + the JSON manifest into cli/robot_docs.rs
Keep main() in main.rs at ~150 lines (just the tracing setup + dispatch call)

Why Tier 3: This is the messiest split. The handler functions depend on the cli::commands::* functions AND the cli::robot::* helpers AND direct std::process::exit calls. Making this work cleanly requires careful thought about the error boundary between main.rs (binary) and lib.rs (library).

Risk: HIGH — Every handler function touches robot_mode, constructs its own timer, opens the DB, and manages error display. The boilerplate is high but consistent, so splitting would just move it around without reducing complexity.
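One possible shape for that error boundary, offered only as a starting point for discussion: handlers return a Result, and the binary alone turns errors into exit codes. CliError, its variants, the numeric codes, and both function names are assumptions made for this sketch and do not reflect the existing error enum.

```rust
// Hypothetical error type; the real enum and its exit codes live in core/error.rs.
#[derive(Debug, PartialEq)]
pub enum CliError {
    Config(String),
    Database(String),
}

impl CliError {
    // Assumed mapping; the actual exit codes are not listed in this plan.
    pub fn exit_code(&self) -> i32 {
        match self {
            CliError::Config(_) => 2,
            CliError::Database(_) => 3,
        }
    }
}

// Would live in cli/dispatch.rs: returns an error instead of exiting.
pub fn handle_count(db_path: &str) -> Result<String, CliError> {
    if db_path.is_empty() {
        return Err(CliError::Config("no database path".to_string()));
    }
    Ok(format!("counted rows in {}", db_path))
}

// Would stay in main.rs: the one place that maps a result to an exit code
// before calling std::process::exit.
pub fn exit_code_for(result: &Result<String, CliError>) -> i32 {
    match result {
        Ok(_) => 0,
        Err(e) => e.exit_code(),
    }
}
```

Under this shape the library side stays testable without spawning a process, but every handler that currently calls std::process::exit directly would need to be converted, which is part of why this stays in Tier 3.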

3.2 Split `cli/commands/who.rs` (6067 lines)

The problem: This file implements 5 distinct modes (expert, workload, active, overlap, reviews), each with its own query, scoring model, and output formatting. It also includes the time-decay scoring model (~500 lines) and per-MR detail breakdown logic.

Possible split:

src/cli/commands/who/
├── mod.rs         ← WhoRun dispatcher, shared types
├── expert.rs      ← expert mode (path-based file expertise lookup)
├── workload.rs    ← workload mode (user's assigned issues/MRs)
├── active.rs      ← active discussions mode
├── overlap.rs     ← file overlap between users
├── reviews.rs     ← review pattern analysis
└── scoring.rs     ← time-decay expert scoring model

Why Tier 3: The 5 modes share many helper functions, database connection patterns, and output formatting logic. Splitting would require carefully identifying the shared helpers and deciding where they live. The file is big but internally consistent — the modes use a shared dispatcher pattern and common types.
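If this split ever happens, the dispatcher plus shared scoring module might look something like the sketch below. It covers two of the five modes. The exponential half-life formula is my assumption (the plan only says "time-decay"), and WhoMode, its fields, and every function here are hypothetical.

```rust
// Inline modules stand in for src/cli/commands/who/{mod,scoring,expert,workload}.rs.
pub mod who {
    pub mod scoring {
        // Assumed exponential half-life decay: weight halves every
        // `half_life_days`. The real model may differ.
        pub fn decayed_weight(age_days: f64, half_life_days: f64) -> f64 {
            0.5f64.powf(age_days / half_life_days)
        }
    }

    pub mod expert {
        use super::scoring::decayed_weight;

        // Sums decayed weights over one user's contributions (ages in days).
        pub fn expert_score(contribution_ages: &[f64], half_life_days: f64) -> f64 {
            contribution_ages
                .iter()
                .map(|&age| decayed_weight(age, half_life_days))
                .sum()
        }
    }

    pub mod workload {
        // Counts items that are still open.
        pub fn open_items(assigned_open: &[bool]) -> usize {
            assigned_open.iter().filter(|&&open| open).count()
        }
    }

    // Shared type that would stay in who/mod.rs.
    pub enum WhoMode {
        Expert { ages: Vec<f64>, half_life_days: f64 },
        Workload { assigned_open: Vec<bool> },
    }

    // Dispatcher: picks the mode and formats its result.
    pub fn run(mode: &WhoMode) -> String {
        match mode {
            WhoMode::Expert { ages, half_life_days } => {
                format!("expert score {:.2}", expert::expert_score(ages, *half_life_days))
            }
            WhoMode::Workload { assigned_open } => {
                format!("{} open items", workload::open_items(assigned_open))
            }
        }
    }
}
```

The hard part the sketch skips is exactly what makes this Tier 3: deciding which of the existing shared helpers belong in mod.rs, which in scoring.rs, and which get duplicated.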

3.3 Split `cli/commands/list.rs` (2931 lines)

The problem: This file handles issue listing, MR listing, AND note listing — three related but distinct operations with separate query builders, output formatters, and test suites.

Possible split:

src/cli/commands/
├── list_issues.rs   ← issue listing + query builder
├── list_mrs.rs      ← MR listing + query builder
├── list_notes.rs    ← note listing + query builder
└── list.rs          ← shared types (ListFilters, etc.) + re-exports

Why Tier 3: Same issue as who.rs — the three listing modes share query building patterns, field selection logic, and sorting code. Splitting requires identifying and extracting the shared pieces first.
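As an illustration of "extract the shared pieces first", the sketch below keeps a shared filter builder in list.rs and lets each per-entity file add only its own SELECT. ListFilters is named in this plan; its fields, the function names, and the table and column names are invented for the example.

```rust
// Inline modules stand in for cli/commands/{list,list_issues,list_mrs}.rs.
pub mod list {
    // Shared type that would stay in list.rs; fields are placeholders.
    pub struct ListFilters {
        pub state: Option<String>,
        pub limit: usize,
    }

    // Builds the WHERE/LIMIT tail shared by every listing mode.
    // Returns SQL with `?` placeholders plus the bound parameters.
    pub fn filter_clause(filters: &ListFilters) -> (String, Vec<String>) {
        let mut sql = String::new();
        let mut params = Vec::new();
        if let Some(state) = &filters.state {
            sql.push_str(" WHERE state = ?");
            params.push(state.clone());
        }
        sql.push_str(&format!(" LIMIT {}", filters.limit));
        (sql, params)
    }
}

pub mod list_issues {
    use super::list::{filter_clause, ListFilters};

    pub fn issues_query(filters: &ListFilters) -> (String, Vec<String>) {
        let (tail, params) = filter_clause(filters);
        (format!("SELECT iid, title FROM issues{}", tail), params)
    }
}

pub mod list_mrs {
    use super::list::{filter_clause, ListFilters};

    pub fn mrs_query(filters: &ListFilters) -> (String, Vec<String>) {
        let (tail, params) = filter_clause(filters);
        (format!("SELECT iid, title FROM merge_requests{}", tail), params)
    }
}
```

If the real query builders share less than this sketch assumes, the split would mostly duplicate code, which is the main reason to inventory the helpers before committing to it.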

Files NOT Recommended to Move

These files belong exactly where they are:

File	Why it belongs in `core/`
`config.rs`	Config types used by nearly everything
`db.rs`	Database connection + migrations — foundational
`error.rs`	Error types used by every module
`paths.rs`	File path resolution — infrastructure
`logging.rs`	Tracing setup — infrastructure
`lock.rs`	Filesystem sync lock — infrastructure
`shutdown.rs`	Graceful shutdown signal — infrastructure
`backoff.rs`	Retry math — infrastructure
`time.rs`	Time conversion — used everywhere
`metrics.rs`	Tracing metrics layer — infrastructure
`project.rs`	Fuzzy project resolution — used by 8+ consumers across modules

These files are legitimate "core infrastructure" used across multiple modules. Moving them would create import churn with no clarity gain.

Files NOT Recommended to Split/Merge

File	Why leave it alone
`documents/extractor.rs` (2341 lines)	One cohesive extractor per entity type — the size comes from per-type formatting logic, not mixed concerns
`ingestion/orchestrator.rs` (1703 lines)	Single orchestration flow — splitting would scatter the pipeline
`gitlab/graphql.rs` (1293 lines)	GraphQL client with adaptive paging — cohesive
`gitlab/client.rs` (851 lines)	REST client with all endpoints — cohesive
`cli/autocorrect.rs` (945 lines)	Correction registry + fuzzy matching — splitting gains nothing

Proposed Final Structure (Tiers 1+2)

src/
├── main.rs              (2713 lines — unchanged for now)
├── lib.rs               (adds: pub mod timeline; pub mod xref;)
├── cli/
│   ├── mod.rs           (~300 lines — Cli + Commands only, args moved out)
│   ├── autocorrect.rs   (unchanged)
│   ├── progress.rs      (unchanged)
│   ├── robot.rs         (unchanged)
│   └── commands/
│       ├── mod.rs       (re-exports + WhoArgs, IssuesArgs, etc.)
│       ├── (all existing files — unchanged but with args structs moved in)
│       └── ...
├── core/                (slimmed: 22 → 12 files, infrastructure only)
│   ├── mod.rs
│   ├── backoff.rs
│   ├── config.rs
│   ├── db.rs
│   ├── error.rs
│   ├── lock.rs
│   ├── logging.rs
│   ├── metrics.rs
│   ├── paths.rs
│   ├── project.rs
│   ├── shutdown.rs
│   └── time.rs
├── timeline/            (NEW — extracted from core/)
│   ├── mod.rs           (types from core/timeline.rs)
│   ├── seed.rs          (from core/timeline_seed.rs)
│   ├── expand.rs        (from core/timeline_expand.rs)
│   └── collect.rs       (from core/timeline_collect.rs)
├── xref/                (NEW — extracted from core/)
│   ├── mod.rs
│   ├── note_parser.rs   (from core/note_parser.rs)
│   └── references.rs    (from core/references.rs)
├── ingestion/           (gains 4 files from core/)
│   ├── (existing files...)
│   ├── events_db.rs     (from core/events_db.rs)
│   ├── dependent_queue.rs (from core/dependent_queue.rs)
│   ├── payloads.rs      (from core/payloads.rs)
│   └── sync_run.rs      (from core/sync_run.rs)
├── documents/           (unchanged)
├── embedding/           (unchanged)
├── gitlab/              (unchanged)
└── search/              (unchanged)

Import Change Tracking

Tier 1.1: Timeline extraction

Consumer file	Old import	New import
`cli/commands/timeline.rs:10-15`	`crate::core::timeline::*`	`crate::timeline::*`
`cli/commands/timeline.rs:13`	`crate::core::timeline_collect::collect_events`	`crate::timeline::collect_events` (or `crate::timeline::collect::collect_events`)
`cli/commands/timeline.rs:14`	`crate::core::timeline_expand::expand_timeline`	`crate::timeline::expand_timeline`
`cli/commands/timeline.rs:15`	`crate::core::timeline_seed::seed_timeline`	`crate::timeline::seed_timeline`
`core/timeline_seed.rs:7-8`	`super::timeline::*`	`super::*` (or `crate::timeline::*`, depending on structure)
`core/timeline_expand.rs:6`	`super::timeline::*`	`super::*`
`core/timeline_collect.rs:4`	`super::timeline::*`	`super::*`
`core/timeline_seed.rs:8`	`crate::search::*`	`crate::search::*` (no change)
`core/timeline_seed.rs:6-7`	`super::error::Result`	`crate::core::error::Result`
`core/timeline_expand.rs:5`	`super::error::Result`	`crate::core::error::Result`
`core/timeline_collect.rs:3`	`super::error::*`	`crate::core::error::*`

Tier 1.2: Cross-reference extraction

Consumer file	Old import	New import
`ingestion/orchestrator.rs:10-12`	`crate::core::references::*`	`crate::xref::references::*`
`core/note_parser.rs:7-8`	`super::error::Result`, `super::time::now_ms`	`crate::core::error::Result`, `crate::core::time::now_ms`
`core/references.rs:4-5`	`super::error::Result`, `super::time::now_ms`	`crate::core::error::Result`, `crate::core::time::now_ms`

Tier 2.1: Ingestion-adjacent DB ops

Consumer file	Old import	New import
`cli/commands/count.rs:9`	`crate::core::events_db::*`	`crate::ingestion::events_db::*`
`ingestion/orchestrator.rs:6-8`	`crate::core::dependent_queue::*`	`super::dependent_queue::*`
`main.rs:37`	`crate::core::dependent_queue::release_all_locked_jobs`	`crate::ingestion::dependent_queue::release_all_locked_jobs`
`ingestion/discussions.rs:7`	`crate::core::payloads::*`	`super::payloads::*`
`ingestion/issues.rs:9`	`crate::core::payloads::*`	`super::payloads::*`
`ingestion/merge_requests.rs:8`	`crate::core::payloads::*`	`super::payloads::*`
`ingestion/mr_discussions.rs:7`	`crate::core::payloads::*`	`super::payloads::*`
`cli/commands/sync.rs`	(uses `crate::core::sync_run::*`)	`crate::ingestion::sync_run::*`
`cli/commands/sync_status.rs`	(uses `crate::core::sync_run::` or `crate::core::metrics::`)	check and update
Internal: `events_db.rs:4-5`	`super::error::`, `super::time::`	`crate::core::error::`, `crate::core::time::`
Internal: `dependent_queue.rs:5-6`	`super::error::Result`, `super::time::now_ms`	`crate::core::error::Result`, `crate::core::time::now_ms`
Internal: `payloads.rs:9-10`	`super::error::Result`, `super::time::now_ms`	`crate::core::error::Result`, `crate::core::time::now_ms`
Internal: `sync_run.rs:2-4`	`super::error::`, `super::metrics::`, `super::time::*`	`crate::core::error::`, `crate::core::metrics::`, `crate::core::time::*`

Execution Order

Tier 1.1 — Extract timeline → src/timeline/ (LOW risk, 1 consumer)
Tier 1.2 — Extract xref → src/xref/ (LOW risk, 1-2 consumers)
Cargo check + clippy + test after each tier
Tier 2.1 — Move ingestion DB ops (MEDIUM risk, more consumers)
Cargo check + clippy + test
Tier 2.2 — Split cli/mod.rs args (MEDIUM risk, mostly mechanical)
Cargo check + clippy + test + fmt

Each tier should be its own commit for easy rollback.

What This Achieves

Before: A developer looking at core/ sees 22 files and has to mentally sort "infrastructure vs. domain logic vs. pipeline stage." The timeline pipeline is invisible unless you know to look in core/.

After:

core/ has 12 files, all clearly infrastructure (db, config, error, paths, logging, lock, shutdown, backoff, time, metrics, project)
timeline/ is a discoverable first-class module showing the 5-stage pipeline
xref/ makes the cross-reference extraction domain visible
ingestion/ contains everything related to data fetching: the orchestrator, entity ingestors, AND their supporting DB operations
cli/mod.rs is lean — just the top-level Cli struct and Commands enum

A new developer (or coding agent) can now answer "where is the timeline code?" → src/timeline/, "where is ingestion?" → src/ingestion/, "where is cross-reference extraction?" → src/xref/, without needing institutional knowledge.
