gitlore/PROPOSED_CODE_FILE_REORGANIZATION_PLAN.md
Taylor Eernisse 11fe02fac9 docs: add proposed code file reorganization plan
Planning document for the ongoing test extraction and code organization
effort. Covers module-by-module analysis, proposed file splits, and
phased execution plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 10:54:56 -05:00


Proposed Code File Reorganization Plan

Executive Summary

The codebase is 79 Rust source files / 46K lines across 7 top-level modules. Most modules (gitlab/, embedding/, search/, documents/, ingestion/) are well-organized. The pain points are:

  1. core/ is a grab-bag — 22 files mixing infrastructure, domain logic, DB operations, and an entire timeline pipeline
  2. main.rs is 2713 lines — ~30 handler functions that bridge CLI args to commands
  3. cli/mod.rs is 949 lines — every clap argument struct is packed into one file
  4. Giant command files: who.rs (6067 lines) and list.rs (2931 lines) are unwieldy

This plan is organized into three tiers based on impact-to-risk ratio. Tier 1 changes are "no-brainers" — they reduce confusion with minimal import churn. Tier 2 changes are valuable but involve more cross-cutting import updates. Tier 3 changes are "maybe later" — they'd be nice but the juice might not be worth the squeeze right now.


Current Structure (Annotated)

src/
├── main.rs              (2713 lines) ← dispatch + ~30 handler functions + error helpers
├── lib.rs               (9 lines)
├── cli/
│   ├── mod.rs           (949 lines)  ← ALL clap arg structs crammed here
│   ├── autocorrect.rs   (945 lines)
│   ├── progress.rs      (92 lines)
│   ├── robot.rs         (111 lines)
│   └── commands/
│       ├── mod.rs       (50 lines) — re-exports
│       ├── auth_test.rs
│       ├── count.rs     (406 lines)
│       ├── doctor.rs    (576 lines)
│       ├── drift.rs     (642 lines)
│       ├── embed.rs
│       ├── generate_docs.rs (320 lines)
│       ├── ingest.rs    (1064 lines)
│       ├── init.rs      (174 lines)
│       ├── list.rs      (2931 lines) ← handles issues, MRs, AND notes listing
│       ├── search.rs    (418 lines)
│       ├── show.rs      (1377 lines)
│       ├── stats.rs     (505 lines)
│       ├── sync_status.rs (454 lines)
│       ├── sync.rs      (576 lines)
│       ├── timeline.rs  (488 lines)
│       └── who.rs       (6067 lines) ← 5 sub-modes: expert, workload, active, overlap, reviews
├── core/
│   ├── mod.rs           (25 lines)
│   ├── backoff.rs       ← retry logic (used by ingestion)
│   ├── config.rs        (789 lines) ← configuration types
│   ├── db.rs            (970 lines) ← connection + 22 migrations
│   ├── dependent_queue.rs (330 lines) ← job queue (used by ingestion orchestrator)
│   ├── error.rs         (295 lines) ← error enum + exit codes
│   ├── events_db.rs     (199 lines) ← resource event upserts (used by ingestion)
│   ├── lock.rs          (228 lines) ← filesystem sync lock
│   ├── logging.rs       (179 lines) ← tracing filter builders
│   ├── metrics.rs       (566 lines) ← tracing-based stage timing
│   ├── note_parser.rs   (563 lines) ← cross-ref extraction from note bodies
│   ├── paths.rs         ← config/db/log file path resolution
│   ├── payloads.rs      (204 lines) ← raw JSON payload storage
│   ├── project.rs       (274 lines) ← fuzzy project resolution from DB
│   ├── references.rs    (551 lines) ← entity cross-reference extraction
│   ├── shutdown.rs      ← graceful shutdown via tokio signal
│   ├── sync_run.rs      (218 lines) ← sync run recording to DB
│   ├── time.rs          ← time conversion utilities
│   ├── timeline.rs      (284 lines) ← timeline types + EntityRef
│   ├── timeline_collect.rs (695 lines) ← Stage 4: collect events from DB
│   ├── timeline_expand.rs (557 lines) ← Stage 3: expand via cross-refs
│   └── timeline_seed.rs (552 lines) ← Stage 1: FTS search seeding
├── documents/           ← well-organized, 3 focused files
├── embedding/           ← well-organized, 6 focused files
├── gitlab/              ← well-organized, with transformers/ subdir
├── ingestion/           ← well-organized, 8 focused files
└── search/              ← well-organized, 5 focused files

Tier 1: No-Brainers (Do First)

1.1 Extract timeline/ from core/

What: Move the 4 timeline files into their own top-level module src/timeline/.

Current location:

  • core/timeline.rs (284 lines) — types: EntityRef, ExpandedEntityRef, TimelineEvent, TimelineEventType, etc.
  • core/timeline_seed.rs (552 lines) — Stage 1: FTS-based seeding
  • core/timeline_expand.rs (557 lines) — Stage 3: cross-reference expansion
  • core/timeline_collect.rs (695 lines) — Stage 4: event collection from DB

New structure:

src/timeline/
├── mod.rs       ← types (from timeline.rs) + re-exports
├── seed.rs      ← from timeline_seed.rs
├── expand.rs    ← from timeline_expand.rs
└── collect.rs   ← from timeline_collect.rs

Rationale: These 4 files hold the shared types and three stages (seed, expand, collect) of a cohesive 5-stage pipeline (SEED→HYDRATE→EXPAND→COLLECT→RENDER). They have nothing to do with "core" infrastructure like db.rs, config.rs, or error.rs. They only import from core::error, core::time, and search::fts, all of which remain accessible via crate::core::* and crate::search::* after the move.

Import changes needed:

  • cli/commands/timeline.rs: use crate::core::timeline::* → use crate::timeline::*, and likewise for timeline_seed, timeline_expand, timeline_collect
  • core/mod.rs: remove the 4 pub mod timeline* lines
  • lib.rs: add pub mod timeline;

Risk: LOW — Only 1 consumer (cli/commands/timeline.rs) + internal cross-references between the 4 files.
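
As a rough illustration of the target layout, the sketch below uses inline modules in place of the files under src/timeline/. Only the names EntityRef, seed_timeline, and expand_timeline come from this plan; the struct fields, function signatures, and bodies are placeholders assumed for the example, and run_timeline is a hypothetical stand-in for the consumer in cli/commands/timeline.rs.

```rust
// Inline modules stand in for the files under src/timeline/.
pub mod timeline {
    // mod.rs: shared types (formerly core/timeline.rs). Fields are hypothetical.
    #[derive(Debug, Clone, PartialEq)]
    pub struct EntityRef {
        pub entity_type: String,
        pub iid: i64,
    }

    // seed.rs (formerly core/timeline_seed.rs)
    pub mod seed {
        use super::EntityRef; // inside core/ this was `use super::timeline::*`

        /// Placeholder for FTS-based seeding: one fake seed per query word.
        pub fn seed_timeline(query: &str) -> Vec<EntityRef> {
            query
                .split_whitespace()
                .enumerate()
                .map(|(i, _word)| EntityRef {
                    entity_type: "issue".to_string(),
                    iid: i as i64 + 1,
                })
                .collect()
        }
    }

    // expand.rs (formerly core/timeline_expand.rs)
    pub mod expand {
        use super::EntityRef;

        /// Placeholder for cross-reference expansion: passes seeds through.
        pub fn expand_timeline(seeds: Vec<EntityRef>) -> Vec<EntityRef> {
            seeds
        }
    }

    // Re-exports let callers write `crate::timeline::seed_timeline`.
    pub use self::expand::expand_timeline;
    pub use self::seed::seed_timeline;
}

/// Hypothetical stand-in for the single consumer, cli/commands/timeline.rs.
pub fn run_timeline(query: &str) -> usize {
    timeline::expand_timeline(timeline::seed_timeline(query)).len()
}
```

With the re-exports in mod.rs, the consumer can use either the short path (crate::timeline::seed_timeline) or the full one (crate::timeline::seed::seed_timeline), which is the choice noted in the import tracking table.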


1.2 Extract xref/ (cross-reference extraction) from core/

What: Move note_parser.rs and references.rs into src/xref/.

Current location:

  • core/note_parser.rs (563 lines) — parses note bodies for "mentioned in group/repo#123" patterns, persists to note_cross_references table
  • core/references.rs (551 lines) — extracts entity references from state events and closing MRs, writes to entity_references table

New structure:

src/xref/
├── mod.rs           ← re-exports
├── note_parser.rs   ← from core/note_parser.rs
└── references.rs    ← from core/references.rs

Rationale: These files implement a specific domain concept — extracting and persisting cross-references between issues and MRs. They are not "core infrastructure." They're consumed by ingestion/orchestrator.rs for the cross-reference extraction phase, and the data they produce is consumed by the timeline pipeline. Putting them in their own module makes the data flow clearer: ingestion → xref → timeline.

Import changes needed:

  • ingestion/orchestrator.rs: use crate::core::references::* → use crate::xref::references::*
  • ingestion/orchestrator.rs: use crate::core::note_parser::* (if used directly — needs verification) → use crate::xref::*
  • core/mod.rs: remove pub mod note_parser; pub mod references;
  • lib.rs: add pub mod xref;
  • Internal: the files use super::error::Result and super::time::now_ms which become crate::core::error::Result and crate::core::time::now_ms

Risk: LOW — 2-3 consumers at most. The files already use super:: internally which just needs updating to crate::core::.
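
The internal path update can be sketched as follows. This is a minimal illustration, not the actual file contents: the core stubs, the CrossRef type, and the parse_note function are hypothetical, and the toy parser only recognizes a literal "mentioned in " prefix.

```rust
// Stand-ins for the modules that stay in core/. Bodies are hypothetical.
pub mod core {
    pub mod error {
        pub type Result<T> = std::result::Result<T, String>;
    }
    pub mod time {
        use std::time::{SystemTime, UNIX_EPOCH};

        /// Milliseconds since the Unix epoch (0 if the clock is before it).
        pub fn now_ms() -> i64 {
            SystemTime::now()
                .duration_since(UNIX_EPOCH)
                .map(|d| d.as_millis() as i64)
                .unwrap_or(0)
        }
    }
}

pub mod xref {
    // note_parser.rs after the move.
    pub mod note_parser {
        // Inside core/ these were `super::error::Result` and `super::time::now_ms`.
        use crate::core::error::Result;
        use crate::core::time::now_ms;

        /// Hypothetical result type for one extracted cross-reference.
        #[derive(Debug, PartialEq)]
        pub struct CrossRef {
            pub target: String,
            pub seen_at_ms: i64,
        }

        /// Toy parser: accepts only "mentioned in <target>" and timestamps the hit.
        pub fn parse_note(body: &str) -> Result<CrossRef> {
            let target = body
                .strip_prefix("mentioned in ")
                .ok_or_else(|| format!("no cross-reference in {:?}", body))?;
            Ok(CrossRef {
                target: target.to_string(),
                seen_at_ms: now_ms(),
            })
        }
    }

    // mod.rs re-export, so consumers can write `crate::xref::parse_note`.
    pub use self::note_parser::parse_note;
}
```

Once the file leaves core/, super:: would resolve to xref rather than core, which is why the two imports have to become absolute crate::core:: paths.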


Tier 2: Good Improvements (Do After Tier 1)

2.1 Group ingestion-adjacent DB operations

What: Move events_db.rs, dependent_queue.rs, payloads.rs, and sync_run.rs from core/ into ingestion/, since all four exist to support the ingestion pipeline (even where their direct callers are CLI commands).

Current consumers:

  • events_db.rs → only used by cli/commands/count.rs (for event counts)
  • dependent_queue.rs → only used by ingestion/orchestrator.rs and main.rs (to release locked jobs)
  • payloads.rs → only used by ingestion/discussions.rs, ingestion/issues.rs, ingestion/merge_requests.rs, ingestion/mr_discussions.rs
  • sync_run.rs → only used by cli/commands/sync.rs and cli/commands/sync_status.rs

New structure:

src/ingestion/
├── (existing files...)
├── events_db.rs       ← from core/events_db.rs
├── dependent_queue.rs ← from core/dependent_queue.rs
├── payloads.rs        ← from core/payloads.rs
└── sync_run.rs        ← from core/sync_run.rs

Rationale: All 4 files exist to support the ingestion pipeline:

  • events_db.rs upserts resource state/label/milestone events fetched during ingestion
  • dependent_queue.rs manages the job queue that drives incremental discussion fetching
  • payloads.rs stores the raw JSON payloads fetched from GitLab
  • sync_run.rs records when syncs start/finish and their metrics

When you're looking for "how does ingestion work?", you'd naturally look in ingestion/. Having these scattered in core/ requires knowing the hidden dependency.

Import changes needed:

  • events_db.rs: 1 consumer in cli/commands/count.rs changes from crate::core::events_db → crate::ingestion::events_db
  • dependent_queue.rs: 2 consumers — ingestion/orchestrator.rs (becomes super::dependent_queue) and main.rs
  • payloads.rs: 4 consumers in ingestion/*.rs (become super::payloads)
  • sync_run.rs: 2 consumers in cli/commands/sync.rs and sync_status.rs
  • Internal references change from super::error / super::time to crate::core::error / crate::core::time

Risk: MEDIUM — More import changes, but all straightforward. The internal super:: references need the most attention.
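
A small sketch of how the paths flip in both directions after the move, assuming a hypothetical store_payload function and a String-based Result alias (neither is taken from the real code): sibling modules inside ingestion/ switch to super::, while the moved file's own imports of core switch to crate::core::.

```rust
// Stand-in for what remains in core/. The alias is hypothetical.
pub mod core {
    pub mod error {
        pub type Result<T> = std::result::Result<T, String>;
    }
}

pub mod ingestion {
    // payloads.rs after the move from core/.
    pub mod payloads {
        // Was `super::error::Result` while the file lived in core/.
        use crate::core::error::Result;

        /// Placeholder for raw payload storage: returns the stored byte count.
        pub fn store_payload(json: &str) -> Result<usize> {
            if json.trim().is_empty() {
                return Err("empty payload".to_string());
            }
            Ok(json.len())
        }
    }

    // issues.rs, one of the four existing consumers.
    pub mod issues {
        // Was `crate::core::payloads::*` before the move.
        use super::payloads::store_payload;
        use crate::core::error::Result;

        pub fn ingest_issue(raw_json: &str) -> Result<usize> {
            store_payload(raw_json)
        }
    }
}
```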

Alternatively: If moving feels like too much churn, a lighter option is to create core/ingestion_db.rs that re-exports from these 4 files, making the grouping visible without moving files. But I think the move is cleaner.
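
For comparison, the lighter re-export option might look roughly like this. The function names and bodies are invented for the example, and only two of the four files are shown; the point is that ingestion_db adds a grouped path without changing where the code lives.

```rust
pub mod core {
    // Existing files stay where they are. Function bodies are hypothetical.
    pub mod payloads {
        pub fn store_payload(json: &str) -> usize {
            json.len()
        }
    }
    pub mod sync_run {
        pub fn start_run() -> u32 {
            1
        }
    }

    // core/ingestion_db.rs: a facade that only re-exports.
    pub mod ingestion_db {
        pub use super::payloads;
        pub use super::sync_run;
    }
}

/// Hypothetical consumer that goes through the grouped path.
pub fn record_sync(json: &str) -> (u32, usize) {
    use crate::core::ingestion_db::{payloads, sync_run};
    (sync_run::start_run(), payloads::store_payload(json))
}
```

The old paths (crate::core::payloads and so on) keep working under this option, so no consumer has to change, but the files still sit in core/.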


2.2 Split cli/mod.rs — move arg structs to their command files

What: Move each *Args struct from cli/mod.rs into the corresponding cli/commands/*.rs file. Keep Cli struct, Commands enum, and detect_robot_mode_from_env() in cli/mod.rs.

Currently cli/mod.rs (949 lines) contains:

  • Cli struct (81 lines) — the root clap parser
  • Commands enum (193 lines) — all subcommand variants
  • IssuesArgs (86 lines) → move to commands/list.rs or stay near issues handling
  • MrsArgs (93 lines) → move to commands/list.rs or stay near MRs handling
  • NotesArgs (99 lines) → move to commands/list.rs
  • IngestArgs (33 lines) → move to commands/ingest.rs
  • StatsArgs (19 lines) → move to commands/stats.rs
  • SearchArgs (58 lines) → move to commands/search.rs
  • GenerateDocsArgs (9 lines) → move to commands/generate_docs.rs
  • SyncArgs (39 lines) → move to commands/sync.rs
  • EmbedArgs (15 lines) → move to commands/embed.rs
  • TimelineArgs (53 lines) → move to commands/timeline.rs
  • WhoArgs (76 lines) → move to commands/who.rs
  • CountArgs (9 lines) → move to commands/count.rs

After refactoring, cli/mod.rs shrinks to ~300 lines (just Cli + Commands + the inlined variants like Init, Drift, Backup, Reset).

Rationale: When adding a new flag to the who command, you currently have to edit cli/mod.rs (the args struct), cli/commands/who.rs (the implementation), and main.rs (the dispatch). If the args struct lives in commands/who.rs, you only need two files. This is the standard pattern in mature clap-based Rust CLIs.

Import changes needed:

  • main.rs currently does use lore::cli::{..., WhoArgs, ...} — these would become use lore::cli::commands::{..., WhoArgs, ...} or the commands/mod.rs re-exports them
  • Each commands/*.rs gets its own #[derive(Parser)] struct
  • Commands enum in cli/mod.rs keeps using the types but imports from commands::*

Risk: MEDIUM — Lots of use path changes in main.rs, but purely mechanical. No logic changes.
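
The module shape after this split could be sketched as below. To keep the example self-contained it omits clap entirely, so the struct carries no derive attributes from that crate; the WhoArgs fields, run_who, and dispatch are hypothetical placeholders.

```rust
pub mod cli {
    pub mod commands {
        // commands/who.rs: args struct and implementation side by side.
        pub mod who {
            /// Arguments for `who`. The real struct would carry clap attributes;
            /// these fields are hypothetical.
            #[derive(Debug, Clone, PartialEq)]
            pub struct WhoArgs {
                pub path: String,
                pub limit: usize,
            }

            pub fn run_who(args: &WhoArgs) -> String {
                format!("top {} experts for {}", args.limit, args.path)
            }
        }

        // commands/mod.rs re-exports, so `cli::commands::WhoArgs` resolves.
        pub use self::who::{run_who, WhoArgs};
    }

    // cli/mod.rs keeps the top-level enum and imports the arg types.
    use self::commands::WhoArgs;

    #[derive(Debug)]
    pub enum Commands {
        Who(WhoArgs),
    }
}

/// Hypothetical stand-in for the dispatch in main.rs.
pub fn dispatch(cmd: cli::Commands) -> String {
    match cmd {
        cli::Commands::Who(args) => cli::commands::run_who(&args),
    }
}
```

Adding a field to WhoArgs in this layout touches only the who module, plus the dispatch site if the handler needs to pass it along.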


Tier 3: Consider Later

3.1 Split main.rs (2713 lines)

The problem: main.rs contains main(), ~30 handle_* functions, error handling, clap error formatting, fuzzy command matching, and the robot-docs JSON manifest (a 400+ line inline JSON literal).

Possible approach:

  • Extract handle_* functions into cli/dispatch.rs (the routing layer)
  • Extract error handling into cli/errors.rs
  • Extract handle_robot_docs + the JSON manifest into cli/robot_docs.rs
  • Keep main() in main.rs at ~150 lines (just the tracing setup + dispatch call)

Why Tier 3: This is the messiest split. The handler functions depend on the cli::commands::* functions AND the cli::robot::* helpers AND direct std::process::exit calls. Making this work cleanly requires careful thought about the error boundary between main.rs (binary) and lib.rs (library).

Risk: HIGH — Every handler function touches robot_mode, constructs its own timer, opens the DB, and manages error display. The boilerplate is high but consistent, so splitting would just move it around without reducing complexity.
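
One way to picture the error boundary in question is the sketch below, in which library-side handlers return a Result and only the binary converts errors into exit codes. Everything here is an assumption made for illustration: the LoreError variants, the exit code values, and the handle_count signature are not taken from the real code.

```rust
/// Hypothetical error enum with exit codes (the real one lives in core/error.rs).
#[derive(Debug, PartialEq)]
pub enum LoreError {
    Config(String),
    Db(String),
}

impl LoreError {
    /// Hypothetical mapping from error kind to process exit code.
    pub fn exit_code(&self) -> i32 {
        match self {
            LoreError::Config(_) => 2,
            LoreError::Db(_) => 3,
        }
    }
}

/// Library side (would live in cli/dispatch.rs): never calls process::exit.
pub fn handle_count(db_path: &str) -> Result<u64, LoreError> {
    if db_path.is_empty() {
        return Err(LoreError::Config("missing db path".to_string()));
    }
    Ok(42) // placeholder count
}

/// Binary side (main.rs): the single place that turns errors into exit codes.
pub fn exit_code_for<T>(result: &Result<T, LoreError>) -> i32 {
    match result {
        Ok(_) => 0,
        Err(e) => e.exit_code(),
    }
}
```

Under this shape, main() would end with something like std::process::exit(exit_code_for(&result)), and handlers that currently call std::process::exit directly would need to be reworked first, which is part of why this split is deferred.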


3.2 Split cli/commands/who.rs (6067 lines)

The problem: This file implements 5 distinct modes (expert, workload, active, overlap, reviews), each with its own query, scoring model, and output formatting. It also includes the time-decay scoring model (~500 lines) and per-MR detail breakdown logic.

Possible split:

src/cli/commands/who/
├── mod.rs         ← WhoRun dispatcher, shared types
├── expert.rs      ← expert mode (path-based file expertise lookup)
├── workload.rs    ← workload mode (user's assigned issues/MRs)
├── active.rs      ← active discussions mode
├── overlap.rs     ← file overlap between users
├── reviews.rs     ← review pattern analysis
└── scoring.rs     ← time-decay expert scoring model

Why Tier 3: The 5 modes share many helper functions, database connection patterns, and output formatting logic. Splitting would require carefully identifying the shared helpers and deciding where they live. The file is big but internally consistent — the modes use a shared dispatcher pattern and common types.
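
If this split is attempted, the dispatcher in who/mod.rs might take roughly the following shape. The WhoMode enum, the run functions, and their string output are placeholders, and the half-life formula in scoring is an assumed stand-in for the time-decay model, whose actual form is not described here.

```rust
pub mod who {
    /// The five sub-modes. The enum name is hypothetical.
    #[derive(Debug, Clone, Copy, PartialEq)]
    pub enum WhoMode {
        Expert,
        Workload,
        Active,
        Overlap,
        Reviews,
    }

    // scoring.rs: assumed exponential half-life decay, for illustration only.
    pub mod scoring {
        pub fn decayed(weight: f64, age_days: f64, half_life_days: f64) -> f64 {
            weight * 0.5_f64.powf(age_days / half_life_days)
        }
    }

    // One inline module per proposed file; bodies are placeholders.
    pub mod expert {
        use super::scoring::decayed;

        pub fn run(path: &str) -> String {
            format!("expert {} score={:.2}", path, decayed(1.0, 30.0, 30.0))
        }
    }
    pub mod workload {
        pub fn run(user: &str) -> String {
            format!("workload {}", user)
        }
    }
    pub mod active {
        pub fn run(scope: &str) -> String {
            format!("active {}", scope)
        }
    }
    pub mod overlap {
        pub fn run(path: &str) -> String {
            format!("overlap {}", path)
        }
    }
    pub mod reviews {
        pub fn run(user: &str) -> String {
            format!("reviews {}", user)
        }
    }

    // mod.rs: the shared dispatcher.
    pub fn run_who(mode: WhoMode, target: &str) -> String {
        match mode {
            WhoMode::Expert => expert::run(target),
            WhoMode::Workload => workload::run(target),
            WhoMode::Active => active::run(target),
            WhoMode::Overlap => overlap::run(target),
            WhoMode::Reviews => reviews::run(target),
        }
    }
}
```

The hard part is not visible in this sketch: helpers shared by several modes would need a home (mod.rs or a separate file) before the per-mode files can be this thin.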


3.3 Split cli/commands/list.rs (2931 lines)

The problem: This file handles issue listing, MR listing, AND note listing — three related but distinct operations with separate query builders, output formatters, and test suites.

Possible split:

src/cli/commands/
├── list_issues.rs   ← issue listing + query builder
├── list_mrs.rs      ← MR listing + query builder
├── list_notes.rs    ← note listing + query builder
└── list.rs          ← shared types (ListFilters, etc.) + re-exports

Why Tier 3: Same issue as who.rs — the three listing modes share query building patterns, field selection logic, and sorting code. Splitting requires identifying and extracting the shared pieces first.
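
A sketch of what "extracting the shared pieces first" could mean in practice: a shared filter type and WHERE-clause builder stay in list.rs, and each per-entity file composes its own query from them. The ListFilters fields, the table and column names, and the function names are all hypothetical.

```rust
// list.rs: shared types and helpers that every listing mode would reuse.
pub mod list {
    /// Hypothetical shared filter set.
    #[derive(Debug, Default, Clone)]
    pub struct ListFilters {
        pub state: Option<String>,
        pub author: Option<String>,
        pub limit: usize,
    }

    /// Builds a parameterized WHERE clause plus its bound values.
    pub fn where_clause(filters: &ListFilters) -> (String, Vec<String>) {
        let mut conds: Vec<&str> = Vec::new();
        let mut params: Vec<String> = Vec::new();
        if let Some(state) = &filters.state {
            conds.push("state = ?");
            params.push(state.clone());
        }
        if let Some(author) = &filters.author {
            conds.push("author_username = ?");
            params.push(author.clone());
        }
        let sql = if conds.is_empty() {
            String::new()
        } else {
            format!(" WHERE {}", conds.join(" AND "))
        };
        (sql, params)
    }
}

// list_issues.rs: entity-specific query built on the shared helper.
pub mod list_issues {
    use super::list::{where_clause, ListFilters};

    pub fn build_query(filters: &ListFilters) -> (String, Vec<String>) {
        let (where_sql, params) = where_clause(filters);
        let sql = format!(
            "SELECT iid, title FROM issues{} LIMIT {}",
            where_sql, filters.limit
        );
        (sql, params)
    }
}
```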


These files belong exactly where they are:

| File | Why it belongs in core/ |
|---|---|
| config.rs | Config types used by nearly everything |
| db.rs | Database connection + migrations (foundational) |
| error.rs | Error types used by every module |
| paths.rs | File path resolution (infrastructure) |
| logging.rs | Tracing setup (infrastructure) |
| lock.rs | Filesystem sync lock (infrastructure) |
| shutdown.rs | Graceful shutdown signal (infrastructure) |
| backoff.rs | Retry math (infrastructure) |
| time.rs | Time conversion, used everywhere |
| metrics.rs | Tracing metrics layer (infrastructure) |
| project.rs | Fuzzy project resolution, used by 8+ consumers across modules |

These files are legitimate "core infrastructure" used across multiple modules. Moving them would create import churn with no clarity gain.


| File | Why leave it alone |
|---|---|
| documents/extractor.rs (2341 lines) | One cohesive extractor per entity type; the size comes from per-type formatting logic, not mixed concerns |
| ingestion/orchestrator.rs (1703 lines) | Single orchestration flow; splitting would scatter the pipeline |
| gitlab/graphql.rs (1293 lines) | GraphQL client with adaptive paging (cohesive) |
| gitlab/client.rs (851 lines) | REST client with all endpoints (cohesive) |
| cli/autocorrect.rs (945 lines) | Correction registry + fuzzy matching; splitting gains nothing |

Proposed Final Structure (Tiers 1+2)

src/
├── main.rs              (2713 lines — unchanged for now)
├── lib.rs               (adds: pub mod timeline; pub mod xref;)
├── cli/
│   ├── mod.rs           (~300 lines — Cli + Commands only, args moved out)
│   ├── autocorrect.rs   (unchanged)
│   ├── progress.rs      (unchanged)
│   ├── robot.rs         (unchanged)
│   └── commands/
│       ├── mod.rs       (re-exports + WhoArgs, IssuesArgs, etc.)
│       ├── (all existing files — unchanged but with args structs moved in)
│       └── ...
├── core/                (slimmed: 22 → 12 files, infrastructure only)
│   ├── mod.rs
│   ├── backoff.rs
│   ├── config.rs
│   ├── db.rs
│   ├── error.rs
│   ├── lock.rs
│   ├── logging.rs
│   ├── metrics.rs
│   ├── paths.rs
│   ├── project.rs
│   ├── shutdown.rs
│   └── time.rs
├── timeline/            (NEW — extracted from core/)
│   ├── mod.rs           (types from core/timeline.rs)
│   ├── seed.rs          (from core/timeline_seed.rs)
│   ├── expand.rs        (from core/timeline_expand.rs)
│   └── collect.rs       (from core/timeline_collect.rs)
├── xref/                (NEW — extracted from core/)
│   ├── mod.rs
│   ├── note_parser.rs   (from core/note_parser.rs)
│   └── references.rs    (from core/references.rs)
├── ingestion/           (gains 4 files from core/)
│   ├── (existing files...)
│   ├── events_db.rs     (from core/events_db.rs)
│   ├── dependent_queue.rs (from core/dependent_queue.rs)
│   ├── payloads.rs      (from core/payloads.rs)
│   └── sync_run.rs      (from core/sync_run.rs)
├── documents/           (unchanged)
├── embedding/           (unchanged)
├── gitlab/              (unchanged)
└── search/              (unchanged)

Import Change Tracking

Tier 1.1: Timeline extraction

| Consumer file | Old import | New import |
|---|---|---|
| cli/commands/timeline.rs:10-15 | crate::core::timeline::* | crate::timeline::* |
| cli/commands/timeline.rs:13 | crate::core::timeline_collect::collect_events | crate::timeline::collect_events (or crate::timeline::collect::collect_events) |
| cli/commands/timeline.rs:14 | crate::core::timeline_expand::expand_timeline | crate::timeline::expand_timeline |
| cli/commands/timeline.rs:15 | crate::core::timeline_seed::seed_timeline | crate::timeline::seed_timeline |
| core/timeline_seed.rs:7-8 | super::timeline::* | super::* (or crate::timeline::* depending on structure) |
| core/timeline_expand.rs:6 | super::timeline::* | super::* |
| core/timeline_collect.rs:4 | super::timeline::* | super::* |
| core/timeline_seed.rs:8 | crate::search::* | crate::search::* (no change) |
| core/timeline_seed.rs:6-7 | super::error::Result | crate::core::error::Result |
| core/timeline_expand.rs:5 | super::error::Result | crate::core::error::Result |
| core/timeline_collect.rs:3 | super::error::* | crate::core::error::* |

Tier 1.2: Cross-reference extraction

| Consumer file | Old import | New import |
|---|---|---|
| ingestion/orchestrator.rs:10-12 | crate::core::references::* | crate::xref::references::* |
| core/note_parser.rs:7-8 | super::error::Result, super::time::now_ms | crate::core::error::Result, crate::core::time::now_ms |
| core/references.rs:4-5 | super::error::Result, super::time::now_ms | crate::core::error::Result, crate::core::time::now_ms |

Tier 2.1: Ingestion-adjacent DB ops

| Consumer file | Old import | New import |
|---|---|---|
| cli/commands/count.rs:9 | crate::core::events_db::* | crate::ingestion::events_db::* |
| ingestion/orchestrator.rs:6-8 | crate::core::dependent_queue::* | super::dependent_queue::* |
| main.rs:37 | crate::core::dependent_queue::release_all_locked_jobs | crate::ingestion::dependent_queue::release_all_locked_jobs |
| ingestion/discussions.rs:7 | crate::core::payloads::* | super::payloads::* |
| ingestion/issues.rs:9 | crate::core::payloads::* | super::payloads::* |
| ingestion/merge_requests.rs:8 | crate::core::payloads::* | super::payloads::* |
| ingestion/mr_discussions.rs:7 | crate::core::payloads::* | super::payloads::* |
| cli/commands/sync.rs | crate::core::sync_run::* | crate::ingestion::sync_run::* |
| cli/commands/sync_status.rs | crate::core::sync_run::* or crate::core::metrics::* (needs verification) | check and update |
| Internal: events_db.rs:4-5 | super::error::*, super::time::* | crate::core::error::*, crate::core::time::* |
| Internal: dependent_queue.rs:5-6 | super::error::Result, super::time::now_ms | crate::core::error::Result, crate::core::time::now_ms |
| Internal: payloads.rs:9-10 | super::error::Result, super::time::now_ms | crate::core::error::Result, crate::core::time::now_ms |
| Internal: sync_run.rs:2-4 | super::error::*, super::metrics::*, super::time::* | crate::core::error::*, crate::core::metrics::*, crate::core::time::* |

Execution Order

  1. Tier 1.1 — Extract timeline → src/timeline/ (LOW risk, 1 consumer)
  2. Tier 1.2 — Extract xref → src/xref/ (LOW risk, 2-3 consumers at most)
  3. Cargo check + clippy + test after each tier
  4. Tier 2.1 — Move ingestion DB ops (MEDIUM risk, more consumers)
  5. Cargo check + clippy + test
  6. Tier 2.2 — Split cli/mod.rs args (MEDIUM risk, mostly mechanical)
  7. Cargo check + clippy + test + fmt

Each tier should be its own commit for easy rollback.


What This Achieves

Before: A developer looking at core/ sees 22 files and has to mentally sort "infrastructure vs. domain logic vs. pipeline stage." The timeline pipeline is invisible unless you know to look in core/.

After:

  • core/ has 12 files, all clearly infrastructure (db, config, error, paths, logging, lock, shutdown, backoff, time, metrics, project)
  • timeline/ is a discoverable first-class module showing the 5-stage pipeline
  • xref/ makes the cross-reference extraction domain visible
  • ingestion/ contains everything related to data fetching: the orchestrator, entity ingestors, AND their supporting DB operations
  • cli/mod.rs is lean — just the top-level Cli struct and Commands enum

A new developer (or coding agent) can now answer "where is the timeline code?" → src/timeline/, "where is ingestion?" → src/ingestion/, "where is cross-reference extraction?" → src/xref/, without needing institutional knowledge.