Design Documentation

Deep dive into the technical design and architecture decisions behind Flux.


Overview

Flux is designed with three core principles:

  1. Adaptive by Default - Every parameter that could benefit from adaptation should be adaptive
  2. Native Performance - Direct integration with Megatron and SGLang, no abstraction overhead
  3. Simple Codebase - Less than 5,000 lines of core framework code

Documentation

  • Design Philosophy: The principles and decisions that shaped Flux

  • Technical Specification: Detailed technical specification document

  • Framework Comparison: How Flux compares to VERL, AReaL, and Slime

  • Roadmap: Future development plans


Key Design Decisions

Why Not Ray?

After analyzing VERL, AReaL, and Slime, we concluded that Ray adds unnecessary overhead for LLM training:

Ray Provides           LLM Training Needs        Mismatch
Task-level scheduling  NCCL collectives          Ray doesn't understand NCCL
Actor lifecycle        Fine-grained GPU control  Ray treats GPU as "count"
Flexible placement     Fixed TP/PP/DP config     Changing requires restart
Object Store           CUDA IPC / NCCL           Serialization is slow

Result: Both VERL and Slime bypass Ray for critical paths. If you're bypassing the framework, why use it?

Why Adaptive Async?

The sync vs async choice is typically presented as binary, but it's actually a spectrum:

Sync ◄──────────────────────────────────────────────► Async
      │                                              │
      │  Low throughput              High throughput │
      │  Low staleness               High staleness  │
      │  High stability              Low stability   │
      │                                              │
      └──────────────────────────────────────────────┘

Key insight: The optimal position on this spectrum changes during training:

  • Early training: Policy changing fast → need fresh data → more sync
  • Mid training: Balanced exploration/exploitation → balanced async
  • Late training: Fine-tuning → stable policy → can tolerate more async
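
One way to make this concrete is a staleness schedule keyed to training progress. The sketch below is hypothetical: the phase boundaries and target values are illustrative assumptions, not Flux defaults.

```python
# A hypothetical staleness schedule keyed to training progress
# (0.0 = start of training, 1.0 = end). Phase boundaries and target
# values are illustrative assumptions, not Flux defaults.
def target_staleness(progress: float) -> float:
    if progress < 0.3:    # early: policy changes fast, stay close to sync
        return 0.05
    elif progress < 0.7:  # mid: balanced exploration/exploitation
        return 0.2
    else:                 # late: stable policy tolerates more async
        return 0.4
```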

Why Unified Importance Correction?

Multiple sources of distribution shift:

  1. Staleness: Data from old policy versions
  2. Trajectory inconsistency: Mixed versions within trajectory
  3. Replay: Data reused from buffer

All are forms of the same problem: \(\pi_{current} \neq \pi_{behavior}\)

Solution: Single importance weight formula that handles all cases:

\[w = \frac{\pi_{current}(a|s)}{\pi_{behavior}(a|s)} \cdot \gamma^{version\_gap}\]
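
As a sketch, this weight can be computed per token from stored log-probabilities and the version gap of the policy that generated each token; the decay factor and the clipping bound below are illustrative assumptions rather than Flux's actual defaults.

```python
import torch

def importance_weights(logp_current: torch.Tensor,
                       logp_behavior: torch.Tensor,
                       version_gap: torch.Tensor,
                       gamma: float = 0.9,
                       clip: float = 5.0) -> torch.Tensor:
    """Per-token w = pi_current / pi_behavior * gamma ** version_gap."""
    ratio = torch.exp(logp_current - logp_behavior)  # pi_current / pi_behavior
    w = ratio * gamma ** version_gap                 # decay by policy version gap
    return w.clamp(max=clip)                         # guard against exploding weights
```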

Architecture Layers

┌─────────────────────────────────────────────────────────────┐
│                  Layer 3: Adaptive Control                   │
│  • PID Controller    • Smart Batch Composer                  │
│  • Staleness Monitor • APRIL Manager                         │
├─────────────────────────────────────────────────────────────┤
│                 Layer 2: Coordinator                         │
│  • asyncio event loop • ZeroMQ/gRPC comm                     │
│  • Weight sync orchestration • Checkpoint management         │
├─────────────────────────────────────────────────────────────┤
│                Layer 1: Native Engines                       │
│  • Megatron-LM (training)  • SGLang (inference)              │
│  • CUDA IPC (weight sync)  • NCCL (gradients)                │
└─────────────────────────────────────────────────────────────┘

Layer 1: Native Execution Engines

Direct integration with battle-tested systems:

  • Megatron-LM: Industry-standard distributed training
  • SGLang: State-of-the-art LLM serving
  • CUDA IPC: Zero-copy weight transfer
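
The sketch below is not Flux's weight-sync path, only an illustration of the zero-copy idea: torch.multiprocessing hands a CUDA tensor to another process via a CUDA IPC handle instead of serializing the data. The worker function and tensor shape are placeholders.

```python
import torch
import torch.multiprocessing as mp

def inference_worker(queue) -> None:
    # Receives a CUDA IPC handle; the underlying GPU memory is shared, not copied.
    weights = queue.get()
    print("got shard on", weights.device, "checksum", weights.sum().item())

def main() -> None:
    mp.set_start_method("spawn", force=True)
    queue = mp.Queue()
    worker = mp.Process(target=inference_worker, args=(queue,))
    worker.start()
    shard = torch.randn(1024, 1024, device="cuda")  # stand-in for a weight shard
    queue.put(shard)  # sends an IPC handle; keep `shard` alive until the worker is done
    worker.join()

if __name__ == "__main__":
    main()
```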

Layer 2: Lightweight Coordinator

Minimal orchestration layer:

  • Pure Python asyncio (no Ray overhead)
  • ZeroMQ for local communication
  • gRPC for remote communication
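
For illustration, a minimal sketch of what the coordinator's receive loop could look like with asyncio and pyzmq's asyncio bindings; the endpoint, message format, and handling logic are assumptions made for the example.

```python
import asyncio
import json

import zmq
import zmq.asyncio

async def coordinator_loop(endpoint: str = "tcp://127.0.0.1:5555") -> None:
    ctx = zmq.asyncio.Context()
    sock = ctx.socket(zmq.PULL)
    sock.bind(endpoint)  # rollout workers PUSH trajectories here
    try:
        while True:
            payload = json.loads(await sock.recv())
            # hand off to batch composition, weight-sync checks, etc.
            print("trajectory with", len(payload.get("tokens", [])), "tokens")
    finally:
        sock.close()
        ctx.term()

if __name__ == "__main__":
    asyncio.run(coordinator_loop())
```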

Layer 3: Adaptive Control Plane

The "brain" of Flux:

  • PID controller for async ratio
  • Staleness measurement and monitoring
  • Smart batch composition
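
A minimal sketch of the kind of PID loop the control plane could run over the async ratio; the gains, target, and clamping are illustrative assumptions, not Flux's actual defaults.

```python
class AsyncRatioPID:
    """Nudge the async ratio so that measured staleness tracks a target."""

    def __init__(self, target_staleness: float = 0.2,
                 kp: float = 0.5, ki: float = 0.05, kd: float = 0.1):
        self.target = target_staleness
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, measured_staleness: float, async_ratio: float) -> float:
        # Positive error => data too stale => back off toward sync.
        error = measured_staleness - self.target
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        correction = self.kp * error + self.ki * self.integral + self.kd * derivative
        return min(1.0, max(0.0, async_ratio - correction))
```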

Performance Targets

Metric           Target            Measurement
GPU Utilization  > 80%             nvidia-smi average
Throughput       2x VERL           samples/hour
Staleness        Mean < 0.2        Combined metric
KL Blow-up       < 5% of runs      Detection rate
Scaling          > 85% at 64 GPUs  Linear efficiency

Trade-offs

Simplicity vs Flexibility

We chose simplicity:

  • Fixed Megatron + SGLang stack (no pluggable backends)
  • Single deployment topology per run
  • Configuration over code for most use cases

Benefit: Easier to understand, debug, and maintain

Stability vs Throughput

We chose adaptive:

  • Neither pure sync nor pure async
  • Target staleness as control variable
  • Automatic adjustment based on training dynamics

Benefit: Best of both worlds for most workloads


Further Reading