Optimize Performance

Tips for maximizing training throughput.

Measure Baseline

Before tuning anything, record a baseline so you can verify that each change actually helps. Register a step callback to log throughput:

@trainer.add_step_callback
def log_perf(result):
    print(f"Throughput: {result.metrics.get('throughput', 0):.1f} samples/s")
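If the framework doesn't report a throughput metric directly, a rough figure can be derived from wall-clock time between steps. A minimal sketch; `ThroughputMeter` is a hypothetical helper, not part of the framework:

```python
import time

class ThroughputMeter:
    """Rough samples/s estimate from wall-clock time between steps."""

    def __init__(self):
        self.last = None

    def step(self, num_samples):
        now = time.perf_counter()
        if self.last is None:
            self.last = now   # first call just records the start time
            return None
        elapsed = now - self.last
        self.last = now
        return num_samples / elapsed

meter = ThroughputMeter()
meter.step(64)           # returns None; starts the clock
time.sleep(0.1)
rate = meter.step(64)    # ~640 samples/s for a ~0.1 s step
```

Measured this way, the number includes everything between optimizer steps (data loading, rollouts, sync), which is exactly what you want for end-to-end tuning.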

GPU Utilization

Watch GPU utilization while training; sustained dips usually point to a data-loading, rollout, or weight-sync bottleneck:

watch -n 1 nvidia-smi
# Target: >80% GPU utilization
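Utilization can also be sampled programmatically by parsing `nvidia-smi`'s CSV query output, which is easier to log alongside training metrics. A sketch; the parsing helper is ours, and it assumes one GPU per output line:

```python
import subprocess

def parse_utilization(csv_text):
    """Parse 'csv,noheader,nounits' output: one integer percent per GPU line."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]

def gpu_utilization():
    """Return per-GPU utilization percentages via nvidia-smi's CSV query."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_utilization(out)

# parse_utilization("87\n92\n") -> [87, 92]
```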

Key Optimizations

1. Increase Batch Size

Larger effective batches keep the GPU busier per optimizer step; gradient accumulation raises the effective size without extra activation memory:

batch_size: 64
gradient_accumulation_steps: 4  # Effective: 256
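The effective batch size in the comment above is the per-step micro-batch times the accumulation steps (times the data-parallel world size, if you scale out). A quick sanity check:

```python
def effective_batch_size(batch_size, grad_accum_steps, dp_world_size=1):
    """Optimizer-step batch size: micro-batch x accumulation x data-parallel ranks."""
    return batch_size * grad_accum_steps * dp_world_size

assert effective_batch_size(64, 4) == 256       # matches the config above
assert effective_batch_size(64, 4, 2) == 512    # doubled again across 2 DP ranks
```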

2. Tune Async Ratio

A higher async ratio overlaps rollout generation with training, at the cost of acting on slightly stale policy weights; the adaptive controller balances the two:

adaptive_async:
  target_staleness: 0.2
  max_async_ratio: 0.8
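One way such an adaptive controller can work is to nudge the async ratio toward whatever keeps measured staleness near the target, clamped to the configured maximum. This is a hypothetical sketch of the idea, not the framework's implementation:

```python
def adjust_async_ratio(ratio, measured_staleness,
                       target_staleness=0.2, max_async_ratio=0.8, step=0.05):
    """Raise the async ratio when staleness is under target, lower it when over."""
    if measured_staleness < target_staleness:
        ratio += step   # rollouts are fresh enough; overlap more work
    else:
        ratio -= step   # policy is getting stale; sync more often
    return min(max(ratio, 0.0), max_async_ratio)

# Low staleness pushes the ratio up; high staleness pulls it back down,
# and the result never exceeds max_async_ratio.
```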

3. Enable Flash Attention

Flash attention computes exact attention without materializing the full score matrix, cutting activation memory and speeding up long sequences:

megatron:
  use_flash_attention: true
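The savings scale quadratically with sequence length, because the score matrix flash attention avoids is seq_len x seq_len per head. A back-of-envelope estimate, not an exact profile:

```python
def naive_attention_score_bytes(batch, heads, seq_len, bytes_per_el=2):
    """Memory for the full attention score matrix that flash attention never materializes."""
    return batch * heads * seq_len * seq_len * bytes_per_el

# Batch 8, 32 heads, 4096-token sequences, bf16 (2 bytes/element):
gib = naive_attention_score_bytes(8, 32, 4096) / 2**30
# about 8 GiB of activation memory avoided per attention layer
```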

4. Use BF16

BF16 keeps FP32's 8-bit exponent range, so it trains stably without the loss scaling that FP16 requires:

megatron:
  bf16: true
  fp16: false
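The range difference falls directly out of the formats' exponent and mantissa widths. Worked out for an IEEE-style layout (5/10 bits for FP16, 8/7 bits for BF16):

```python
def max_normal(exp_bits, mantissa_bits):
    """Largest finite value of an IEEE-style float with the given field widths."""
    emax = 2 ** (exp_bits - 1) - 1
    return (2 - 2 ** -mantissa_bits) * 2.0 ** emax

fp16_max = max_normal(5, 10)  # 65504.0 -- gradients overflow this easily
bf16_max = max_normal(8, 7)   # ~3.39e38, same exponent range as fp32
```

BF16 pays for the range with precision (7 mantissa bits vs 10), which mixed-precision training tolerates well.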

5. Optimize Weight Sync

Delta sync transfers only weights that changed since the last push, and CUDA IPC shares them between colocated processes without a host round trip:

weight_sync:
  method: delta
  use_cuda_ipc: true
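The delta idea can be illustrated over plain dicts: only parameters that moved beyond a tolerance cross the wire, and the receiver patches them into its copy. A simplified sketch; the real mechanism and the CUDA IPC transport are framework-internal:

```python
def compute_delta(prev, curr, atol=0.0):
    """Parameters whose values moved more than atol since the last sync."""
    return {name: val for name, val in curr.items()
            if name not in prev or abs(val - prev[name]) > atol}

def apply_delta(weights, delta):
    """Patch the received delta into the local weight copy."""
    weights.update(delta)
    return weights

prev = {"w1": 1.0, "w2": 2.0, "w3": 3.0}
curr = {"w1": 1.0, "w2": 2.5, "w3": 3.0}
delta = compute_delta(prev, curr)        # only w2 changed, so only w2 is sent
synced = apply_delta(dict(prev), delta)  # receiver now matches curr
```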

6. Tune APRIL

Oversampling rollouts lets the batch fill from the fastest generations instead of waiting on stragglers; the timeout caps how long a step can wait:

rollout:
  oversample_ratio: 1.5
  batch_timeout: 20.0
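The scheme amounts to launching more rollouts than the batch needs and keeping whichever finish first. A hypothetical sketch of the idea, not the APRIL implementation:

```python
def plan_rollouts(batch_size, oversample_ratio=1.5):
    """Rollouts to launch so the fastest batch_size completions can fill the step."""
    return int(batch_size * oversample_ratio)

def collect_batch(completions, batch_size):
    """Keep the first batch_size results, in completion order (not launch order)."""
    return completions[:batch_size]

n = plan_rollouts(32)                    # launch 48 rollouts for a batch of 32
batch = collect_batch(list(range(n)), 32)
# the 16 slowest rollouts never block the step; batch_timeout bounds the wait
```

Raising oversample_ratio trades extra generation compute for lower step latency variance.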

Profiling

For step-level breakdowns, run a few steps under the PyTorch profiler with CUDA activity enabled explicitly, so the cuda_time_total sort below has data:

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    trainer.fit(prompts, num_steps=10)

print(prof.key_averages().table(sort_by="cuda_time_total"))

See Also