Optimize Performance

Tips for maximizing training throughput.

Measure Baseline

Before tuning anything, record a baseline so you can verify that each change actually helps. Register a step callback to log throughput:

@trainer.add_step_callback
def log_perf(result):
    print(f"Throughput: {result.metrics.get('throughput', 0):.1f} samples/s")
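If the framework doesn't report a throughput metric directly, a rough figure can be derived from wall-clock time between steps. A minimal sketch; `ThroughputMeter` is a hypothetical helper, not part of the framework:

```python
import time

class ThroughputMeter:
    """Rough samples/s estimate from wall-clock time between steps."""

    def __init__(self):
        self.last = None

    def step(self, num_samples):
        now = time.perf_counter()
        if self.last is None:
            self.last = now   # first call just records the start time
            return None
        elapsed = now - self.last
        self.last = now
        return num_samples / elapsed

meter = ThroughputMeter()
meter.step(64)           # returns None; starts the clock
time.sleep(0.1)
rate = meter.step(64)    # ~640 samples/s for a ~0.1 s step
```

Measured this way, the number includes everything between optimizer steps (data loading, rollouts, sync), which is exactly what you want for end-to-end tuning.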

GPU Utilization

Watch GPU utilization while training; sustained dips usually point to a data-loading, rollout, or weight-sync bottleneck:

watch -n 1 nvidia-smi
# Target: >80% GPU utilization
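Utilization can also be sampled programmatically by parsing `nvidia-smi`'s CSV query output, which is easier to log alongside training metrics. A sketch; the parsing helper is ours, and it assumes one GPU per output line:

```python
import subprocess

def parse_utilization(csv_text):
    """Parse 'csv,noheader,nounits' output: one integer percent per GPU line."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]

def gpu_utilization():
    """Return per-GPU utilization percentages via nvidia-smi's CSV query."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_utilization(out)

# parse_utilization("87\n92\n") -> [87, 92]
```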

Key Optimizations

1. Increase Batch Size

Larger effective batches keep the GPU busier per optimizer step; gradient accumulation raises the effective size without extra activation memory:

batch_size: 64
gradient_accumulation_steps: 4  # Effective: 256
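The effective batch size in the comment above is the per-step micro-batch times the accumulation steps (times the data-parallel world size, if you scale out). A quick sanity check:

```python
def effective_batch_size(batch_size, grad_accum_steps, dp_world_size=1):
    """Optimizer-step batch size: micro-batch x accumulation x data-parallel ranks."""
    return batch_size * grad_accum_steps * dp_world_size

assert effective_batch_size(64, 4) == 256       # matches the config above
assert effective_batch_size(64, 4, 2) == 512    # doubled again across 2 DP ranks
```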

2. Tune Async Ratio

A higher async ratio overlaps rollout generation with training, at the cost of acting on slightly stale policy weights; the adaptive controller balances the two:

adaptive_async:
  target_staleness: 0.2
  max_async_ratio: 0.8
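One way such an adaptive controller can work is to nudge the async ratio toward whatever keeps measured staleness near the target, clamped to the configured maximum. This is a hypothetical sketch of the idea, not the framework's implementation:

```python
def adjust_async_ratio(ratio, measured_staleness,
                       target_staleness=0.2, max_async_ratio=0.8, step=0.05):
    """Raise the async ratio when staleness is under target, lower it when over."""
    if measured_staleness < target_staleness:
        ratio += step   # rollouts are fresh enough; overlap more work
    else:
        ratio -= step   # policy is getting stale; sync more often
    return min(max(ratio, 0.0), max_async_ratio)

# Low staleness pushes the ratio up; high staleness pulls it back down,
# and the result never exceeds max_async_ratio.
```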

3. Enable Flash Attention

Flash attention computes exact attention without materializing the full score matrix, cutting activation memory and speeding up long sequences:

megatron:
  use_flash_attention: true
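The savings scale quadratically with sequence length, because the score matrix flash attention avoids is seq_len x seq_len per head. A back-of-envelope estimate, not an exact profile:

```python
def naive_attention_score_bytes(batch, heads, seq_len, bytes_per_el=2):
    """Memory for the full attention score matrix that flash attention never materializes."""
    return batch * heads * seq_len * seq_len * bytes_per_el

# Batch 8, 32 heads, 4096-token sequences, bf16 (2 bytes/element):
gib = naive_attention_score_bytes(8, 32, 4096) / 2**30
# about 8 GiB of activation memory avoided per attention layer
```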

4. Use BF16

BF16 keeps FP32's 8-bit exponent range, so it trains stably without the loss scaling that FP16 requires:

megatron:
  bf16: true
  fp16: false
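The range difference falls directly out of the formats' exponent and mantissa widths. Worked out for an IEEE-style layout (5/10 bits for FP16, 8/7 bits for BF16):

```python
def max_normal(exp_bits, mantissa_bits):
    """Largest finite value of an IEEE-style float with the given field widths."""
    emax = 2 ** (exp_bits - 1) - 1
    return (2 - 2 ** -mantissa_bits) * 2.0 ** emax

fp16_max = max_normal(5, 10)  # 65504.0 -- gradients overflow this easily
bf16_max = max_normal(8, 7)   # ~3.39e38, same exponent range as fp32
```

BF16 pays for the range with precision (7 mantissa bits vs 10), which mixed-precision training tolerates well.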

5. Optimize Weight Sync

Delta sync transfers only weights that changed since the last push, and CUDA IPC shares them between colocated processes without a host round trip:

weight_sync:
  method: delta
  use_cuda_ipc: true
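The delta idea can be illustrated over plain dicts: only parameters that moved beyond a tolerance cross the wire, and the receiver patches them into its copy. A simplified sketch; the real mechanism and the CUDA IPC transport are framework-internal:

```python
def compute_delta(prev, curr, atol=0.0):
    """Parameters whose values moved more than atol since the last sync."""
    return {name: val for name, val in curr.items()
            if name not in prev or abs(val - prev[name]) > atol}

def apply_delta(weights, delta):
    """Patch the received delta into the local weight copy."""
    weights.update(delta)
    return weights

prev = {"w1": 1.0, "w2": 2.0, "w3": 3.0}
curr = {"w1": 1.0, "w2": 2.5, "w3": 3.0}
delta = compute_delta(prev, curr)        # only w2 changed, so only w2 is sent
synced = apply_delta(dict(prev), delta)  # receiver now matches curr
```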

6. Tune APRIL

Oversampling rollouts lets the batch fill from the fastest generations instead of waiting on stragglers; the timeout caps how long a step can wait:

rollout:
  oversample_ratio: 1.5
  batch_timeout: 20.0
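The scheme amounts to launching more rollouts than the batch needs and keeping whichever finish first. A hypothetical sketch of the idea, not the APRIL implementation:

```python
def plan_rollouts(batch_size, oversample_ratio=1.5):
    """Rollouts to launch so the fastest batch_size completions can fill the step."""
    return int(batch_size * oversample_ratio)

def collect_batch(completions, batch_size):
    """Keep the first batch_size results, in completion order (not launch order)."""
    return completions[:batch_size]

n = plan_rollouts(32)                    # launch 48 rollouts for a batch of 32
batch = collect_batch(list(range(n)), 32)
# the 16 slowest rollouts never block the step; batch_timeout bounds the wait
```

Raising oversample_ratio trades extra generation compute for lower step latency variance.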

Profiling

For step-level breakdowns, run a few steps under the PyTorch profiler with CUDA activity enabled explicitly, so the cuda_time_total sort below has data:

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    trainer.fit(prompts, num_steps=10)

print(prof.key_averages().table(sort_by="cuda_time_total"))

See Also