Optimize Performance¶
Tips for maximizing training throughput.
Measure Baseline¶
@trainer.add_step_callback
def log_perf(result):
    # Report the throughput recorded for this step (0 if the metric is missing).
    print(f"Throughput: {result.metrics.get('throughput', 0):.1f} samples/s")
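Per-step throughput tends to be noisy, and the first few steps often include one-off warm-up costs. A minimal sketch of a smoothed baseline follows; `ThroughputMeter` is a hypothetical helper, not part of the trainer API, and the readings are made-up numbers for illustration.

```python
from collections import deque


class ThroughputMeter:
    """Rolling mean of per-step throughput, ignoring warm-up steps."""

    def __init__(self, window=20, warmup_steps=3):
        # The deque keeps only the newest `window` readings.
        self.values = deque(maxlen=window)
        self.warmup_steps = warmup_steps
        self.seen = 0

    def update(self, samples_per_sec):
        """Record one reading; return the rolling mean, or None during warm-up."""
        self.seen += 1
        if self.seen > self.warmup_steps:
            self.values.append(samples_per_sec)
        if not self.values:
            return None
        return sum(self.values) / len(self.values)


# Illustrative readings: the first one is a slow warm-up step.
meter = ThroughputMeter(window=3, warmup_steps=1)
for reading in [5.0, 100.0, 110.0, 120.0, 130.0]:
    baseline = meter.update(reading)
print(f"Baseline: {baseline:.1f} samples/s")
```

Inside a step callback, the reading could come from `result.metrics.get('throughput', 0)`. Record the smoothed value before changing any setting so that each optimization is compared against the same baseline.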
GPU Utilization¶
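One way to check GPU utilization while training runs is to poll `nvidia-smi`. The sketch below assumes an NVIDIA driver with `nvidia-smi` on the `PATH`; the helper names and the sample text are illustrative, and the parser assumes every queried field is numeric.

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(text):
    """Parse `nvidia-smi` CSV rows (no header, no units) into one dict per GPU."""
    stats = []
    for line in text.strip().splitlines():
        util, used, total = (field.strip() for field in line.split(","))
        stats.append({
            "util_pct": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return stats


def query_gpu_stats():
    """Run nvidia-smi once and return the parsed stats (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(out)


# Parse a made-up sample so the sketch runs on a machine without a GPU.
sample = "87, 30123, 81920\n12, 1024, 81920\n"
for gpu_index, gpu in enumerate(parse_gpu_stats(sample)):
    print(gpu_index, gpu)
```

Calling `query_gpu_stats()` every few seconds from a separate process gives a rough utilization trace. Persistently low utilization on some GPUs usually means they are waiting on something else, such as data, communication, or another stage of the pipeline.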
Key Optimizations¶
1. Increase Batch Size¶
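A common way to pick a larger batch size is to search for the biggest value that still fits in GPU memory. The sketch below is a generic search, not a trainer feature: `fits` is a hypothetical callable you supply that runs one trial step at the given batch size and returns `False` when it runs out of memory. It assumes that if one batch size fits, every smaller one fits too.

```python
def find_max_batch_size(fits, start=1, limit=4096):
    """Return the largest batch size in [start, limit] for which fits() is True."""
    if not fits(start):
        raise ValueError(f"batch size {start} does not fit")

    # Phase 1: double until a trial fails or the next value would exceed the limit.
    lo, hi = start, start * 2
    while hi <= limit and fits(hi):
        lo, hi = hi, hi * 2
    hi = min(hi, limit + 1)  # hi is now either a failing size or just past the limit

    # Phase 2: binary search between the last success (lo) and the bound (hi).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo


# Stand-in for a real trial step: pretend anything up to 100 samples fits.
print(find_max_batch_size(lambda batch_size: batch_size <= 100))
```

In practice it is safer to run slightly below the value found, since memory use can vary between steps, for example with sequence length.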
2. Tune Async Ratio¶
3. Enable Flash Attention¶
4. Use BF16¶
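BF16 keeps the sign bit and all 8 exponent bits of FP32 but only 7 mantissa bits, so it covers the same numeric range as FP32 at lower precision. FP16, by contrast, tops out at 65504. The sketch below emulates BF16 in plain Python by truncating an FP32 bit pattern; `to_bf16` is an illustrative helper, and real hardware typically rounds to nearest rather than truncating.

```python
import struct


def to_bf16(x):
    """Round-trip x through a truncated BF16 representation."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # float32 bit pattern
    bits &= 0xFFFF0000  # keep sign, 8 exponent bits, and the top 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(to_bf16(3.14159))   # precision drops to about 3 significant digits
print(to_bf16(70000.0))   # still finite, although it exceeds the FP16 maximum
```

The wide range is why BF16 training usually does not need the loss scaling that FP16 mixed precision relies on.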
5. Optimize Weight Sync¶
6. Tune APRIL¶
Profiling¶
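from torch.profiler import profile

# Profile a short run; `trainer` and `prompts` come from your training script.
with profile() as prof:
    trainer.fit(prompts, num_steps=10)

# List operators sorted by total GPU time, most expensive first.
print(prof.key_averages().table(sort_by="cuda_time_total"))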