# Debug Training Issues
Common issues and how to fix them.
## Loss Not Decreasing

**Symptoms:** Loss stays flat or increases.

**Causes & Solutions:**
| Cause | Solution |
|---|---|
| Learning rate too high | Reduce by 5-10x |
| Learning rate too low | Increase by 2-5x |
| Reward function broken | Check reward distribution |
| Batch size too small | Increase or use grad accumulation |
**Debugging:**

```python
@trainer.add_step_callback
def debug_loss(result):
    print(f"Step {result.step}:")
    print(f"  Loss: {result.metrics.get('loss', 'N/A')}")
    print(f"  Grad norm: {result.metrics.get('grad_norm', 'N/A')}")
```
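For the "reward function broken" row in the table above, a quick distribution check catches the most common failure: a collapsed (constant) reward. A minimal sketch, assuming rewards are collected as a flat list of floats (the helper name is hypothetical, not part of this framework's API):

```python
import statistics

def check_reward_distribution(rewards):
    """Summarize a batch of rewards.

    A near-zero standard deviation means every sample gets the same
    reward, so the policy gradient carries no learning signal.
    """
    stats = {
        "mean": statistics.mean(rewards),
        "std": statistics.pstdev(rewards),
        "min": min(rewards),
        "max": max(rewards),
    }
    # Flag a degenerate (constant) reward distribution.
    stats["collapsed"] = stats["std"] < 1e-6
    return stats
```

Logging these four statistics every few steps is usually enough to spot a reward bug before the loss curve does.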
## High Staleness

**Symptoms:** `staleness > 0.3` consistently.

**Solutions:**

```yaml
adaptive_async:
  max_async_ratio: 0.5   # Reduce async
  target_staleness: 0.1  # Lower target

weight_sync:
  sync_interval: 1       # Sync every step
```
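To monitor the effect of these settings, staleness can be tracked per batch. One common definition, sketched here with hypothetical names (the framework's own metric may be computed differently), is the fraction of samples generated under a policy version older than the current one:

```python
def batch_staleness(sample_versions, current_version):
    """Fraction of samples in a batch that were generated by an
    older policy version than the one being trained right now.

    sample_versions: list of policy-version integers, one per sample.
    current_version: the trainer's current policy version.
    """
    stale = sum(1 for v in sample_versions if v < current_version)
    return stale / len(sample_versions)
```

If this number stays above the target after lowering `max_async_ratio`, the rollout workers are likely receiving weight updates too slowly.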
## Out of Memory

**Symptoms:** CUDA OOM during training.

**Solutions:**
- Reduce batch size
- Enable gradient checkpointing
- Use smaller model
- Increase tensor parallelism
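The first option trades memory for extra optimizer steps unless paired with gradient accumulation. A toy sketch of the accumulation idea, assuming gradients are plain lists of floats (real frameworks do this on tensors): averaging over micro-batches keeps the effective batch size while only one micro-batch's activations are resident at a time.

```python
def accumulate_gradients(micro_batch_grads):
    """Average per-micro-batch gradients.

    Produces the same result as computing the gradient on one large
    batch, but peak memory scales with the micro-batch size instead
    of the full batch size.
    """
    n = len(micro_batch_grads)
    dim = len(micro_batch_grads[0])
    return [sum(g[i] for g in micro_batch_grads) / n for i in range(dim)]
```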
## NaN in Loss

**Symptoms:** Loss becomes NaN.

**Solutions:**
- Reduce learning rate
- Add gradient clipping
- Check for log(0) in rewards
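The last two fixes can be sketched directly. A minimal example of a `log(0)` guard and global-norm gradient clipping, assuming gradients are plain floats (helper names are hypothetical, not this framework's API):

```python
import math

def safe_log(x, eps=1e-8):
    # Clamp the argument: log(0) returns -inf, which turns into NaN
    # as soon as it mixes with other terms in the loss.
    return math.log(max(x, eps))

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale the gradient vector so its L2 norm never exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        grads = [g * scale for g in grads]
    return grads
```

In PyTorch the equivalent of the second helper is `torch.nn.utils.clip_grad_norm_`.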
## Model Degrades

**Symptoms:** Outputs get worse as training progresses.

**Solutions:**
- Add KL penalty
- Reduce learning rate
- Check for reward hacking
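A KL penalty keeps the policy from drifting arbitrarily far from a reference model, which is the usual cause of degrading outputs. A toy sketch using the simple `logp_policy - logp_ref` per-token KL estimate (names and coefficient are illustrative, not this framework's API):

```python
def kl_penalized_reward(reward, logp_policy, logp_ref, kl_coef=0.1):
    """Subtract a per-token KL estimate from the reward.

    When the policy assigns much higher log-probability to its own
    output than the reference model does, the penalty grows and pulls
    the policy back toward the reference.
    """
    kl = logp_policy - logp_ref  # simple per-token KL estimate
    return reward - kl_coef * kl
```

If the penalized reward rises while the raw reward is flat or falling, that is a strong hint of reward hacking.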
## Slow Training

**Symptoms:** Low throughput.

**Solutions:**
- Increase async ratio
- Use more workers
- Check network bottlenecks
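Before applying any of these, measure raw step throughput to tell whether the trainer or the rollout workers are the bottleneck. A minimal timing sketch with hypothetical names (`step_fn` stands in for one training step):

```python
import time

def measure_throughput(step_fn, samples_per_step, n_steps=10):
    """Time n_steps calls to step_fn and return samples per second.

    Compare this against the rollout workers' generation rate: if the
    trainer is much faster, generation is the bottleneck and raising
    the async ratio or adding workers will help; if it is much slower,
    the optimizer step or weight sync is the bottleneck.
    """
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return samples_per_step * n_steps / elapsed
```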