Debug Training Issues

Common issues and how to fix them.

Loss Not Decreasing

Symptoms: Loss stays flat or increases

Causes & Solutions:

| Cause                  | Solution                                 |
| ---------------------- | ---------------------------------------- |
| Learning rate too high | Reduce by 5-10x                          |
| Learning rate too low  | Increase by 2-5x                         |
| Reward function broken | Check the reward distribution            |
| Batch size too small   | Increase it or use gradient accumulation |

Debugging:

@trainer.add_step_callback
def debug_loss(result):
    print(f"Step {result.step}:")
    print(f"  Loss: {result.metrics.get('loss', 'N/A')}")
    print(f"  Grad norm: {result.metrics.get('grad_norm', 'N/A')}")

High Staleness

Symptoms: staleness > 0.3 consistently

Solutions:

adaptive_async:
  max_async_ratio: 0.5  # Reduce async
  target_staleness: 0.1  # Lower target

weight_sync:
  sync_interval: 1  # Sync every step
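
To confirm the change helps, you can watch the staleness value over time with the same step-callback pattern used above. A short sketch; the "staleness" metric key is an assumption here, so substitute whatever key your trainer actually reports:

@trainer.add_step_callback
def track_staleness(result):
    # "staleness" is an assumed metric name - adjust to your trainer
    s = result.metrics.get("staleness")
    if s is not None and s > 0.3:
        print(f"Step {result.step}: staleness {s:.2f} is above the 0.3 threshold")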

Out of Memory

Symptoms: CUDA out-of-memory (OOM) errors during training

Solutions:

  1. Reduce the batch size
  2. Enable gradient checkpointing
  3. Use a smaller model
  4. Increase tensor parallelism

batch_size: 16  # Reduce
megatron:
  activation_checkpointing: true
  tp_size: 2  # Increase TP
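
To see how close you are to the limit before an OOM actually triggers, PyTorch's built-in memory counters can be printed between steps. A minimal sketch, independent of the trainer:

import torch

def print_gpu_memory(device=0):
    """Report allocated vs. reserved CUDA memory on one device, in GB."""
    alloc = torch.cuda.memory_allocated(device) / 1e9
    peak = torch.cuda.max_memory_allocated(device) / 1e9
    reserved = torch.cuda.memory_reserved(device) / 1e9
    total = torch.cuda.get_device_properties(device).total_memory / 1e9
    print(f"GPU {device}: {alloc:.1f} GB allocated (peak {peak:.1f} GB), "
          f"{reserved:.1f} GB reserved of {total:.1f} GB")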

NaN in Loss

Symptoms: Loss becomes NaN

Solutions:

  1. Reduce the learning rate
  2. Add gradient clipping
  3. Check for log(0) in the reward function

algorithm:
  max_grad_norm: 0.5  # Tighter clipping
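
It also helps to fail fast when the first NaN appears instead of continuing to train on it. A minimal sketch using the same step-callback pattern as above, assuming the loss metric is reported as a plain Python float (raising an exception is one option; you could also just log and stop the run manually):

import math

@trainer.add_step_callback
def detect_nan(result):
    loss = result.metrics.get("loss")
    if loss is not None and math.isnan(loss):
        # Dump all metrics so you can see which value blew up first
        print(f"Step {result.step}: loss is NaN, metrics={result.metrics}")
        raise RuntimeError("Stopping: loss became NaN")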

Model Degrades

Symptoms: Outputs get worse over training

Solutions:

  1. Add a KL penalty
  2. Reduce the learning rate
  3. Check for reward hacking

algorithm:
  kl_coef: 0.1  # Add KL penalty
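
A quick way to spot reward hacking is to read the completions the reward function likes most and judge them by eye. A minimal sketch, assuming you can collect (prompt, completion, reward) tuples from your rollout loop (the tuple format is illustrative, not a trainer API):

def print_top_rewarded(samples, k=5):
    """Show the k highest-reward completions for manual inspection."""
    top = sorted(samples, key=lambda s: s[2], reverse=True)[:k]
    for prompt, completion, reward in top:
        print(f"reward={reward:.3f}")
        print(f"  prompt:     {prompt[:120]!r}")
        print(f"  completion: {completion[:200]!r}")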

Slow Training

Symptoms: Low throughput

Check:

watch -n 1 nvidia-smi  # GPU utilization

Solutions:

  1. Increase the async ratio
  2. Use more workers
  3. Check for network bottlenecks
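
To put a number on "low throughput", time each step with the same callback pattern used earlier. A sketch; the batch size used for the rate is whatever you configured, shown here as an assumed constant:

import time

BATCH_SIZE = 16  # match your configured batch_size
_last_step_time = [time.monotonic()]

@trainer.add_step_callback
def measure_throughput(result):
    now = time.monotonic()
    step_seconds = now - _last_step_time[0]
    _last_step_time[0] = now
    print(f"Step {result.step}: {step_seconds:.1f} s/step, "
          f"{BATCH_SIZE / step_seconds:.1f} samples/s")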

See Also