Skip to content

Example Configurations

Copy and modify these templates for common scenarios.

Stable Training (Default)

model_path: Qwen/Qwen3-8B
output_dir: ./outputs

num_steps: 5000
batch_size: 32
learning_rate: 1.0e-6

adaptive_async:
  target_staleness: 0.15
  min_async_ratio: 0.1
  max_async_ratio: 0.7

algorithm:
  name: grpo
  clip_range: 0.2

sglang:
  base_url: http://localhost:8000

High Throughput

model_path: Qwen/Qwen3-8B
output_dir: ./outputs

num_steps: 10000
batch_size: 64
learning_rate: 5.0e-7

adaptive_async:
  target_staleness: 0.25
  max_async_ratio: 0.9

rollout:
  oversample_ratio: 2.0
  batch_timeout: 20.0

Maximum Stability

model_path: Qwen/Qwen3-8B
output_dir: ./outputs

num_steps: 5000
batch_size: 16
learning_rate: 5.0e-7

adaptive_async:
  target_staleness: 0.05
  max_async_ratio: 0.3

algorithm:
  name: ppo
  clip_range: 0.1
  kl_coef: 0.2

DPO Training

model_path: Qwen/Qwen3-8B
output_dir: ./outputs/dpo

num_steps: 1000
batch_size: 8
learning_rate: 5.0e-7

algorithm:
  name: dpo
  beta: 0.1

adaptive_async:
  enabled: false

Multi-GPU (8x H100)

model_path: Qwen/Qwen3-72B
output_dir: ./outputs

num_steps: 10000
batch_size: 128
learning_rate: 1.0e-6

megatron:
  tp_size: 4
  dp_size: 2
  bf16: true
  use_flash_attention: true

adaptive_async:
  target_staleness: 0.15

See Also