# DAPO (Decoupled Clip and Dynamic Sampling)

Advanced PPO variant with separate clipping for positive/negative advantages.

## Overview
DAPO improves on PPO in three ways:

1. **Decoupled clipping**: different clip ranges for positive vs. negative advantages
2. **Dynamic sampling**: adjusts sampling based on advantage magnitude
3. **Token-level loss**: per-token weighting
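Dynamic sampling has no dedicated subsection below, so here is a rough sketch of one reading of "adjusts sampling based on advantage magnitude": rollout groups whose advantages are all near zero carry no policy-gradient signal, so they can be dropped and replaced with fresh prompts. The function name, tensor shapes, and threshold are illustrative assumptions, not Flux internals.

```python
import torch

def filter_zero_signal_groups(advantages: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Return a keep-mask over rollout groups.

    advantages: (num_groups, rollouts_per_group). Groups whose advantages
    are all ~zero contribute no gradient, so the caller can drop them and
    sample new prompts instead. Name, shapes, and threshold are assumed
    for illustration only.
    """
    magnitude = advantages.abs().max(dim=-1).values  # per-group advantage magnitude
    return magnitude > eps  # False = drop this group and resample
```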
## Key Innovations
### Decoupled Clipping
- For positive advantages (good actions), the upper bound `clip_high = 1 + ε_high` applies, allowing a larger probability increase.
- For negative advantages (bad actions), the lower bound `clip_low = 1 - ε_low` applies, allowing a larger probability decrease.
With ε_high larger than ε_low (0.28 vs. 0.2 by default), this asymmetry lets the model learn more from positive examples.
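To make the asymmetry concrete, here is a minimal sketch of a DAPO-style clipped policy loss over per-token log-probabilities and advantages. The function name and the mean reduction are illustrative, not Flux's actual implementation:

```python
import torch

def dapo_clip_loss(log_probs, old_log_probs, advantages,
                   eps_low=0.2, eps_high=0.28):
    # Importance ratio between the current and the rollout-time policy.
    ratio = torch.exp(log_probs - old_log_probs)
    # Decoupled clip range [1 - eps_low, 1 + eps_high]: the wider upper
    # bound gives positive-advantage tokens more room to grow, while the
    # lower bound governs how far negative-advantage tokens can shrink.
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Standard pessimistic PPO surrogate, applied over the decoupled range.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```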
## Configuration

```yaml
algorithm:
  name: dapo
  clip_ratio_low: 0.2     # Clip for negative advantages
  clip_ratio_high: 0.28   # Clip for positive advantages
  dynamic_sampling: true  # Enable dynamic sampling
  token_level_loss: true  # Per-token weighting
  entropy_coef: 0.01
```
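The `token_level_loss` flag enables per-token weighting. A common reading of that (assumed here, not confirmed by this page) is averaging the loss over all tokens in the batch rather than averaging within each response first; the sketch below contrasts the two reductions, with `mask` marking non-padding tokens:

```python
import torch

def reduce_loss(per_token_loss, mask, token_level=True):
    # per_token_loss, mask: (batch, seq_len); mask is 1.0 on real tokens.
    # The exact reduction Flux uses is an assumption here.
    if token_level:
        # Token-level: average over every valid token in the batch, so
        # longer responses contribute more terms to the objective.
        return (per_token_loss * mask).sum() / mask.sum()
    # Sequence-level: average within each response first, then across
    # responses, so every response weighs equally regardless of length.
    per_seq = (per_token_loss * mask).sum(dim=-1) / mask.sum(dim=-1)
    return per_seq.mean()
```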
## When to Use

Best for:

- High-variance reward functions
- When PPO training is unstable
- When fine-grained control over clipping is needed

Compared to PPO:

- More stable with noisy rewards
- Better sample efficiency
- Slightly more complex (two clip ranges to tune instead of one)
## Usage

```python
from flux import FluxConfig, FluxTrainer

config = FluxConfig(
    model_path="Qwen/Qwen3-8B",
    algorithm="dapo",
    algorithm_config={
        "clip_ratio_low": 0.2,
        "clip_ratio_high": 0.28,
        "dynamic_sampling": True,
    },
)

trainer = FluxTrainer(config)
trainer.fit(prompts="data.jsonl")
```
## Key Parameters

| Parameter | Default | Description |
|---|---|---|
| `clip_ratio_low` | `0.2` | Clip range for negative advantages |
| `clip_ratio_high` | `0.28` | Clip range for positive advantages |
| `dynamic_sampling` | `true` | Enable dynamic sampling |
| `token_level_loss` | `true` | Per-token loss weighting |
## See Also
- PPO - Simpler baseline
- GRPO - Group-based alternative
- Algorithms Overview