# Algorithms
Flux supports multiple reinforcement learning algorithms for LLM post-training. This guide helps you choose the right algorithm for your use case.
## Algorithm Overview

```mermaid
graph TD
    subgraph OnPolicy["On-Policy Algorithms"]
        PPO[PPO]
        GRPO[GRPO]
        REINFORCE[REINFORCE]
        DAPO[DAPO]
        RLOO[RLOO]
        GSPO[GSPO]
    end
    subgraph Preference["Preference-Based"]
        DPO[DPO]
    end
    Reward[Reward Model] --> OnPolicy
    Preferences[Preference Pairs] --> Preference
```
## Quick Comparison
| Algorithm | Type | Stability | Efficiency | Best For |
|---|---|---|---|---|
| PPO | On-policy | ★★★★★ | ★★★☆☆ | General, stable training |
| GRPO | On-policy | ★★★★☆ | ★★★★★ | Multi-sample tasks |
| DPO | Preference | ★★★★☆ | ★★★★☆ | Preference data |
| REINFORCE | On-policy | ★★★☆☆ | ★★★☆☆ | Simple baseline |
| DAPO | On-policy | ★★★★★ | ★★★☆☆ | High-variance rewards |
| RLOO | On-policy | ★★★★☆ | ★★★★☆ | Variance reduction |
## Choosing an Algorithm

```mermaid
graph TD
    Start[Start] --> Q1{Do you have<br/>preference data?}
    Q1 -->|Yes| DPO[Use DPO]
    Q1 -->|No| Q2{Can you generate<br/>multiple responses<br/>per prompt?}
    Q2 -->|Yes| GRPO[Use GRPO<br/>Recommended]
    Q2 -->|No| Q3{Is stability<br/>critical?}
    Q3 -->|Yes| PPO[Use PPO]
    Q3 -->|No| REINFORCE[Use REINFORCE]
```
### Decision Guide
| Your Situation | Recommended Algorithm |
|---|---|
| Starting out, want something that works | GRPO (default) |
| Have preference pairs (chosen/rejected) | DPO |
| Need maximum stability | PPO |
| Working with high-variance rewards | DAPO |
| Want to reduce variance with multiple samples | RLOO |
| Simple baseline for comparison | REINFORCE |
| Large-scale distributed training | GSPO |
## Algorithm Details
### PPO (Proximal Policy Optimization)

The classic, battle-tested algorithm for RLHF.

```python
config = FluxConfig(
    algorithm="ppo",
    algorithm_config={
        "clip_ratio": 0.2,
        "kl_penalty": 0.1,
        "target_kl": 0.01,
    }
)
```
Loss Function:
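The standard clipped surrogate objective from the PPO literature, with a KL penalty toward the reference policy; here $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\text{old}}}(a_t \mid s_t)$, $\epsilon$ corresponds to `clip_ratio`, and $\beta$ to `kl_penalty`:

$$\mathcal{L}_{\text{PPO}}(\theta) = -\mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right] + \beta\,\mathrm{KL}\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]$$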
### GRPO (Group Relative Policy Optimization)

Default in Flux. Samples a group of responses per prompt and scores each against the group's statistics, removing the need for a learned value function.
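Following the config pattern used for the other algorithms on this page, a GRPO setup might look like the sketch below; the `group_size` and `kl_coef` keys are illustrative assumptions, not confirmed Flux options:

```python
config = FluxConfig(
    algorithm="grpo",
    algorithm_config={
        "group_size": 8,    # responses sampled per prompt (assumed key)
        "kl_coef": 0.001,   # KL regularization strength (assumed key)
    }
)
```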
Advantage Estimation:
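In the standard GRPO formulation, for a group of $G$ responses to the same prompt with rewards $r_1, \dots, r_G$, each response's advantage is its group-normalized reward:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}$$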
### DPO (Direct Preference Optimization)

Directly optimizes the policy on preference pairs, without training an explicit reward model.
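By analogy with the other algorithm configs on this page, a DPO setup might look like this sketch; the `beta` and `label_smoothing` keys are illustrative assumptions, not confirmed Flux options:

```python
config = FluxConfig(
    algorithm="dpo",
    algorithm_config={
        "beta": 0.1,             # preference temperature (assumed key)
        "label_smoothing": 0.0,  # robustness to noisy labels (assumed key)
    }
)
```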
Loss Function:
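The standard DPO objective, for a chosen/rejected pair $(y_w, y_l)$, reference policy $\pi_{\text{ref}}$, and temperature $\beta$:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$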
### REINFORCE

Basic policy gradient with reward-weighted log probabilities.

```python
config = FluxConfig(
    algorithm="reinforce",
    algorithm_config={
        "baseline": "moving_average",
        "baseline_decay": 0.99,
    }
)
```
Loss Function:
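The classic policy-gradient objective, with return $R$ and baseline $b$ (here a moving average updated with decay `baseline_decay`):

$$\mathcal{L}_{\text{REINFORCE}}(\theta) = -\mathbb{E}\left[(R - b)\,\log \pi_\theta(y \mid x)\right]$$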
### DAPO (Decoupled Clip and Dynamic Sampling)

Advanced PPO variant that decouples the lower and upper clipping bounds of the importance ratio and dynamically resamples prompts whose responses are uninformative.

```python
config = FluxConfig(
    algorithm="dapo",
    algorithm_config={
        "clip_ratio_low": 0.2,
        "clip_ratio_high": 0.28,
        "dynamic_sampling": True,
        "token_level_loss": True,
    }
)
```
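With the decoupled bounds above ($\epsilon_{\text{low}}$ = `clip_ratio_low`, $\epsilon_{\text{high}}$ = `clip_ratio_high`), the clipped objective takes the asymmetric form:

$$\mathcal{L}_{\text{DAPO}}(\theta) = -\mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}}\right)\hat{A}_t\right)\right]$$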
### RLOO (REINFORCE Leave-One-Out)

Uses a leave-one-out baseline across multiple samples per prompt for variance reduction.
Advantage:
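In the standard RLOO formulation with $k$ samples per prompt, each sample's baseline is the mean reward of the other $k-1$ samples:

$$\hat{A}_i = r_i - \frac{1}{k-1}\sum_{j \neq i} r_j$$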
## Importance Correction

All algorithms in Flux support importance weight correction for off-policy data:

```python
# Automatically applied when using async training
importance_weight = clip(
    exp(current_logprobs - behavior_logprobs),
    min=0.5,
    max=2.0
)

# Staleness decay
staleness_weight = decay ** version_gap

# Final weight
weight = importance_weight * staleness_weight
```
This allows algorithms to work correctly even when training on slightly stale data.
## Custom Algorithms

Flux makes it easy to add custom algorithms using the registry pattern:

```python
import torch

from flux.training.algorithms.base import (
    register_adv_estimator,
    register_policy_loss,
)

@register_adv_estimator("my_advantage")
def compute_my_advantage(rewards, mask, **kwargs):
    # Center rewards as a simple advantage estimate
    advantages = rewards - rewards.mean()
    returns = rewards
    return advantages, returns

@register_policy_loss("my_loss")
def compute_my_loss(old_logp, logp, advantages, mask, **kwargs):
    # Importance ratio between current and behavior policy
    ratio = torch.exp(logp - old_logp)
    # Masked mean of the weighted advantages
    loss = -(ratio * advantages * mask).sum() / mask.sum()
    metrics = {"loss": loss.item()}
    return loss, metrics

# Use in config
config = FluxConfig(
    algorithm="my_loss",
    algorithm_config={
        "advantage_estimator": "my_advantage",
    }
)
```
## Hyperparameter Recommendations

### Learning Rate
| Model Size | Recommended LR |
|---|---|
| < 1B | 1e-5 to 5e-6 |
| 1B - 10B | 5e-6 to 1e-6 |
| 10B - 70B | 1e-6 to 5e-7 |
| > 70B | 5e-7 to 1e-7 |
### Clip Ratio (PPO/DAPO)

- `0.1`: Conservative, very stable
- `0.2`: Standard, good balance
- `0.3`: Aggressive, faster learning
### Group Size (GRPO/RLOO)

- `4`: Good cost/benefit balance
- `8`: Better estimates, 2x compute
- `16`: Best estimates, 4x compute
## Debugging Tips

### Loss Not Decreasing
- Check reward scale (normalize to [-1, 1])
- Reduce learning rate
- Increase batch size
- Check for NaN in log probs
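For the first point, a generic way to rescale a batch of rewards into [-1, 1] (a plain-Python sketch, not a built-in Flux helper; the same min-max logic applies to tensors):

```python
def normalize_rewards(rewards):
    """Min-max rescale raw rewards into [-1, 1]; constant batches map to 0."""
    lo, hi = min(rewards), max(rewards)
    if hi == lo:
        return [0.0 for _ in rewards]
    return [2 * (r - lo) / (hi - lo) - 1 for r in rewards]

print(normalize_rewards([0.0, 2.0, 4.0]))  # [-1.0, 0.0, 1.0]
```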
### Training Unstable
- Add KL penalty
- Reduce learning rate
- Reduce async ratio
- Use DAPO instead of PPO
### Poor Sample Efficiency
- Try GRPO or RLOO
- Increase group size
- Enable curriculum learning
- Check reward function quality
## Next Steps
- PPO Deep Dive - Detailed PPO guide
- GRPO Deep Dive - Detailed GRPO guide
- Custom Algorithms - Create your own
- Tutorials - Hands-on examples