# DPO (Direct Preference Optimization)
Train directly from preference pairs without a reward model.
## Overview
DPO bypasses reward modeling by directly optimizing the policy using preference pairs (chosen vs rejected responses).
## Loss Function
\[
L_{DPO} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)
\]
where:

- \(y_w\) = chosen (winning) response
- \(y_l\) = rejected (losing) response
- \(\beta\) = temperature parameter controlling how strongly the policy is held to the reference
- \(\pi_{ref}\) = reference (original) policy
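The trainer computes this loss internally; purely as an illustration, here is a minimal PyTorch sketch of the per-pair objective, assuming per-sequence log-probabilities have already been summed over response tokens (the function and argument names below are illustrative, not part of the Flux API):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Per-example DPO loss from summed sequence log-probs.

    Each argument is a 1-D tensor of shape (batch,) holding
    log pi(y|x) summed over the response tokens.
    """
    # Log-ratio of policy vs. reference for chosen and rejected responses
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # Beta-scaled margin inside the sigmoid of the DPO objective
    logits = beta * (chosen_logratios - rejected_logratios)

    # -log(sigmoid(x)) written in the numerically stable form softplus(-x)
    return F.softplus(-logits).mean()
```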
## Configuration
```yaml
algorithm:
  name: dpo
  beta: 0.1               # Temperature parameter
  reference_free: false   # Use reference model
  label_smoothing: 0.0    # Optional smoothing
```
## Data Format
DPO requires preference pairs:
```json
{
  "prompt": "Explain quantum computing.",
  "chosen": "Quantum computing uses qubits...",
  "rejected": "I don't really understand quantum stuff..."
}
```
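Datasets in this shape are plain JSON Lines, one object per line. As a small illustrative helper (not part of Flux), preference triples can be written out like this:

```python
import json

def write_preferences(pairs, path="preferences.jsonl"):
    """Write (prompt, chosen, rejected) triples as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, chosen, rejected in pairs:
            record = {"prompt": prompt, "chosen": chosen, "rejected": rejected}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

pairs = [
    ("Explain quantum computing.",
     "Quantum computing uses qubits...",
     "I don't really understand quantum stuff..."),
]
write_preferences(pairs)
```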
## When to Use

Best for:

- When you have preference data
- Simpler training pipelines
- No reward model needed

Compared to GRPO/PPO:

- Simpler setup
- No rollout generation required
- Requires paired preferences
## Usage
```python
from flux import FluxConfig, FluxTrainer

config = FluxConfig(
    model_path="Qwen/Qwen3-8B",
    algorithm="dpo",
    algorithm_config={
        "beta": 0.1,
    },
)

trainer = FluxTrainer(config)
trainer.fit(
    prompts="preferences.jsonl",
    data_format="preferences",
)
```
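If keeping a frozen reference model in memory is a concern, the `reference_free` option from the Configuration section can presumably be passed through `algorithm_config` the same way as `beta`; a sketch of that variant:

```python
config = FluxConfig(
    model_path="Qwen/Qwen3-8B",
    algorithm="dpo",
    algorithm_config={
        "beta": 0.1,
        "reference_free": True,  # train without a frozen reference model
    },
)
```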
## Key Parameters

| Parameter | Default | Description |
|---|---|---|
| `beta` | `0.1` | Temperature; higher values penalize deviation from the reference policy more strongly |
| `reference_free` | `false` | Skip the reference model during training |
| `label_smoothing` | `0.0` | Label smoothing applied to the preference labels (useful when annotations are noisy) |