
DPO (Direct Preference Optimization)

Train directly from preference pairs without a reward model.

Overview

DPO bypasses explicit reward modeling: instead of training a reward model and then running RL against it, it optimizes the policy directly on preference pairs (chosen vs. rejected responses).

Loss Function

\[ L_{DPO} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right) \]

Where:

- \(y_w\) = chosen (winning) response
- \(y_l\) = rejected (losing) response
- \(\beta\) = temperature parameter controlling how strongly the policy is tied to the reference
- \(\pi_{ref}\) = reference (original) policy, kept frozen during training
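
To make the loss concrete, here is a minimal PyTorch sketch. This is not Flux's internal implementation; the function name and tensor layout are illustrative. Each argument is a batch of summed per-token log-probabilities for a full response under the given model.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of policy vs. frozen reference for each response
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin between chosen and rejected, scaled by beta
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(margin), averaged over the batch
    return -F.logsigmoid(logits).mean()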

Configuration

algorithm:
  name: dpo
  beta: 0.1               # Temperature; higher values keep the policy closer to the reference
  reference_free: false   # Set true to train without a reference model
  label_smoothing: 0.0    # Optional smoothing for noisy preference labels

Data Format

DPO requires preference pairs:

{
  "prompt": "Explain quantum computing.",
  "chosen": "Quantum computing uses qubits...",
  "rejected": "I don't really understand quantum stuff..."
}
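
To sanity-check a dataset before training, a small loader like the one below works. load_preference_pairs is a hypothetical helper, not part of Flux; the JSONL layout (one JSON object per line) matches the preferences.jsonl file used under Usage below.

import json

def load_preference_pairs(path):
    """Read one JSON object per line, checking the required keys."""
    required = {"prompt", "chosen", "rejected"}
    pairs = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            missing = required - record.keys()
            if missing:
                raise ValueError(f"record is missing keys: {missing}")
            pairs.append(record)
    return pairs

pairs = load_preference_pairs("preferences.jsonl")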

When to Use

Best for:

- You already have offline preference data
- You want a simpler training pipeline than reward-model-based RLHF
- You want to avoid training a separate reward model

Compared to GRPO/PPO:

- Simpler setup
- No rollout generation during training
- Requires paired (chosen/rejected) preferences rather than a reward signal

Usage

from flux import FluxConfig, FluxTrainer

config = FluxConfig(
    model_path="Qwen/Qwen3-8B",
    algorithm="dpo",
    algorithm_config={
        "beta": 0.1,  # temperature parameter (see Key Parameters)
    },
)

trainer = FluxTrainer(config)
trainer.fit(
    prompts="preferences.jsonl",  # JSONL of prompt/chosen/rejected records
    data_format="preferences",
)
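
To train without holding a frozen reference model in memory, the documented reference_free flag can be enabled. This sketch assumes algorithm_config accepts the same keys as the YAML configuration above, as it does for beta:

config = FluxConfig(
    model_path="Qwen/Qwen3-8B",
    algorithm="dpo",
    algorithm_config={
        "beta": 0.1,
        "reference_free": True,  # drop the reference model (see Key Parameters)
    },
)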

Key Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| beta | 0.1 | Temperature; higher values penalize drift from the reference more strongly |
| reference_free | false | Skip the reference model entirely |
| label_smoothing | 0.0 | Smooth preference labels to tolerate annotation noise |
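
For label_smoothing, a common formulation (sometimes called conservative DPO) treats each preference label as flipped with probability eps; whether Flux implements exactly this variant is an assumption. Extending the dpo_loss sketch above:

import torch.nn.functional as F

# logits = beta * (chosen_logratio - rejected_logratio), as in dpo_loss above;
# eps corresponds to the label_smoothing parameter
def dpo_loss_smoothed(logits, eps=0.0):
    # With probability eps the chosen/rejected labels are assumed swapped
    return -((1 - eps) * F.logsigmoid(logits)
             + eps * F.logsigmoid(-logits)).mean()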

See Also