Fine-tuning with DPO¶
Learn how to use Direct Preference Optimization (DPO) for training with preference data.
Time: 30 minutes
Prerequisites: Basic RLHF Training, Preference dataset
Overview¶
DPO (Direct Preference Optimization) is an alternative to reward-model-based RLHF. Instead of training a reward model, DPO directly optimizes the policy using preference pairs.
In this tutorial, you'll learn:
- What DPO is and when to use it
- Preparing preference data
- Configuring DPO training
- Running and evaluating DPO
When to Use DPO¶
| Use DPO when... | Use GRPO/PPO when... |
|---|---|
| You have preference pairs | You have a reward model |
| Simple setup preferred | Maximum flexibility needed |
| Stable training important | Higher throughput needed |
Preparing Preference Data¶
DPO requires pairs of (chosen, rejected) responses:
import json

# Preference pairs
preferences = [
    {
        "prompt": "Explain quantum computing.",
        "chosen": "Quantum computing uses quantum bits (qubits) that can exist in multiple states simultaneously, unlike classical bits which are either 0 or 1. This allows quantum computers to solve certain problems much faster.",
        "rejected": "Quantum computing is complicated. It uses quantum stuff to compute things faster. I don't really understand it myself."
    },
    {
        "prompt": "Write a professional email.",
        "chosen": "Dear Mr. Smith,\n\nThank you for your inquiry. I would be happy to schedule a meeting at your earliest convenience.\n\nBest regards,\nJohn",
        "rejected": "hey john whats up, yeah we can meet whenever lol"
    },
    # Add more pairs...
]

# Save to JSONL (one preference pair per line)
with open("preferences.jsonl", "w") as f:
    for p in preferences:
        f.write(json.dumps(p) + "\n")

print(f"Created {len(preferences)} preference pairs")
Configuration¶
model_path: Qwen/Qwen3-8B
output_dir: ./outputs/dpo

sglang:
  base_url: http://localhost:8000

# Training settings
num_steps: 1000
batch_size: 8
learning_rate: 5.0e-7  # Lower LR for DPO

# DPO algorithm
algorithm:
  name: dpo
  beta: 0.1               # Temperature parameter
  reference_free: false   # Use reference model
  label_smoothing: 0.0    # Optional smoothing

# DPO doesn't need async (no rollouts)
adaptive_async:
  enabled: false

checkpoint:
  save_steps: 200
Running DPO Training¶
from flux import FluxConfig, FluxTrainer

# Load config
config = FluxConfig.from_yaml("dpo-config.yaml")

# Create trainer
trainer = FluxTrainer(config)

# Train with preference data
result = trainer.fit(
    prompts="preferences.jsonl",
    data_format="preferences",  # Indicates DPO format
)

print("Training complete!")
print(f"Final loss: {result.final_loss:.4f}")
DPO Loss Function¶
DPO optimizes the following objective:

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\]

Where:

- \(y_w\) = chosen (winning) response
- \(y_l\) = rejected (losing) response
- \(\beta\) = temperature (controls preference strength)
- \(\pi_{\mathrm{ref}}\) = reference model (the original, frozen policy)
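To make the objective concrete, here is a minimal, framework-agnostic sketch of the per-pair loss computed from summed token log-probabilities. It is illustrative only: Flux computes this internally, and the numeric inputs below are made up.

import math

def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit rewards: beta-scaled log-ratios against the reference model
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)

    # Loss is the negative log-sigmoid of the reward margin
    margin = chosen_reward - rejected_reward
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
    return loss, margin

# Made-up log-probs: the policy already favors the chosen response slightly
loss, margin = dpo_pair_loss(-42.0, -55.0, -45.0, -54.0, beta=0.1)
print(f"loss={loss:.4f}, reward_margin={margin:.4f}")

Note that the margin scales linearly with \(\beta\), which is why larger values push the model harder toward the chosen responses (see the table below).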
Key Parameters¶
Beta (Temperature)¶
Controls how strongly the policy is pushed to prefer the chosen response over the rejected one:
| Beta | Effect |
|---|---|
| 0.01 | Very weak preference learning |
| 0.1 | Standard (recommended) |
| 0.5 | Strong preference learning |
| 1.0 | Very strong, may be unstable |
Reference-Free DPO¶
Skip the reference model (simpler but less stable); the config change is sketched below.
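Assuming the same algorithm block from the configuration above, the only change is the reference_free flag:

algorithm:
  name: dpo
  beta: 0.1
  reference_free: true   # Skip the reference model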
Monitoring DPO Training¶
@trainer.add_step_callback
def log_dpo_metrics(result):
    metrics = result.metrics
    print(f"Step {result.step}: "
          f"loss={metrics['loss']:.4f}, "
          f"chosen_reward={metrics.get('chosen_reward', 0):.3f}, "
          f"rejected_reward={metrics.get('rejected_reward', 0):.3f}, "
          f"reward_margin={metrics.get('reward_margin', 0):.3f}")
What to Look For¶
| Metric | Good Sign | Warning Sign |
|---|---|---|
| `loss` | Decreasing | Stuck or increasing |
| `reward_margin` | Increasing (> 0) | Negative or decreasing |
| `chosen_reward` | Higher than rejected | Similar to rejected |
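As one way to act on these signals, the sketch below extends the logging callback above to warn when the reward margin stays negative over a window of recent steps (the metric and callback names follow the monitoring example; the window size and threshold are arbitrary illustrative choices).

from collections import deque

# Keep the last 50 reward margins and warn if most of them are negative
recent_margins = deque(maxlen=50)

@trainer.add_step_callback
def warn_on_negative_margin(result):
    margin = result.metrics.get("reward_margin", 0.0)
    recent_margins.append(margin)
    if len(recent_margins) == recent_margins.maxlen:
        negative = sum(1 for m in recent_margins if m < 0)
        if negative > recent_margins.maxlen // 2:
            print(f"Warning at step {result.step}: reward_margin was negative "
                  f"on {negative}/{recent_margins.maxlen} recent steps; "
                  "check data quality or lower the learning rate.")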
Evaluation¶
After training, sanity-check the model by generating a response to a test prompt:
from flux import FluxTrainer
trainer = FluxTrainer(config)
trainer.load_checkpoint("outputs/dpo/best")
# Test prompt
test_prompt = "Explain machine learning."
# Generate response
response = trainer.generate(test_prompt)
print(f"Trained model: {response}")
Troubleshooting¶
Loss not decreasing

Solutions:

- Increase beta (0.1 → 0.2)
- Reduce learning rate
- Check data quality (clear chosen/rejected distinction)

Model degrades

Solutions:

- Add KL regularization
- Use the reference model (reference_free: false)
- Reduce training steps

Overfitting to preferences

Solutions:

- Add more diverse preference pairs
- Reduce number of epochs
- Use label smoothing (see the sketch below)
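For the label-smoothing fix, a small value is a common starting point; the snippet below reuses the algorithm block from the configuration above (0.1 is an illustrative choice, not a Flux default):

algorithm:
  name: dpo
  beta: 0.1
  label_smoothing: 0.1   # Soften hard 0/1 preference labels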
DPO vs RLHF Comparison¶
| Aspect | DPO | GRPO/PPO |
|---|---|---|
| Data needed | Preference pairs | Prompts + reward model |
| Training complexity | Lower | Higher |
| Compute cost | Lower | Higher (needs rollouts) |
| Flexibility | Lower | Higher |
| Stability | Higher | Depends on reward |
Next Steps¶
- Custom Rewards - For RLHF-style training
- Algorithms Guide - DPO deep dive
- Production Deployment - Scale up