
PPO (Proximal Policy Optimization)

The classic, battle-tested algorithm for RLHF training.

Overview

PPO prevents large policy updates by clipping the probability ratio in its surrogate objective, which keeps training stable even with noisy reward signals.

Loss Function

\[ L^{CLIP} = -\mathbb{E}\left[\min\left(r_t A_t, \text{clip}(r_t, 1-\epsilon, 1+\epsilon) A_t\right)\right] \]

Where:

- \(r_t = \frac{\pi_\theta(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}\) is the probability ratio
- \(A_t\) is the advantage estimate
- \(\epsilon\) is the clip range (default: 0.2)
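
For concreteness, here is a minimal PyTorch sketch of the clipped loss (a generic reference implementation, not flux's internal code); `logprobs`, `old_logprobs`, and `advantages` are illustrative per-token tensors:

import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, clip_range=0.2):
    # r_t = pi_theta(a|s) / pi_old(a|s), computed from log-probabilities
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Negative sign: minimizing this loss maximizes the clipped objective
    return -torch.min(unclipped, clipped).mean()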

Configuration

algorithm:
  name: ppo
  clip_range: 0.2         # Clipping parameter
  clip_range_vf: null     # Value function clip (optional)
  entropy_coef: 0.01      # Entropy bonus
  value_coef: 0.5         # Value loss weight
  kl_coef: 0.0            # KL penalty (optional)
  kl_target: 0.01         # Target KL for the adaptive penalty
  gae_lambda: 0.95        # GAE lambda
  gamma: 1.0              # Discount factor
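
The gae_lambda and gamma fields feed into Generalized Advantage Estimation (GAE), which produces the advantage estimates \(A_t\) used in the loss above. Below is a minimal sketch of the standard GAE recursion (a generic reference, not flux's internals; the function name and tensor shapes are assumptions):

import torch

def compute_gae(rewards, values, gamma=1.0, gae_lambda=0.95):
    # rewards: shape (T,); values: shape (T + 1,), where the last entry
    # is the bootstrap value for the state after the final step.
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        last_gae = delta + gamma * gae_lambda * last_gae
        advantages[t] = last_gae
    returns = advantages + values[:-1]  # targets for the value head
    return advantages, returns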

When to Use

Best for:

- Maximum training stability
- When you have a value function
- General-purpose RLHF

Compared to GRPO:

- More stable but slower
- Requires value function training
- Better for single-sample generation

Usage

from flux import FluxConfig, FluxTrainer

config = FluxConfig(
    model_path="Qwen/Qwen3-8B",
    algorithm="ppo",
    algorithm_config={
        "clip_range": 0.2,
        "entropy_coef": 0.01,
        "kl_coef": 0.1,
    }
)

trainer = FluxTrainer(config)
trainer.fit(prompts="data.jsonl")

Key Parameters

Parameter     Default  Description
clip_range    0.2      Clipping range for the probability ratio
entropy_coef  0.01     Entropy bonus weight
kl_coef       0.0      KL penalty weight
gae_lambda    0.95     GAE smoothing parameter
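
Setting kl_coef together with kl_target implies an adaptive KL penalty. The exact controller flux uses is not specified here; the sketch below follows the common adaptive scheme from the original PPO paper, where the coefficient is scaled up or down whenever the measured KL drifts outside a band around the target:

def update_kl_coef(kl_coef, observed_kl, kl_target=0.01):
    # Adaptive KL update in the style of Schulman et al. (2017);
    # assumed behavior, not flux's documented implementation.
    if observed_kl > 1.5 * kl_target:
        kl_coef *= 2.0   # policy drifted too far: strengthen the penalty
    elif observed_kl < kl_target / 1.5:
        kl_coef *= 0.5   # policy barely moved: relax the penalty
    return kl_coef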

See Also