# GRPO (Group Relative Policy Optimization)
The default algorithm in Flux. Groups multiple responses per prompt and uses relative rankings for advantage estimation.
## Overview

GRPO normalizes rewards within groups of responses to the same prompt, providing a more stable training signal than per-sample rewards.

**Key idea:** Instead of using absolute rewards, use relative rewards within a group.
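As a quick illustration (using the population standard deviation and omitting the small \(\epsilon\)): if one prompt's four responses receive rewards 1, 3, 3, and 5, the group mean is 3 and the group standard deviation is \(\sqrt{2} \approx 1.41\), so the normalized advantages are roughly -1.41, 0, 0, and 1.41. The absolute reward scale drops out; all that matters is how each response compares to the rest of its group.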
## How It Works

```mermaid
graph LR
    A[Prompt] --> B[Generate N responses]
    B --> C[Compute rewards]
    C --> D[Normalize within group]
    D --> E[Train on normalized advantages]
```
- Generate `group_size` responses for each prompt
- Compute rewards for all responses
- Normalize within each group: \(A_i = \frac{r_i - \mu}{\sigma}\)
- Train on normalized advantages
## Advantage Formula

$$
A_i = \frac{r_i - \mu_{group}}{\sigma_{group} + \epsilon}
$$

Where:

- \(r_i\) = reward for response \(i\)
- \(\mu_{group}\) = mean reward in the group
- \(\sigma_{group}\) = standard deviation of rewards in the group
- \(\epsilon\) = small constant for stability
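In code, the normalization is a small per-group reduction. The sketch below is illustrative rather than Flux's actual implementation; the function name and the assumption that rewards arrive as a flat tensor ordered prompt by prompt (all `group_size` responses for one prompt stored consecutively) are ours.

```python
import torch


def group_normalized_advantages(
    rewards: torch.Tensor,  # flat tensor of shape (num_prompts * group_size,)
    group_size: int = 4,
    eps: float = 1e-6,
) -> torch.Tensor:
    """Compute A_i = (r_i - mu_group) / (sigma_group + eps) for each response."""
    grouped = rewards.view(-1, group_size)      # (num_prompts, group_size)
    mean = grouped.mean(dim=-1, keepdim=True)   # per-group mean reward
    std = grouped.std(dim=-1, keepdim=True)     # per-group reward std
    advantages = (grouped - mean) / (std + eps)
    return advantages.view(-1)


# Example: two prompts with four sampled responses each.
rewards = torch.tensor([0.1, 0.4, 0.7, 0.9, 0.0, 0.0, 1.0, 1.0])
print(group_normalized_advantages(rewards, group_size=4))
```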
## Loss Function
Combines PPO-style clipping with KL regularization:
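The exact objective is not written out on this page. A standard formulation consistent with the parameters below (`clip_range` for the clipping width, `beta` for the KL penalty) is:

$$
\mathcal{L}(\theta) = -\,\mathbb{E}_i\Big[\min\big(\rho_i A_i,\ \mathrm{clip}(\rho_i,\, 1 - \epsilon_{clip},\, 1 + \epsilon_{clip})\, A_i\big)\Big] \;+\; \beta\,\mathrm{KL}\big(\pi_\theta \,\|\, \pi_{ref}\big),
\qquad \rho_i = \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{old}}(y_i \mid x)}
$$

A minimal token-level sketch in PyTorch, assuming the usual log-probability inputs (the function name, argument layout, and masked-mean reduction are our assumptions, not Flux's API):

```python
import torch


def grpo_policy_loss(
    log_probs: torch.Tensor,      # log pi_theta for sampled response tokens
    old_log_probs: torch.Tensor,  # log pi_theta_old, recorded during rollout
    ref_log_probs: torch.Tensor,  # log pi_ref from the frozen reference model
    advantages: torch.Tensor,     # group-normalized advantages, broadcast per token
    response_mask: torch.Tensor,  # 1 on response tokens, 0 elsewhere
    clip_range: float = 0.2,
    beta: float = 0.01,
) -> torch.Tensor:
    # Policy ratio rho = pi_theta / pi_theta_old.
    ratio = torch.exp(log_probs - old_log_probs)
    # PPO-style clipped surrogate: take the more pessimistic of the two branches.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    policy_loss = -torch.min(unclipped, clipped)
    # Simple (k1) KL penalty toward the reference policy.
    kl_penalty = log_probs - ref_log_probs
    loss = policy_loss + beta * kl_penalty
    # Average over response tokens only.
    return (loss * response_mask).sum() / response_mask.sum().clamp(min=1)
```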
## Configuration

```yaml
algorithm:
  name: grpo
  clip_range: 0.2      # PPO clip range
  entropy_coef: 0.01   # Entropy bonus
  normalize_advantages: true
```
### Key Parameters

| Parameter | Default | Description |
|---|---|---|
| `clip_range` | `0.2` | Clipping for the policy ratio |
| `beta` | `0.01` | KL penalty coefficient |
| `group_size` | `4` | Responses per prompt |
| `normalize_by_std` | `true` | Divide by the group std |
## Usage

```python
from flux import FluxConfig, FluxTrainer

config = FluxConfig(
    model_path="Qwen/Qwen3-8B",
    algorithm="grpo",
    algorithm_config={
        "clip_range": 0.2,
        "group_size": 4,
    },
)

trainer = FluxTrainer(config)
trainer.fit(prompts="data.jsonl")
```
## When to Use

**Best for:**

- Multi-sample generation tasks
- High-variance rewards
- When you can afford 4+ generations per prompt

**Advantages over PPO:**

- More stable gradients
- Better sample efficiency
- Less sensitive to reward scale
## Vectorized Implementation

For efficiency, Flux includes a vectorized variant:

```python
@register_adv_estimator("grpo_vectorized")
def compute_grpo_advantage_vectorized(
    token_level_rewards: torch.Tensor,
    response_mask: torch.Tensor,
    group_size: int = 4,
    **kwargs,
) -> tuple[torch.Tensor, torch.Tensor]:
    # Assumes batch is pre-organized as consecutive groups
    ...
```
This is roughly 2× faster when the batch already arrives organized into consecutive groups.
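The body of the estimator is elided above. A reshape-based sketch of what such a vectorized variant could look like is shown below; it is not Flux's implementation, and the reduction of token-level rewards to one scalar per response is an assumption on our part:

```python
import torch


def grpo_advantage_vectorized_sketch(
    token_level_rewards: torch.Tensor,  # (batch, seq_len)
    response_mask: torch.Tensor,        # (batch, seq_len), 1 on response tokens
    group_size: int = 4,
    eps: float = 1e-6,
) -> tuple[torch.Tensor, torch.Tensor]:
    # One scalar reward per response (sum over response tokens).
    scores = (token_level_rewards * response_mask).sum(dim=-1)   # (batch,)
    # Consecutive rows belong to the same prompt, so reshape into groups.
    grouped = scores.view(-1, group_size)                        # (num_groups, group_size)
    mean = grouped.mean(dim=-1, keepdim=True)
    std = grouped.std(dim=-1, keepdim=True)
    advantages = ((grouped - mean) / (std + eps)).view(-1)       # (batch,)
    # Broadcast each response's advantage over its own tokens.
    token_advantages = advantages.unsqueeze(-1) * response_mask  # (batch, seq_len)
    # With outcome-level rewards, the same tensor typically serves as both
    # advantages and returns.
    return token_advantages, token_advantages
```

Because the whole computation is a reshape plus two reductions, no Python-level loop over groups is needed, which is where the speedup comes from.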
## Metrics

GRPO logs these metrics:

| Metric | Description |
|---|---|
| `actor/grpo_kl` | Approximate KL divergence |
| `actor/clip_frac` | Fraction of clipped samples |
| `actor/entropy` | Policy entropy |
| `actor/adv_mean` | Mean advantage |
| `actor/adv_std` | Advantage std |
| `actor/ratio_mean` | Mean policy ratio |
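For reference, two of these diagnostics have conventional definitions; the sketch below uses those conventions and is not necessarily Flux's exact formula:

```python
import torch


def grpo_diagnostics(
    log_probs: torch.Tensor,      # new-policy token log-probs
    old_log_probs: torch.Tensor,  # rollout-policy token log-probs
    clip_range: float = 0.2,
) -> dict[str, float]:
    log_ratio = log_probs - old_log_probs
    ratio = torch.exp(log_ratio)
    return {
        # k1 estimator of KL(pi_old || pi_theta): a common "approximate KL" diagnostic.
        "grpo_kl": (-log_ratio).mean().item(),
        # Fraction of tokens whose ratio falls outside the clipping window.
        "clip_frac": ((ratio - 1.0).abs() > clip_range).float().mean().item(),
        "ratio_mean": ratio.mean().item(),
    }
```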
## See Also
- PPO - Classic alternative
- Algorithms Overview - All algorithms
- Custom Algorithms - Create your own