Custom Reward Functions¶
Learn how to build reward functions tailored to your specific training objectives.
Time: 20 minutes
Prerequisites: Basic RLHF Training
Overview¶
Reward functions are the heart of RLHF training. They define what "good" means for your model. In this tutorial, you'll learn:
- How reward functions work in Flux
- Built-in reward functions
- Creating custom reward functions
- Combining multiple rewards
- Best practices and debugging
How Rewards Work¶
```mermaid
graph LR
    A[Trajectory] --> B[Reward Function]
    B --> C[Score 0-1]
    C --> D[Policy Update]
```
A reward function takes a Trajectory (prompt + response) and returns a score:
```python
from flux.rewards import RewardFunction, RewardOutput
from flux.core.trajectory import Trajectory

class MyReward(RewardFunction):
    def compute_reward(self, trajectory: Trajectory) -> RewardOutput:
        # Access the response
        response = trajectory.response

        # Compute your score (0 to 1)
        score = self._compute_score(response)

        return RewardOutput(reward=score)
```
Built-in Rewards¶
Flux provides several ready-to-use reward functions:
LengthReward¶
Rewards responses of a target length:
```python
from flux.rewards import LengthReward

# Reward responses around 200 words
reward = LengthReward(target_length=200, reward_type="gaussian")
```
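For intuition, a Gaussian-style length reward peaks at the target and decays smoothly on either side. The sketch below is illustrative only, not Flux's implementation; the word-based length measure and the `sigma` width parameter are assumptions:

```python
import math

def gaussian_length_score(response: str, target_length: int = 200, sigma: float = 50.0) -> float:
    """Illustrative sketch: 1.0 at the target word count, decaying smoothly
    as the response gets shorter or longer."""
    length = len(response.split())
    return math.exp(-((length - target_length) ** 2) / (2 * sigma ** 2))
```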
FormatReward¶
Rewards structured responses:
```python
from flux.rewards import FormatReward

reward = FormatReward(
    required_sections=["Introduction", "Conclusion"],
    forbidden_patterns=["I don't know", "As an AI"],
)
```
KeywordReward¶
Rewards presence of specific keywords:
```python
from flux.rewards import KeywordReward

reward = KeywordReward(
    required_keywords=["because", "therefore"],
    bonus_keywords=["example", "specifically"],
    penalty_keywords=["maybe", "perhaps"],
)
```
FunctionReward¶
Quick custom reward from a function:
```python
from flux.rewards import FunctionReward

def my_scorer(trajectory):
    return 1.0 if "answer" in trajectory.response.lower() else 0.0

reward = FunctionReward(fn=my_scorer)
```
Creating Custom Rewards¶
Basic Custom Reward¶
```python
from flux.rewards import RewardFunction, RewardOutput
from flux.core.trajectory import Trajectory

class PoliteReward(RewardFunction):
    """Reward polite and respectful responses."""

    def __init__(self, weight: float = 1.0):
        self.weight = weight
        self.polite_words = ["please", "thank you", "appreciate", "glad"]
        self.rude_words = ["stupid", "dumb", "idiot", "whatever"]

    def compute_reward(self, trajectory: Trajectory) -> RewardOutput:
        response = trajectory.response.lower()

        # Count polite words
        polite_count = sum(1 for w in self.polite_words if w in response)

        # Count rude words (negative)
        rude_count = sum(1 for w in self.rude_words if w in response)

        # Compute score
        score = min(1.0, polite_count * 0.2) - rude_count * 0.3
        score = max(0.0, min(1.0, score))  # Clamp to [0, 1]

        return RewardOutput(
            reward=score * self.weight,
            metadata={
                "polite_count": polite_count,
                "rude_count": rude_count,
            },
        )
```
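A quick spot check helps confirm the scoring behaves as intended. The `Trajectory(response=...)` construction below mirrors the testing example later in this tutorial; adjust it to whatever fields your trajectories actually carry:

```python
traj = Trajectory(response="Thank you, I appreciate the detailed question!")
result = PoliteReward().compute_reward(traj)

# Two polite phrases match ("thank you", "appreciate") and no rude words,
# so this should print a score of 0.40 with polite_count=2, rude_count=0.
print(result.reward, result.metadata)
```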
Using External Models¶
```python
from flux.rewards import RewardFunction, RewardOutput
from flux.core.trajectory import Trajectory
from transformers import pipeline

class SentimentReward(RewardFunction):
    """Reward positive sentiment responses."""

    def __init__(self):
        self.classifier = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )

    def compute_reward(self, trajectory: Trajectory) -> RewardOutput:
        result = self.classifier(trajectory.response[:512])[0]
        if result["label"] == "POSITIVE":
            score = result["score"]
        else:
            score = 1.0 - result["score"]
        return RewardOutput(reward=score)
```
Combining Rewards¶
Use CompositeReward to combine multiple reward signals:
```python
from flux.rewards import CompositeReward, LengthReward, KeywordReward

# Combine with weights
reward = CompositeReward([
    (LengthReward(target_length=150), 0.3),               # 30% weight
    (KeywordReward(required_keywords=["because"]), 0.3),  # 30% weight
    (PoliteReward(), 0.4),                                 # 40% weight
])
```
The final reward is a weighted sum of all components.
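Conceptually, the combination works like the sketch below. This is a simplified stand-in, not `CompositeReward`'s actual implementation; how Flux merges metadata or normalizes weights is not shown here:

```python
from flux.rewards import RewardOutput

def weighted_sum_reward(trajectory, weighted_rewards):
    """Illustrative only: sum each component's reward scaled by its weight."""
    total = sum(
        weight * fn.compute_reward(trajectory).reward
        for fn, weight in weighted_rewards
    )
    return RewardOutput(reward=total)
```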
Best Practices¶
1. Normalize to [0, 1]¶
```python
# Good: Normalized score
score = max(0.0, min(1.0, raw_score))

# Bad: Unbounded score
score = raw_score  # Could be -100 or +1000
```
2. Avoid Sparse Rewards¶
```python
# Bad: Binary reward (sparse)
score = 1.0 if perfect_response else 0.0

# Good: Gradual reward (dense)
score = 0.3 * has_greeting + 0.3 * has_content + 0.4 * has_conclusion
```
3. Log Metadata¶
```python
return RewardOutput(
    reward=score,
    metadata={
        "length": len(response),
        "keyword_matches": matches,
        "sub_scores": {"format": 0.3, "content": 0.7},
    },
)
```
4. Test Your Reward¶
```python
# Test with example responses
test_cases = [
    ("Good response with explanation because...", 0.8),
    ("Bad", 0.1),
    ("Medium length response", 0.5),
]

for response, expected in test_cases:
    traj = Trajectory(response=response)
    result = reward.compute_reward(traj)
    print(f"Score: {result.reward:.2f} (expected ~{expected})")
```
Debugging Rewards¶
Check Reward Distribution¶
```python
import numpy as np

# During training, log reward statistics
@trainer.add_step_callback
def log_reward_stats(result):
    rewards = result.metrics.get("rewards", [])
    if rewards:
        print(f"Reward: mean={np.mean(rewards):.3f}, "
              f"std={np.std(rewards):.3f}, "
              f"min={np.min(rewards):.3f}, "
              f"max={np.max(rewards):.3f}")
```
Common Issues¶
| Issue | Symptom | Solution |
|---|---|---|
| Reward hacking | High reward, bad outputs | Add diversity penalties (see the sketch below) |
| Sparse rewards | Slow learning | Add intermediate rewards |
| High variance | Unstable training | Normalize, reduce LR |
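As one example of a diversity penalty against reward hacking, you can discount responses that repeat themselves heavily. The helper below is a hypothetical sketch; the distinct-word ratio is just one of many possible measures:

```python
def diversity_penalty(response: str) -> float:
    """Illustrative sketch: multiplier in [0, 1] based on the ratio of
    distinct words to total words; highly repetitive text scores lower."""
    words = response.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)

# Example: scale your base score by the penalty
# score = base_score * diversity_penalty(response)
```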
Next Steps¶
- Basic RLHF Training - Use your reward in training
- Multi-GPU Training - Scale up training
- Algorithms Guide - Choose the right algorithm