My recent research focuses on the manual design and the learning of reward functions. I am specifically interested in reward functions that are aligned with human stakeholders' interests. The resulting publications fall into two categories.

Manual reward function design
We investigate how experts tend to design reward functions in practice (by hand) and how they should do so.

In contrast to the recently pervasive use of reward functions that are myopically optimized in sequential tasks (mainly for LLM fine-tuning via RLHF), this work focuses on reward that is summed over multiple actions for non-myopic optimization. Such non-myopic optimization can yield superhuman task performance, as AlphaGo demonstrated. This work also focuses on so-called environment reward functions, each of which exists before any reward shaping and communicates a human-aligned measure of task performance.
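To make the distinction concrete, here is a minimal sketch in standard RL notation (the symbols s_t, a_t, r_t, gamma, and pi are conventional and are not drawn from the publications themselves): myopic optimization treats each reward in isolation, whereas non-myopic optimization maximizes the cumulative, optionally discounted, reward over the task.

\[
\text{myopic:}\;\; a_t \in \arg\max_{a} \, \mathbb{E}\big[\, r_t \mid s_t, a \,\big]
\qquad\qquad
\text{non-myopic:}\;\; \pi^{*} \in \arg\max_{\pi} \, \mathbb{E}_{\pi}\Big[ \textstyle\sum_{t=0}^{T} \gamma^{t} r_t \Big]
\]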

Inferring reward functions from human input
This RLHF research focuses particularly on the human side, such as critiquing assumptions about why people give certain preference labels.