My recent research has focused on the manual design or learning of reward functions. I’m specifically interested in reward functions that are aligned with human stakeholders’ interests. The resulting publications fall into two categories.
Manual reward function design
We investigate how experts tend to design reward functions in practice (by hand) and how they should do so.
- Reward (Mis)design for Autonomous Vehicles (AIJ 2023; arxiv 2021)
- The Perils of Trial-and-Error Reward Design: Misdesign through Overfitting and Invalid Task Specifications (AAAI 2023)
- How to Specify Reinforcement Learning Objectives (2024)
Inferring reward functions from human input
This RLHF research is particularly focused on the human part, such as critiquing assumptions about why people give certain preference labels.
- The EMPATHIC framework for task learning from implicit human feedback (CoRL 2020)
- Models of human preference for learning reward functions (TMLR 2024; arxiv 2022)
- Learning Optimal Advantage from Preferences and Mistaking it for Reward (AAAI 2024; arxiv 2023)
- Contrastive Preference Learning: Learning from Human Feedback without RL (ICLR 2024; arxiv 2023) (note: reward specification is only implicit)
- Influencing Humans to Conform to Preference Models for RLHF (arxiv 2025)