My recent research has focused on the manual design and the learning of reward functions, with a specific interest in reward functions that are aligned with human stakeholders' interests. The resulting publications fall into two categories.

Manual reward function design
We investigate how experts design reward functions by hand in practice and how they should do so.

Inferring reward functions from human input
This research on reinforcement learning from human feedback (RLHF) focuses particularly on the human side of the problem, such as critiquing assumptions about why people give certain preference labels.
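As one illustration of the kind of assumption under scrutiny (a standard formulation from the RLHF literature, not necessarily the exact model examined in these papers), preference labels over two trajectory segments σ₁ and σ₂ are commonly assumed to follow a Boltzmann (Bradley-Terry) model of the segments' summed rewards:

P(σ₁ ≻ σ₂) = exp( Σₜ r(s¹ₜ, a¹ₜ) ) / [ exp( Σₜ r(s¹ₜ, a¹ₜ) ) + exp( Σₜ r(s²ₜ, a²ₜ) ) ]

Questioning whether people actually label preferences this way, and what they attend to instead, is central to this line of work.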