My recent research focuses on the manual design and the learning of reward functions. I am specifically interested in reward functions that are aligned with human stakeholders' interests. The resulting publications fall into two categories.

Manual reward function design
We investigate how experts tend to design reward functions in practice (by hand) and how they should do so.

In contrast to the recently pervasive use of reward functions that are myopically optimized in sequential tasks (mainly for LLM fine-tuning via RLHF), this work focuses on reward that is summed over multiple actions for non-myopic optimization. Such non-myopic optimization can yield superhuman task performance, as AlphaGo demonstrated. This work also focuses on so-called environment reward functions, each of which exists before any reward shaping and communicates a human-aligned measure of task performance.
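To make the distinction concrete, here is a minimal sketch in standard RL notation (the symbols s_t, a_t, r_t, gamma, and pi are conventional and are not drawn from the publications themselves): myopic optimization treats each reward in isolation, whereas non-myopic optimization maximizes the cumulative, optionally discounted, reward over the task.

\[
\text{myopic:}\;\; a_t \in \arg\max_{a} \, \mathbb{E}\big[\, r_t \mid s_t, a \,\big]
\qquad\qquad
\text{non-myopic:}\;\; \pi^{*} \in \arg\max_{\pi} \, \mathbb{E}_{\pi}\Big[ \textstyle\sum_{t=0}^{T} \gamma^{t} r_t \Big]
\]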

Inferring reward functions from human input
This RLHF research focuses particularly on the human side, such as critiquing assumptions about why people give certain preference labels.