By W. Bradley Knox and James MacGlashan

Find the paper here. arXiv version coming soon.

We discuss how to specify reinforcement learning (RL) objectives in practice through careful design of reward functions and discounting. We focus specifically on defining a human-aligned objective for the RL problem, and we argue that reward shaping and lowering the discount factor, if desired, are part of the RL solution—not the problem—and should be deferred to a second step, after the objective itself has been specified. We provide tools for diagnosing misalignment in RL objectives, such as finding preference mismatches between the RL objective and human judgments and examining the indifference point between risky and safe trajectory lotteries. We discuss common pitfalls that can lead to misalignment, including naive reward shaping, trial-and-error reward tuning, and improper handling of discount factors. We also sketch candidate best practices for designing interpretable, aligned RL objectives and discuss open problems that hinder the design of aligned RL objectives in practice.
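As a rough illustration of the indifference-point diagnostic, one can solve for the probability at which an RL objective values a risky trajectory lottery equally with a safe trajectory. The sketch below is a minimal assumption of how such a check might look; the function name and the example return values are illustrative, not from the paper.

```python
# Hypothetical sketch of an indifference-point check between a safe
# trajectory and a risky trajectory lottery. All names and numbers
# here are illustrative assumptions, not taken from the paper.

def indifference_point(g_safe: float, g_best: float, g_worst: float) -> float:
    """Return the probability p of the risky lottery's best outcome at which
    p * g_best + (1 - p) * g_worst == g_safe, i.e., the point where the
    objective is indifferent between the safe trajectory and the lottery."""
    if not (g_worst < g_safe < g_best):
        raise ValueError("expected g_worst < g_safe < g_best")
    return (g_safe - g_worst) / (g_best - g_worst)

# Illustrative example: safe return 80, successful risky return 100,
# catastrophic risky return -1000 (e.g., a crash penalty).
p = indifference_point(80.0, 100.0, -1000.0)
# The objective prefers the risk whenever the success probability exceeds p;
# a designer can then ask whether that threshold matches human judgments.
```

A designer uncomfortable with the implied risk threshold has direct evidence that the reward function (e.g., the magnitude of the catastrophe penalty) is misaligned with human preferences.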