Videos of TAMER-based training of interactive robot navigation can be found here.


The teaching strategy of painting

As described in our ICSR 2013 paper, Training a Robot via Human Feedback: A Case Study, the training strategy evolved while teaching the five different task behaviors to Nexi with TAMER. We refer to the final strategy as “painting”.

Here we explain this teaching strategy, which specifically takes advantage of the interactivity of the task, making a failed training session less likely and the training more efficient overall. The concept of painting is best understood by watching the beginning of the video of training the keep-conversational-distance behavior, which shows a representation of the human reward model and incoming reward instances throughout training. We note that changes in teaching strategy involve only changes in the reward given by the trainer; the agent’s algorithm is unchanged.

The trainer’s initial strategy was to position the artifact and wait for the robot’s randomly chosen initial action. If this initial action was correct for the current artifact position, then positive reward was given. If it was incorrect, negative reward was given until the correct action was chosen. When a previously “punished” action was later desired, the prediction of human reward was often so low that the robot would not try the action until all other actions were predicted to receive even lower rewards. This downward spiral towards negative predictions would quickly reach a point where the trainer could not elicit an action. We partially addressed this issue through a type of distance-based biasing (described at the beginning of Section 4 of the ICSR 2013 paper), which reduces the strength of generalization in Ĥ; however, a number of negative rewards during early training could still be problematic.
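To make this downward spiral concrete, below is a minimal runnable sketch, not the paper's implementation: it assumes a toy one-dimensional state, a nearest-neighbor form of Ĥ, and an exponential distance penalty standing in for the distance-based biasing; see Section 4 of the ICSR 2013 paper for the actual model and the actual bias.

```python
import math

ACTIONS = ["forward", "backward", "stay"]

class HHat:
    """Toy per-action nearest-neighbor estimate of human reward over a 1-D state."""

    def __init__(self, distance_penalty=0.0):
        self.samples = {a: [] for a in ACTIONS}   # (state, reward) pairs per action
        self.distance_penalty = distance_penalty  # strength of the illustrative bias

    def update(self, state, action, reward):
        self.samples[action].append((state, reward))

    def predict(self, state, action):
        if not self.samples[action]:
            return 0.0  # untrained actions predict 0, as noted in the text
        s, r = min(self.samples[action], key=lambda sr: abs(sr[0] - state))
        # Illustrative distance-based bias: shrink the prediction toward 0 far
        # from labeled data, weakening how far early rewards generalize.
        return r * math.exp(-self.distance_penalty * abs(s - state))

    def greedy_action(self, state):
        return max(ACTIONS, key=lambda a: self.predict(state, a))

# Without the bias, one early punishment of "forward" at state 0.0 generalizes
# to every state, so "forward" is avoided everywhere until all other actions
# are predicted to receive even lower reward.
h = HHat(distance_penalty=0.0)
h.update(0.0, "forward", -1.0)
print(h.predict(5.0, "forward"), h.greedy_action(5.0))  # -1.0, and not "forward"

# With the bias, the punishment fades with distance from where it was given,
# so far from that state "forward" again looks about as good as untried actions.
h = HHat(distance_penalty=1.0)
h.update(0.0, "forward", -1.0)
print(h.predict(5.0, "forward"))  # roughly -0.007, close to 0
```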

As these training sessions progressed, a different teaching strategy emerged, which we call “painting”; it both avoids the dangers of giving too much early negative reward and, in the trainer’s subjective experience, reduces the time needed to teach. In this strategy, the trainer positively rewards the initial action and then changes the state to make that initial action correct with respect to the desired final behavior. Since all actions other than the initial one elicit reward predictions of 0 in all states, the positive predictions for the initial action cause the robot to continue to follow that action. (For this specific implementation, the robot will continue to choose its initial action until receiving feedback, so the initial positive reward is unnecessary. However, we present a version of painting that is effective with this implementation detail and, we believe, without it.)

Moving the artifact through all of these correct states, the trainer “paints” this region of states with positive reward, further reinforcing the action in a widening range of states. After fully painting the state space where the first action is correct, giving a strong positive bias to predictions of reward for that action in any state, the trainer then changes the state to make the action incorrect and gives a few negative reward signals. In response to this negative reward, the agent tries a new action, and the trainer again paints the state space where this action is correct, proceeding in this manner until all actions have been painted. At this point, the trainer fine-tunes the taught behavior through a less-specified strategy that mixes some aspects of painting with the original teaching strategy. This positive-reward painting strategy relies on the presence of interactive features, which allow the trainer to change the state to make an on-going action correct or incorrect.
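The following self-contained sketch illustrates the painting loop on a toy one-dimensional stand-in for the keep-conversational-distance task. The state encoding (a signed distance error), the three-action set, and the nearest-neighbor reward model are assumptions made for illustration; only the trainer's pattern of giving reward follows the strategy described above.

```python
ACTIONS = ["forward", "backward", "stay"]

samples = {a: [] for a in ACTIONS}  # per-action (state, reward) training pairs

def predict(state, action):
    """Nearest-neighbor prediction of human reward; 0 for untrained actions."""
    if not samples[action]:
        return 0.0
    _, reward = min(samples[action], key=lambda sr: abs(sr[0] - state))
    return reward

def greedy_action(state):
    return max(ACTIONS, key=lambda a: predict(state, a))

def desired_action(state):
    """The behavior being taught: drive the distance error toward zero."""
    if state > 0.5:
        return "forward"
    if state < -0.5:
        return "backward"
    return "stay"

# States the trainer sweeps through by moving the artifact; within each list,
# the named action is the correct one.
regions = {
    "forward":  [2.0, 1.5, 1.0, 0.6],
    "backward": [-2.0, -1.5, -1.0, -0.6],
    "stay":     [0.4, 0.2, 0.0, -0.2, -0.4],
}

action = greedy_action(regions["forward"][0])  # robot's arbitrary initial action

for target in ["forward", "backward", "stay"]:
    # Move the artifact so that `target` becomes the correct action. If the
    # robot is still doing something else, give a few negative signals at
    # this boundary state until it switches.
    boundary = regions[target][0]
    while action != target:
        samples[action].append((boundary, -1.0))
        action = greedy_action(boundary)
    # Paint: sweep the artifact through the region where `target` is correct,
    # rewarding the on-going action positively in each state. (As noted above,
    # the robot keeps following the positively rewarded action during the sweep.)
    for state in regions[target]:
        samples[target].append((state, +1.0))

# After painting, the greedy choice matches the desired behavior.
for s in [1.2, -1.2, 0.1]:
    print(s, greedy_action(s), desired_action(s))
```

In this toy run, switching the robot off a painted action takes only a couple of negative signals at the boundary state, after which the next action's region is painted in turn and the final greedy choices match the desired behavior across the state space.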

In the animal training literature, a trainer captures a behavior by simply waiting for the behavior to occur naturally, rewarding it to make it recur, and then adding context such as a spoken command (e.g., “down”) or a gesture. To help the animal learn the correct context, reward is lowered for less contextually correct versions of the behavior. Painting closely resembles this capturing strategy.