Before training (random behavior):


During training:

A green flash indicates positive reward from the human trainer. A red flash indicates negative reward.

After 8 episodes (i.e., times to goal) of training:

In these videos, E, S, T, and R indicate, respectively, the episode number, the time step within the current episode, the total number of time steps elapsed so far, and the environmental reward for the current time step (as defined by the MDP).
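
As a rough illustration of how the four overlay values relate to one another, the following is a minimal sketch of the bookkeeping, not the code used to produce the videos. The class name `Overlay`, the method `tick`, the text layout, and the choice to number episodes from 1 are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Overlay:
    """Hypothetical tracker for the E/S/T/R values shown in the videos."""
    episode: int = 1       # E: episode number (assumed to start at 1)
    step: int = 0          # S: time step within the current episode
    total_steps: int = 0   # T: time steps elapsed across all episodes
    reward: float = 0.0    # R: environmental (MDP) reward for this step

    def tick(self, env_reward: float, reached_goal: bool) -> str:
        """Advance one time step and return the overlay text for it."""
        self.step += 1
        self.total_steps += 1
        self.reward = env_reward
        text = (f"E {self.episode}  S {self.step}  "
                f"T {self.total_steps}  R {self.reward:g}")
        if reached_goal:
            # Reaching the goal ends the episode: E advances and S resets,
            # while T keeps counting.
            self.episode += 1
            self.step = 0
        return text
```

Calling `tick` once per time step with the environment's reward and a goal flag yields the overlay string for that frame. Note that R here is the reward defined by the MDP, which is separate from the human trainer's feedback shown by the green and red flashes.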