Check this out: https://cs.stanford.edu/people/karpathy/reinforcejs/
You can see how a few different example RL algorithms work. I particularly like the Gridworld TD example - try changing the epsilon parameter and see how it affects learning. Also check out the Waterworld example. Post any interesting observations as a comment on this recipe!
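
If you want to poke at the epsilon idea outside the browser, here is a minimal Python sketch of epsilon-greedy action selection with a tabular Q-learning (TD) update. This is not the REINFORCEjs code: the corridor environment, the function names `epsilon_greedy` and `run_q_learning`, and the default values for `alpha`, `gamma`, and `episodes` are all assumptions made up for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    best = max(q_values)
    # exploit, breaking ties between equally good actions at random
    return rng.choice([a for a, q in enumerate(q_values) if q == best])


def run_q_learning(epsilon, episodes=200, n_states=6,
                   alpha=0.5, gamma=0.9, seed=0):
    """Q-learning on a hypothetical 1-D corridor (not the demo's gridworld).

    The agent starts at state 0 and gets reward 1 for reaching the last
    state. Actions: 0 = step left, 1 = step right.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    steps_per_episode = []
    for _ in range(episodes):
        s, steps = 0, 0
        while s != n_states - 1 and steps < 1000:  # cap episode length
            a = epsilon_greedy(q[s], epsilon, rng)
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            # TD update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
            steps += 1
        steps_per_episode.append(steps)
    return q, steps_per_episode


if __name__ == "__main__":
    for eps in (0.0, 0.2, 0.8):
        _, steps = run_q_learning(eps)
        avg = sum(steps[-50:]) / 50
        print(f"epsilon={eps}: avg steps over last 50 episodes = {avg:.1f}")
```

Running it with a few epsilon values and comparing the average episode length is a rough stand-in for dragging the epsilon slider in the Gridworld demo.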