Was struggling to figure out why, even after many minutes and a couple thousand episodes, this RL algorithm I'm trying to implement didn't seem to be improving. Kept checking to see if my code was borked somewhere.

Finally went back to the paper and noticed the x-axis of their training graph went from 0 to 200 million episodes lol. Gonna have to burn a lot of watts on this bad boy's education.

