
This is the eleventh article in a series dedicated to the various aspects of machine learning (ML). Today’s article will provide a general overview of reinforcement learning (RL), which is distinguished from other types of machine learning because of its trial-and-error quality. 

Close your eyes and imagine you’re back in middle school. You are sitting in science class. The teacher started talking 37 minutes ago (you have checked the clock 37 times since class started), and you’re pretty sure the lecture is about something called photosynthesis. You look to the student on your right, the star quarterback of the football team, and he’s fast asleep dreaming of pelting you with dodgeballs in gym class. To your left is the weird kid of the class, and she’s busy tattooing eldritch witchcraft-related symbols on her arm with a Sharpie. It seems like nobody is paying attention, until the teacher starts passing out cups with budding plants in them and tells you all to keep yours at home for a week, then come back with a daily log that describes how the process of photosynthesis has changed the plant. At this point, you start to wish you had listened to the lecture or bought the textbook.

Let’s also imagine your middle school days were pre-internet, so when you get home with your plant cup you can’t just Google what photosynthesis is. Your parents are normal people, so of course, they’ve forgotten everything they themselves learned in middle school science class. All the friends you call for help weren’t paying attention in the class either, and there’s a sinking-ship feeling of solidarity among everyone about the inevitability of getting an F. You’re on your own for this project, and without a rulebook to guide you. 

This can go any number of ways, but let’s say you put a little bit of water in the cup, then leave it in your dark closet for the whole week. Maybe you take a photo of it every day since “photo” is in the name of photosynthesis. You figure the plant will grow on its own. The morning that the project is due, you fake a daily log talking about its progress, but when you take the plant out of the closet you discover it has barely grown. When you show it to the teacher, she gives you an F. Oops.

You may have failed science class because of your laziness and ineptitude, but you ended up getting a masterclass in reinforcement learning, which is a method of learning used in AI where an agent is “without a rulebook” and must figure out for itself what the best course of action is.


What Makes Reinforcement Learning Special? 


When it comes to types of learning like clustering, or sitting in class during a lecture, the activity could be described as “passive”: the agent simply observes what goes in (the “input,” e.g., unlabeled pictures of animals, or a plant before a week of photosynthesis) and what comes out (the “output,” e.g., pictures of yellow-furred, red-maned animals grouped together into a cluster, or the same plant after a week of photosynthesis). The learning agent sees both ends of the process and applies the principles learned from this observation in real-life applications.


Reinforcement learning, on the other hand, is not so passive (though there is such a thing as passive reinforcement learning!), in the sense that it relies on the agent seeing the immediate outcomes of its own decisions and judging, without a teacher walking it through the answer, whether each outcome was a success or a failure, and to what degree. When you pulled the barely-grown plant out of your closet, you probably surmised that you didn’t do the project correctly. When your teacher gave you an F, then you knew for sure that you messed up.


Reinforcement learning needs this trial-and-error, thrown-into-the-fire approach because the inputs and outputs of every possible situation, like every possible outcome of a chess game, cannot reasonably be spelled out in advance. The learning agent must be able to deal with problems and find solutions that it has not been explicitly trained on.


When training an ML agent through RL, the teachers are present but at a distance, offering occasional rewards for agents that manage to do the right thing. Rewards can also be negative: your teacher giving you an F was the “reward” for your photosynthesis project. You may not get another shot at the project, but AI agents in training are meant to do tasks over and over again. An RL agent’s goal is to maximize utility, and more rewards mean higher utility.
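To make the "over and over again" part concrete, here is a minimal sketch of a trial-and-error loop for the plant project. Everything in it is an assumption made up for illustration: the two actions, the `grade` function standing in for the teacher, and the reward numbers. The agent tries each option once, then mostly repeats whatever has earned the best average reward so far while occasionally exploring (a strategy usually called epsilon-greedy).

```python
import random

# Hypothetical actions, invented for this illustration.
ACTIONS = ["dark_closet", "sunny_windowsill"]

def grade(action, rng):
    """Stand-in for the teacher: a noisy reward for where the plant was kept."""
    base = 1.0 if action == "sunny_windowsill" else 0.0
    return base + rng.uniform(-0.1, 0.1)

def train(episodes=200, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    value = {a: 0.0 for a in ACTIONS}  # running average reward per action
    count = {a: 0 for a in ACTIONS}    # how often each action was tried
    for _ in range(episodes):
        untried = [a for a in ACTIONS if count[a] == 0]
        if untried:
            action = untried[0]                   # try everything once first
        elif rng.random() < epsilon:
            action = rng.choice(ACTIONS)          # explore occasionally
        else:
            action = max(ACTIONS, key=value.get)  # otherwise exploit
        reward = grade(action, rng)
        count[action] += 1
        # Incremental update of the average reward seen for this action.
        value[action] += (reward - value[action]) / count[action]
    return value

values = train()
best_action = max(values, key=values.get)
```

Nobody tells the agent that plants need light. It only sees that the windowsill keeps earning better grades than the closet, and its estimates drift toward that choice.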

In a previous article, we brought up the example of R2, Domino’s pizza-delivering robot on wheels, being faced with a golden retriever charging at it. We used this example to explain decision trees, and how the ML agent will take the sight of a dog, and its observed environment, and refer to the decision tree for what to do in the situation, leading the agent eventually to swerve out of the way of the dog without hitting a pedestrian. 

With reinforcement learning, the agent doesn’t have the decision tree at its disposal to tell it what the “rules” in the situation are. It needs to find out for itself what is preferred, and this can be discovered after the fact. It can be a tough love situation: Get trucked by the dog once, shame on the dog; get trucked by the dog twice…won’t get fooled again (but if so, then the agent will need some serious reprogramming). So, if the agent manages to swerve out of the way of the dog without hitting or disturbing anyone or anything, and without harming itself or sabotaging the delivery, then the teachers will likely reward it for making the correct move. This instills in the agent the idea that safely evading any oncoming object is the preferable, utility-maximizing decision.

This may seem kind of tedious and involved, but reward systems can actually be much easier for designers to create than an entire labeled system specifying what is and isn’t preferable. Instead, a developer can tell R2 whether it screwed up or did well when facing a dog in just a few lines of code. A decision tree can be complicated, but telling an agent in training that it hit a pedestrian, or fell over, or hurt the dog by playing chicken with it, and thus failed the task, is comparatively simple. The agent’s journey to learning can be a bit messier, but it saves the developers plenty of headaches and time.
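As a rough idea of what "a few lines of code" could look like, here is a sketch of a reward function for the dog encounter. The `Outcome` fields and the reward values are hypothetical, chosen only to mirror the failures named above; this is not how any real delivery robot is programmed.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    """What happened during one encounter (hypothetical fields)."""
    hit_pedestrian: bool = False
    fell_over: bool = False
    hurt_dog: bool = False
    pizza_delivered: bool = False

def reward(outcome: Outcome) -> float:
    # Any of the named failures means the task was failed.
    if outcome.hit_pedestrian or outcome.fell_over or outcome.hurt_dog:
        return -1.0
    # No harm done: reward a completed delivery, stay neutral otherwise.
    return 1.0 if outcome.pizza_delivered else 0.0

clean_run = reward(Outcome(pizza_delivered=True))
bad_run = reward(Outcome(hit_pedestrian=True, pizza_delivered=True))
```

Notice that the function says nothing about how to dodge a dog. It only scores what happened, and the agent has to work out the rest.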


However, the agent tends to become a better agent if the rewards are sparse, meaning that they only occasionally pop up in training. In the real world, nobody is going to be giving the agent metaphorical candy for swerving in the correct direction. That said, “intermediate rewards” can still be given to agents for smaller things, such as staying on the right delivery path.
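The difference between sparse and intermediate rewards can be sketched in a few lines. The episode format below (a list of steps with `on_path` and `delivered` flags) and the bonus size are assumptions invented for the example.

```python
def sparse_rewards(steps):
    """Reward only the final outcome: 1.0 on the step where the pizza arrives."""
    return [1.0 if step["delivered"] else 0.0 for step in steps]

def intermediate_rewards(steps, on_path_bonus=0.05):
    """Same final reward, plus a small bonus for each step spent on the path."""
    rewards = []
    for step in steps:
        r = 1.0 if step["delivered"] else 0.0
        if step["on_path"]:
            r += on_path_bonus
        rewards.append(r)
    return rewards

# A hypothetical three-step delivery: on path, off path, then delivered.
episode = [
    {"on_path": True, "delivered": False},
    {"on_path": False, "delivered": False},
    {"on_path": True, "delivered": True},
]
sparse = sparse_rewards(episode)
shaped = intermediate_rewards(episode)
```

With sparse rewards the agent hears nothing until the delivery is done. With intermediate rewards it gets small nudges along the way, which can speed up learning as long as the bonus stays small next to the reward for the delivery itself.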


Machine Learning Summary


Reinforcement learning is a bit more of a rough-and-tumble method of learning than supervised learning, but writing a reward system is often easier for developers than building a fully labeled system of what is and isn’t preferable. RL works by letting the agent freely respond to inputs and stimuli and deal with the consequences, without providing it with a prewritten ruleset of what is preferred and what is not. Occasionally, the teachers (read: the developers) will give the agent a “reward” for making the right decision, which will help inform its future decisions and give it a picture of which behaviors are preferred and which are not.