This is the twelfth article in a series dedicated to the various aspects of machine learning (ML). Today’s article will continue the discussion of reinforcement learning (RL) by covering some of the approaches that developers take to RL.
Our last article taught us that there is more than one way to grow a plant, but that some ways can end up getting you an F. When you are expected to grow a plant with no teacher guiding you (yet waiting to hand you anywhere from an A to an F depending on your performance), then you are doing what in the AI world is called reinforcement learning. Just as it is with growing a plant, there is more than one way to approach reinforcement learning, though the ones that work are expected to garner an A grade for the agent out in the real world.
We are going to cover four basic categories of RL: Model-based RL, model-free RL, passive RL, and active RL.
Model-Based Reinforcement Learning
This is a type of reinforcement learning where agents discover which actions have rewards by using a transition model of the environment. To illustrate what a transition model is, consider the example below.
Imagine for a second that arcades are still popular. It’s a Friday night, you’re all hopped up on sugar from a gas station slushie, and you roll into Quartersnatchers, your local arcade, to play this new game called Barista!, where the objective is to make and serve as many hot cups of Joe to customers as you can without messing up an order. When you get to the game, you see that there’s not one but two joysticks, plus five unlabeled buttons, each a different color. Perplexing, to say the least. Lucky for you, the workers at the arcade have taped up a detailed list of instructions, including which actions are worth the most points (e.g., making a pumpkin spice latte nets you 500 points).
That list of instructions and its point system is analogous to a transition model for an AI agent. Now, the transition model in an ML agent’s training environment may not be as freely given as Barista!’s instructions. Sometimes, the agent needs to learn the model through observation, as if the video game player knew which moves were worth the most points but had to learn on their own how to operate the buttons and joysticks correctly.
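To make the idea concrete, here is a minimal sketch in Python of what a transition model might look like for the Barista! scenario. The states, actions, and point values here are hypothetical illustrations, not part of any real system:

```python
# A toy transition model for the Barista! analogy: the agent is handed,
# up front, what each action does and what it is worth.
transition_model = {
    # (state, action): (next_state, reward)
    ("idle", "make_pumpkin_spice_latte"): ("serving", 500),
    ("idle", "make_plain_coffee"): ("serving", 100),
    ("serving", "hand_to_customer"): ("idle", 50),
}

def best_action(state):
    """Pick the highest-reward action available from a state.

    With a model like this, the agent can plan ahead; it never has to
    try an action blindly just to find out what it is worth.
    """
    options = {a: r for (s, a), (_, r) in transition_model.items() if s == state}
    return max(options, key=options.get)

print(best_action("idle"))  # -> make_pumpkin_spice_latte
```

Because the model is given rather than learned, choosing an action reduces to looking it up, which is exactly the luxury a model-free agent lacks.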
Model-Free Reinforcement Learning
Okay, now imagine that the employees of Quartersnatchers throw the coveted instructions to Barista! into the nearest slushie-filled garbage can, and leave the poor video game player to their own devices, specifically two joysticks and five buttons. This, basically, is what it is like for an AI agent to operate without a transition model in a training environment.
Of course, an AI agent is not left completely in the dark when the trainers don’t give it a transition model. One method used is Q-learning, where an agent learns from the rewards it receives which actions tend to net it the most total reward, leading it to privilege one action over another based on these learned estimates.
So, in a model-free environment, an agent is expected to make decisions based on more limited information, as opposed to a model-based one, where an agent knows which outcomes are best even if it doesn’t yet know what must be done to achieve them.
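As a rough sketch of how Q-learning operates without a model, here is a minimal tabular update in Python. The toy states, button names, reward, and constants are hypothetical choices for illustration:

```python
from collections import defaultdict

# Minimal tabular Q-learning sketch. The agent has no model; it only sees
# (state, action, reward, next_state) samples and keeps a running estimate
# Q[s][a] of the total reward it can expect from taking action a in state s.

ALPHA, GAMMA = 0.5, 0.9  # learning rate, discount factor (arbitrary here)
Q = defaultdict(lambda: defaultdict(float))

def q_update(state, action, reward, next_state, actions):
    """Standard Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    """
    best_next = max(Q[next_state][a] for a in actions)
    Q[state][action] += ALPHA * (reward + GAMMA * best_next - Q[state][action])

# One observed experience: pressing the blue button while idle paid 500.
q_update("idle", "blue", 500, "serving", actions=["blue", "red"])
print(Q["idle"]["blue"])  # -> 250.0 after one update (0 + 0.5 * 500)
```

Each repetition of a rewarding experience nudges the estimate closer to the true value, which is how the "hint" about the best actions emerges from raw experience rather than from a model.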
Passive Reinforcement Learning
Here, an agent is given a policy: a guide for which actions it should choose based on the possible states it could be in. What the agent must learn for itself is which actions are worth the most points, so that it can discover how to maximize its rewards. To do that, it needs to familiarize itself with the many action paths it can take in an environment.
With our Barista! player, this can correspond to knowing how to make the many available drinks, but not knowing which ones score the most points, or that adding extra sugar to a grande frappe is worth an additional 200 points.
The aim, then, is still to discover through trial-and-error how to best navigate the world the agent is placed in and complete its task.
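A passive agent like this is often modeled with temporal-difference (TD) learning: it follows the policy it was given and only updates its estimate of how valuable each state is. Below is a minimal sketch, with a hypothetical policy, states, and rewards borrowed from the Barista! analogy:

```python
# Passive RL sketch: the policy is fixed and given; the agent learns only
# the value V(s) of each state by replaying policy-directed experience.

ALPHA, GAMMA = 0.1, 1.0  # learning rate, discount factor (arbitrary here)
V = {"counter": 0.0, "espresso_machine": 0.0, "done": 0.0}

# The handed-down policy: what to do in each state (never questioned).
policy = {"counter": "walk_to_machine", "espresso_machine": "pull_shot"}

def td_update(state, reward, next_state):
    """Temporal-difference update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))"""
    V[state] += ALPHA * (reward + GAMMA * V[next_state] - V[state])

# Replay one episode directed by the fixed policy:
# counter -> espresso_machine -> done.
td_update("espresso_machine", 200, "done")   # pulling the shot paid 200
td_update("counter", 0, "espresso_machine")  # walking paid nothing directly
```

After the episode, the agent has learned that the espresso machine state is valuable, and that the counter is valuable as a stepping stone toward it, without ever having chosen an action on its own.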
Active Reinforcement Learning
An active learning agent does not have a policy telling it which actions to take; rather, it decides its actions for itself. The passive learning agent runs through its policy-directed actions to discover which action paths are best, but the active learning agent must truly experiment if it is to figure out how to accomplish its tasks. In essence, it must build its own policy through trial and error.
There can be issues here. When the agent builds its own policy from experimentation in the training environment, it may mistakenly conclude that one action path is the optimal solution just because it carries a decent reward. The downside is that it won’t maximize utility in the long run because of this ignorance, like a Barista! player who never learns about the extra-sugar rule. That player may get a good score, but may not break the high score of the player who does know about the rule.
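One common way to keep an active agent from settling on a merely decent action path is epsilon-greedy exploration: most of the time the agent exploits the best action it currently knows, but with probability epsilon it tries something else. A minimal sketch, with hypothetical action values:

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from a {action: estimated value} mapping.

    With probability epsilon, explore a random action; otherwise exploit
    the best-known one, so a "decent" path can't trap the agent forever.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))  # explore
    return max(q_values, key=q_values.get)  # exploit

# The agent hasn't discovered the extra-sugar rule yet, so that action
# still looks worthless in its estimates.
q = {"plain_coffee": 100.0, "extra_sugar_frappe": 0.0}
print(epsilon_greedy(q, epsilon=0.0))  # -> plain_coffee (pure exploitation)
```

Setting epsilon to zero reproduces the trap described above: the agent keeps serving plain coffee and never stumbles onto the bonus. A small positive epsilon is the usual compromise between exploring and exploiting.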
However, when it comes to real-world situations, developers tend to train agents with the knowledge of which actions can be truly dangerous in an irreversible way, like telling R2 the delivery robot to not run people over for the sake of making a quicker delivery.
Machine Learning Summary
Reinforcement learning can often resemble a game, and you can teach someone how to play a game in many ways. In model-based RL, it is like giving the player the goals of the game, though you may not tell them exactly how to achieve those goals. In model-free RL, the player doesn’t have instructions, but can be given general hints about which moves are best for the endgame. Passive RL is like telling the player which actions to take in which situations, without telling them which actions and situations are the most valuable. Active RL is like throwing the player into the fire and letting them rely on intuition to figure out how to play, with some hints about which actions are dangerous or not preferable.