In recent years, reinforcement learning, one of the main approaches to machine learning, has been attracting attention. For example, “AlphaGo,” the AI that defeated the world Go champion, uses reinforcement learning.
Reinforcement learning stands alongside supervised learning, in which the correct answers are attached to the training data, and unsupervised learning, in which they are not. So what makes it different?
In this article, we will explain how it works with concrete examples, using terms specific to reinforcement learning such as the “environment” and the “agent.” We will also introduce case studies and remaining challenges.
If you are not yet familiar with reinforcement learning, please read on.
What is reinforcement learning?
Reinforcement learning is one method of machine learning.
Machine learning is a technology that allows computers to take in large amounts of data, discover patterns and rules in it, and use them to classify and predict unknown data.
So how do machines discover patterns and rules? Machine learning is commonly classified into three approaches.
Machine learning taxonomy
The first method is “supervised learning.”
The “teacher” here is the correct-answer data. The model learns from the given data while referring to the correct answers attached to it.
When you then give the trained model unknown data, it predicts the answer.
A classic example of supervised learning is a classification program. After learning from a large number of dog and cat images, the model will automatically classify a new dog or cat image.
In this way, it is common to divide the process into a training step and a recognition/prediction step, and the much-discussed deep learning is one method used here.
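As a toy illustration of the idea (not the method any real product uses), here is a minimal supervised classifier: a 1-nearest-neighbor rule. The (weight, ear length) features and all data points are invented stand-ins for image features.

```python
# A minimal sketch of supervised learning: a 1-nearest-neighbor classifier.
# The toy "images" are just (weight_kg, ear_length_cm) pairs with labels attached.
def nearest_neighbor(train, query):
    """Return the label of the training example closest to `query`."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, label = min(train, key=lambda fl: dist(fl[0], query))
    return label

# Labeled training data (the "teacher"): correct answers are given.
train = [((30.0, 10.0), "dog"), ((4.0, 6.0), "cat"), ((25.0, 12.0), "dog")]
print(nearest_neighbor(train, (5.0, 5.5)))  # classify an unseen animal
```

Given a new, unlabeled point, the model answers with the label of its closest training example, which is the "recognition/prediction step" described above.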
The second method is “unsupervised learning.”
Unlike supervised learning, unsupervised learning is a method in which the machine learns without being given correct-answer data.
Instead of training on a huge amount of labeled data, an unsupervised program analyzes the structure and characteristics of the data itself, grouping and simplifying it.
A representative unsupervised method is “clustering,” which works quite differently from the classification problems of supervised learning.
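As a minimal sketch of clustering, here is a one-dimensional k-means on unlabeled numbers; the data and parameters are invented for illustration.

```python
import random

def kmeans_1d(data, k=2, iters=20, seed=0):
    """A minimal 1-D k-means sketch: group unlabeled numbers into k clusters."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)
    for _ in range(iters):
        # assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for x in data:
            i = min(range(k), key=lambda i: abs(x - centers[i]))
            clusters[i].append(x)
        # move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.8, 10.1, 10.3]  # no labels are given
print(kmeans_1d(data))
```

No "correct answer" appears anywhere: the algorithm discovers the two natural groups from the structure of the data alone.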
The third method is “reinforcement learning.”
Reinforcement learning improves the strategy the machine follows as it learns. Instead of being fed correct answers, the machine proceeds by trial and error, guided by signals of how desirable its behavior was, called “rewards.”
A good example of reinforcement learning is a cleaning robot. The robot learns, through trial and error, the route that collects the most rubbish, because collected rubbish is its reward.
The process of a human learning to ride a bicycle can also be likened to reinforcement learning. At first you wobble and can barely move forward.
But by slightly adjusting what you do while noting how it turns out, you gradually move forward. Repeating this, you discover what works best and can ride longer and longer distances.
Advantages and disadvantages of reinforcement learning
The strength of reinforcement learning can be seen in AI beating humans at games such as Go and shogi.
In these games, people can disagree about which move is best in a given position, so it is difficult to evaluate individual moves in isolation.
Reinforcement learning, however, can evaluate and optimize such complex procedures consisting of many actions. It is also good at controlling robot walking.
In reinforcement learning, if walking for a long time is set as the reward, the joint angles, stride length, left-right timing and so on can be learned automatically. Supervised learning, by contrast, would require preparing all of that data as input-output pairs.
On the other hand, there are also disadvantages.
First, it takes a long time to learn. This is true of machine learning in general, but it is especially noticeable in reinforcement learning.
Also, the optimal behavior a machine derives may not look rational to humans. In general, rather than judging reinforcement learning as good or bad, it is important to choose the learning method that suits your purpose.
How Reinforcement Learning Works
Let’s take a closer look at how reinforcement learning works, using the bicycle example again.
Suppose a child comes to a park to ride a bicycle and wants to learn to ride. Here we call the park the “environment” and the child the “agent.”
Now, to ride a bike, you need to grip the handlebars, pedal, and keep your balance. These are called “actions.” The situation that the actions change is called the “state.”
The result of an action is a change such as wobbling or leaning forward; that is, the state has changed. Depending on the state, you either move forward or fall.
A “reward” is set for this result. For example, riding 50 m on the bicycle is worth 1 point.
The agent learns by working backwards from the reward: which actions were good to take in order to earn that 1 point?
Whereas “supervised learning” would evaluate each behavior one by one (how to pedal, how to keep balance, how to fall), “reinforcement learning” evaluates the whole sequence of actions, from gripping the handlebars to keeping balance, by its outcome.
Through trial and error, the child maximizes the reward. In other words, the goal is to travel a long distance, and the child learns the optimal rules of behavior for that purpose.
These rules of behavior are called a “policy.”
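The environment-agent loop above can be sketched in a few lines. The environment, states, actions, and the reward of 1 point for riding 50 m are hypothetical stand-ins, not a real physics model.

```python
# A minimal sketch of the agent-environment loop: the agent observes the
# state, chooses an action via its policy, and the environment responds.
def run_episode(policy, steps=50):
    state = 0               # state: meters traveled so far
    reward = 0
    for _ in range(steps):
        action = policy(state)      # the agent acts based on the state
        if action == "pedal":
            state += 2              # the environment updates the state
        # any other action ("coast") leaves the state unchanged
        if state >= 50:
            reward = 1              # reward: the agent rode 50 m
            break
    return reward

always_pedal = lambda state: "pedal"    # a (very simple) policy
print(run_episode(always_pedal))        # this policy earns the reward
```

A policy is just a mapping from states to actions; learning means improving that mapping so that the total reward grows.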
What are Q-values and Q-learning?
Q-learning, often mentioned together with reinforcement learning, is one of its representative algorithms.
In reinforcement learning, actions produce rewards, and behavior is corrected based on them. However, there is one problem.
The action that maximizes the immediate reward is not always the most desirable action in the long run.
For example, in shogi, capturing an important enemy piece looks desirable in the short term. But if that move badly weakens your defense, it becomes a losing move in the long run.
The long-term reward is called “value.” We need behavior that maximizes long-term value, not short-term reward.
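One common way to make “long-term value” precise is a discounted sum of future rewards. The discount rate gamma and the two reward sequences below are illustrative assumptions, not taken from any real shogi engine.

```python
# Value as the discounted sum of future rewards: rewards t steps in the
# future are weighted by gamma ** t, so later rewards count for less.
def discounted_return(rewards, gamma=0.9):
    return sum(r * gamma ** t for t, r in enumerate(rewards))

greedy  = [5, 0, -10]   # capture a piece now, lose badly later
patient = [0, 1, 8]     # small rewards now, a strong position later
print(discounted_return(greedy), discounted_return(patient))
```

The greedy sequence has the larger immediate reward but the smaller value, which is exactly the shogi trap described above.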
The value obtained by taking a certain action in a certain state is called the “Q-value.”
The Q-value is an expected value, and computing that expectation exactly would require accurately predicting the states that follow.
Strictly speaking, this is impossible, which is easier to see with a concrete example.
After you make a move in shogi, you cannot know what move your opponent will play in return or how the state will change. Likewise, in Tetris you do not know which block comes next.
Therefore, by actually taking actions and updating the Q-values one by one, the machine eventually learns the problem without computing the expectation explicitly.
Q-learning is an algorithm for learning these Q-values. Other algorithms include SARSA and Monte Carlo methods.
What is deep reinforcement learning?
Reinforcement learning has become even more practical when combined with deep learning, as deep reinforcement learning. Deep learning is a machine learning method, most often used in supervised learning.
Its difference from conventional machine learning is that the machine learns the features by itself.
For example, suppose you build an AI that distinguishes dogs from cats. Traditionally, humans specified what to look at, such as the shape of the ears or the presence of whiskers.
Deep learning, on the other hand, lets the machine learn automatically which features matter. As a result, it can be more accurate than relying on human-specified features.
Since the results of reinforcement learning depend greatly on the reward given, deciding what reward to give is difficult.
Deep reinforcement learning uses deep learning to represent the value of actions, that is, as the function that derives the Q-values mentioned above.
In other words, using deep learning for the Q-function is deep reinforcement learning.
What is a Q-value: the value obtained by taking a certain action in a certain state
Deep reinforcement learning has succeeded even on tasks with high-dimensional states, such as game screens, and has made reinforcement learning widely usable.
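A conceptual sketch of that idea: instead of a Q-table, a small neural network maps a (possibly high-dimensional) state vector to one Q-value per action. The layer sizes and random weights below are arbitrary assumptions; a real system (DQN-style) would train this network, which is omitted here.

```python
import random

# A tiny untrained "Q-network": state vector in, one Q-value per action out.
random.seed(0)
STATE_DIM, HIDDEN, N_ACTIONS = 4, 8, 2

W1 = [[random.gauss(0, 0.5) for _ in range(STATE_DIM)] for _ in range(HIDDEN)]
W2 = [[random.gauss(0, 0.5) for _ in range(HIDDEN)] for _ in range(N_ACTIONS)]

def q_network(state):
    """Forward pass: linear layer, ReLU, linear layer."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, state))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

qs = q_network([0.1, -0.3, 0.7, 0.2])  # e.g. a compressed game-screen state
print(len(qs))                          # one Q-value per action
```

Because the network generalizes across similar states, it can cover state spaces (like raw game screens) far too large for any table.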
Reinforcement learning challenges
An AI that learns optimal behavior by itself fits the image of truly human-like artificial intelligence. In practice, however, reinforcement learning faces various challenges.
First of all, reinforcement learning only learns the optimal behavior within a given environment.
For example, even if a self-driving car can successfully avoid obstacles, to actually drive on public roads it must also obey rules such as traffic lights and signs.
It must also handle special cases such as a person suddenly stepping out. Practical use is therefore difficult unless the environment is simple.
Reinforcement learning excels in fields such as games and robotics; other areas remain challenges for the future.
Reinforcement learning case studies
Let’s look at some practical examples of reinforcement learning in use.
Game example
Games are an area where reinforcement learning excels. For example, the video below shows an agent learning optimal behavior in Mario Tennis.
Robot example
Robotics is also an area where reinforcement learning does well. For example, the video below shows it applied to grasping objects.
Car example
Reinforcement learning is also used in autonomous driving. The game-like environment in the video below is far from a public road, but each car automatically learns a route that avoids collisions.
In conclusion
Reinforcement learning is a technique for learning the optimal actions that maximize a reward.
This resembles human learning, in which we try different means by trial and error to achieve a goal. It is also a technique that has achieved remarkable results, as in the Go example.
On the other hand, its scope of application is still limited and challenges remain. Depending on your purpose, it is important to also make good use of supervised and unsupervised learning.