Dear Fellow Scholars, this is Two Minute Papers with Károly Zsolnai-Fehér.
This is a collaboration between DeepMind and OpenAI on using human demonstrations to teach
an AI to play games really well.
The basis of this work is reinforcement learning, which is about choosing a set of actions in
an environment to maximize a score.
For many games, this score is provided by the game itself, but in more complex games, for instance, ones that require a lot of exploration, this score alone is not very useful for training an AI.
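To make this loop a little more concrete, here is a minimal Python sketch of what "choosing actions to maximize a score" looks like, assuming a Gym-style environment interface; env and policy are illustrative placeholders, not code from the paper.

```python
# A minimal sketch of the reinforcement learning loop, assuming a
# Gym-style environment API; env and policy are hypothetical placeholders.
def run_episode(env, policy):
    """Play one episode, choosing actions to collect as much score as possible."""
    observation = env.reset()
    total_score = 0.0
    done = False
    while not done:
        action = policy(observation)                         # the agent picks an action
        observation, reward, done, info = env.step(action)   # the game reports a score signal
        total_score += reward                                # the goal: maximize this sum
    return total_score
```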
In this project, the key idea is to use human demonstrations to teach an AI how to succeed.
This means that we can sit down, play the game, show the footage to the AI and hope
that it learns something useful from it.
Now, the most obvious implementation of this would be to imitate the footage closely, or in other words, simply redo what the human has done.
That would be a trivial endeavor, and it is the most common way of misunderstanding what is happening here, so I will emphasize that this is not the case.
Just imitating what the human player does would not be very useful because, one, it puts too much burden on the humans, and that's not what we want, and two, the AI could never become significantly better than the human demonstrator, and that's also not what we want.
In fact, if we have a look at the paper, the first figure shows right away how badly a simple imitation agent performs.
That's not what this algorithm is doing.
What it does instead is that it looks at the footage as the human plays the game, and tries
to guess what they were trying to accomplish.
Then, we can tell a reinforcement learner that this is now our reward function and it
should train to become better at that.
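As a rough illustration of this idea, and not the authors' exact algorithm, one could train a small network to score frames from the human footage above frames from random play, and then hand its output to the reinforcement learner as the reward; all names below, such as reward_net, demo_frames and random_frames, are hypothetical.

```python
# A simplified sketch of learning a reward function from human footage
# (illustrative only, not the authors' exact method). The network is
# trained to score demonstration frames above frames from random play.
import torch
import torch.nn as nn

reward_net = nn.Sequential(          # maps an 84x84 game frame to a scalar reward
    nn.Flatten(),
    nn.Linear(84 * 84, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(demo_frames, random_frames):
    """One update: human demonstration frames should score high, random ones low."""
    demo_logits = reward_net(demo_frames)
    rand_logits = reward_net(random_frames)
    loss = (bce(demo_logits, torch.ones_like(demo_logits))
            + bce(rand_logits, torch.zeros_like(rand_logits)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Afterwards, the agent trains with reward_net(frame) as its reward
# signal instead of the score reported by the game itself.
```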
As you can see here, it can play an exploration-heavy game such as Atari's "Hero", and in the footage above, you can see the rewards over time; the higher, the better.
This AI performs really well in this game, and it significantly outperforms reinforcement learner agents trained from scratch on Montezuma's Revenge as well, although it can still get stuck on a ladder.
We discussed earlier a curious AI that was quickly getting bored by ladders and moved
on to more exciting endeavors in the game.
The performance of the new agent seems roughly equivalent to that of an agent trained from scratch in the game Pong, presumably because of the lack of exploration required and the fact that it is very easy to understand how to score points in this game.
But wait, in the previous episode we just talked about an algorithm where we didn't
even need to play, we could just sit in our favorite armchair and direct the algorithm.
So why play?
Well, just providing feedback is clearly very convenient, but as we can only specify what
we liked and what we didn't like, it is not very efficient.
With the human demonstrations here, we can immediately show the AI what we are looking for, and since it is able to learn the underlying principles, improve further, and eventually become better than the human demonstrator, this work provides a highly desirable alternative to already existing techniques.
Loving it.
If you have a look at the paper, you will also see how the authors incorporated a cool additional step into the pipeline where we can add annotations to the training footage, so make sure to have a look!
Also, if you feel that a bunch of these AI videos a month are worth a dollar, please
consider supporting us at Patreon.com/twominutepapers.
You can also pick up cool perks like getting early access to all of these episodes, or
getting your name immortalized in the video description.
We also support cryptocurrencies and one-time payments; the links and additional information for all of these are available in the video description.
With your support, we can make better videos for you.
Thanks for watching and for your generous support, and I'll see you next time!