Hi, I'm Sally Goldman. I'm a research scientist at Google, and one of the main things I work on is recommendation systems.
And one thing that's really fundamental to doing these recommendation systems is embeddings, and that's what I'm going to talk about today.
As a motivating example I'm going to look at the problem of collaborative filtering.
So let's say I have a million movies and I have a half million users, and for each user I know which movies that user has watched.
The task is simple: I'd like to recommend movies to users.
To solve this problem I'm really going to have to learn some structure, something that lets me say these movies are similar to each other, so if you've watched these 3 movies then this is a good movie to recommend.
So as a simple starting point, let's try to take these movies and just put them along a line, a one-dimensional embedding.
So maybe to the left I'll put animated movies, and as I move to the right, I'll have more adult-oriented movies.
This starts to do nice things.
I have Shrek and The Incredibles, those are both animated movies for kids and if you watch one the other one is a good recommendation.
But then I have The Triplets of Belleville, which is an animated movie, but Harry Potter, though not an animated movie, is I think a much closer movie to The Incredibles.
The Triplets of Belleville is not really oriented toward kids as much, and it's not the sort of blockbuster movie that a lot of people go to see.
And on the other side for example I'd say Blue and Memento are probably better recommendations for each other than The Dark Knight Rises.
So just having a single line, as much as I try, it's going to be really hard to capture all the intricacies in movies that make people like one versus another.
So what if we add another dimension and now I have 2 dimensions?
So what if I bring the blockbuster movies up towards the top and the more art house movies down?
Now I've achieved some of the things I've wanted.
I've got Shrek and The Incredibles and Harry Potter kinda nearby and they're all pretty similar movies and in the bottom right I have Blue and Memento.
And you can imagine that there's a lot of other aspects you'd want to capture and you'd want more than 2 dimensions, and we would.
In reality we could imagine 20, 50, even 100 dimensions to sort of do these embeddings.
But let's stick with 2 dimensions because I can draw it.
So let's add a few more movies to this, and I went ahead and added some axes.
I have the X axis, which goes from more children-oriented movies on the left to more adult movies on the right.
And the Y axis, with more blockbuster movies toward the top and more art house films at the bottom.
And you can see a lot of nice structure here and you can see that movies nearby each other are kind of similar and that's really the goal of what we want.
Now I'm drawing this geometrically but I do want to make sure everyone understands that there's a very simple way to represent these embeddings and that's what's going to happen when I learn them in a deep neural network.
So just using Shrek and Blue as an example, each of these is just a single point in this two dimensional space and the way we write down a point is just a value on the X axis and a value on the Y axis.
So for example Shrek is just the point (-1.0, 0.95), or Blue is (0.65, -0.2).
So each movie here can just be represented as two real numbers, and the similarity between movies is now captured by how close these points are.
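Just to make that concrete, here's a minimal sketch in Python; the coordinates are made-up values of the kind you'd read off a plot like this one, and "similarity" is simply distance between the points:

```python
import numpy as np

# Hypothetical 2-D embeddings: x = children vs. adult, y = art house vs. blockbuster.
embeddings = {
    "Shrek": np.array([-1.0, 0.95]),
    "The Incredibles": np.array([-0.9, 0.80]),
    "Blue": np.array([0.65, -0.20]),
}

def distance(a, b):
    # Euclidean distance between two movie embeddings; smaller means more similar.
    return np.linalg.norm(embeddings[a] - embeddings[b])

print(distance("Shrek", "The Incredibles"))  # small: a good recommendation pair
print(distance("Shrek", "Blue"))             # large: a poor recommendation pair
```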
And although I'm only going to draw 2 dimensions, in reality you do want to do this in D dimensions, 2 isn't going to be enough to capture everything.
Implicitly as you think about what you're doing, this is really assuming that interest in movies can be captured by D dimensions.
I'm allowing D different aspects to be selected and then I can move the movies independently among these D aspects and use that to now bring similar movies nearby to each other.
Each movie now is just a D dimensional point, I can write it down as D real values and the cool thing is we can actually learn these embeddings from data and we can do this with a deep neural network without adding a lot of new things to what you've already seen.
There's no separate training process needed, we're just going to use back propagation exactly as before and the embedding layer is just a hidden layer and we'll have one unit for every dimension you want in your embedding.
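As a rough sketch of what "the embedding layer is just a hidden layer" means (toy sizes and random weights, purely for illustration), feeding a one-hot movie vector through that layer is just a matrix multiply that picks out one row of the layer's weight matrix:

```python
import numpy as np

num_movies, embed_dim = 5, 2                 # toy sizes; real vocabularies are much larger
W = np.random.randn(num_movies, embed_dim)   # the embedding layer's weight matrix

# A movie fed in as a one-hot input vector...
one_hot = np.zeros(num_movies)
one_hot[3] = 1.0

# ...passes through the "hidden layer" as an ordinary matrix multiply,
# which just selects row 3 of W. That row is the movie's embedding.
hidden = one_hot @ W
assert np.allclose(hidden, W[3])
```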
Supervised information is going to allow us to tailor these embeddings for whatever task you're after.
If you want to do movie recommendation, then we want these embeddings to be geared towards recommending movies.
We will need some sort of training signal; we'll look at some concrete examples, but in this example, if a user has watched a set of movies, then to some extent those movies are similar to each other and should be nearby, and we'll aggregate this of course over lots of data.
Intuitively these hidden units are learning how to organize the data in a way to optimize whatever metric we've decided to put as the final objective of the network.
So now let's go back and look at how would this actually be input to the neural network.
The matrix I show on the right is sort of the classic way we think of collaborative filtering input.
I have one row for every user and one column for every movie and a check in this simple case indicates the user has watched the movie.
So now let's think about how we do this within TensorFlow.
Each example is really just going to be one row of this matrix, so let's focus on the bottom row that I've highlighted in yellow.
If there's a half million movies I don't really want to list all the movies you haven't watched, it's so much more efficient to just write down the movies you have watched.
And when I do back propagation I'll be computing dot products, and I'd like that time to also depend only on the movies you have watched.
So to achieve this we're going to use the following input representation and to do this we're going to have 2 phases.
The first pre-processing phase we're going to build what we call a dictionary.
A dictionary is just a mapping from each feature, in this case each movie, to an integer from 0 to the number of movies -1.
So I'll just do this in the order I've shown them in the columns.
So column 0 I'll call movie 0, column 1 movie 1 and so on, and this is a one time thing we do as pre-processing.
Now I can efficiently represent that bottom example as just the 3 movies that user did watch, I don't need to worry about all the other ones.
I do it kind of as a pictorial view but in reality it's just 3 integers - 1, 3, 999,999 - because those are the indices for the 3 movies that user has watched.
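Here's a small sketch of that two-phase idea in Python; the titles and their ordering are made up for illustration:

```python
# Phase 1 (pre-processing): build a dictionary mapping each movie to an integer index.
movie_titles = ["Shrek", "The Incredibles", "Harry Potter", "Memento", "Blue"]
movie_to_id = {title: i for i, title in enumerate(movie_titles)}

# Phase 2 (training time): an example is just the indices of the movies a user watched,
# not a half-million-long row of mostly zeros.
watched = ["The Incredibles", "Memento"]
example = sorted(movie_to_id[title] for title in watched)
print(example)   # e.g. [1, 3]
```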
Okay so now that we have the input representation we can now look at how this fits into the full network and I'm going to use 3 different examples to help illustrate it.
The first example I want to look at is the problem of predicting a home sales price.
So this would traditionally be done as a regression problem.
I'd like to optimize the square loss between the predicted price and the true sale price.
So the thing that I really would like to create an embedding for here is the words in the sales ad, the house description.
Because although it's just a set of words, I really need to understand which words are similar; in terms of figuring out the size of the house, I may say this is a spacious house or I may say it's roomy.
Those are words that kind of capture the same thing, and so I want to begin understanding how these words that real estate agents put in ads help us understand something about the home.
So we have lots and lots of words that might be in an ad, and any given ad has 100 words or so, and so again we really do want the sparse embedding just like we talked about, but my vocabulary is over words instead of movies.
I'm going to learn a 3 dimensional embedding in this little toy example just so I can draw it, again in reality you'd probably want a lot more than 3 dimensions.
And I'm always in these examples going to draw my embedding layer as green, it's really a hidden layer, in this case 3 units because I want a 3 dimensional embedding.
I also may have other input data like the latitude, longitude, number of rooms and you can add all that, I just used latitude and longitude as an example.
And then in pink I'm showing the fact that we can have whatever other hidden layers we want, these are just your standard hidden layers, you can have as many as you want.
You can decide how many units, and then at the end they'll go into a single unit that, for the regression problem, will give us a real value, and we'll optimize the L2 loss with the sale price.
In the process of doing back propagation just like you've seen, the embedding layer will be learned.
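As a hedged sketch of how that might look in TensorFlow Keras (the vocabulary size, layer widths, and optimizer here are my own assumptions, not values from the talk):

```python
import tensorflow as tf

VOCAB_SIZE = 50_000   # number of distinct words that might appear in ads (assumption)
EMBED_DIM = 3         # tiny, to match the 3-unit embedding layer in the example

# Sparse input: the indices of the words appearing in the listing's description.
word_ids = tf.keras.Input(shape=(None,), dtype=tf.int32, name="ad_words")
word_vecs = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(word_ids)
ad_embedding = tf.keras.layers.GlobalAveragePooling1D()(word_vecs)

# Dense side inputs such as latitude and longitude.
lat_lng = tf.keras.Input(shape=(2,), name="lat_lng")

x = tf.keras.layers.Concatenate()([ad_embedding, lat_lng])
x = tf.keras.layers.Dense(64, activation="relu")(x)   # whatever extra hidden layers you like
price = tf.keras.layers.Dense(1)(x)                   # single unit for the regression output

model = tf.keras.Model(inputs=[word_ids, lat_lng], outputs=price)
model.compile(optimizer="adam", loss="mse")           # squared loss against the true sale price
```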
As another example, suppose I want to learn to classify handwritten digits.
So I have the digits 0 to 9 and I have some training data where there's actually a label of the correct digit.
So here the sparse thing I want to create an embedding of is just the raw bitmap of the drawing, whether each pixel is white or black, so 0 or 1.
I can introduce whatever other features I'd like, and again I have an embedding layer, which I'll keep at 3 dimensions, so the representation of the digit will go into that.
In pink I show we can have whatever additional hidden layers, and in this case we'll have a logit layer.
We're gonna have the 10 digits and basically learn a probability distribution over the digits of how probable we think it is that this is each of the digits.
I can take the one-hot target probability distribution from what I know the right answer is and optimize a softmax loss.
In the process of doing this, in training with back propagation, I will learn to embed the images.
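A similar sketch for the digit example, again with assumed sizes (28x28 bitmaps and a 32-unit hidden layer, neither of which is specified in the talk):

```python
import tensorflow as tf

NUM_PIXELS = 28 * 28   # assuming 28x28 bitmaps
EMBED_DIM = 3

# Sparse input: indices of the pixels that are "on" in the drawing.
pixel_ids = tf.keras.Input(shape=(None,), dtype=tf.int32, name="on_pixels")
pixel_vecs = tf.keras.layers.Embedding(NUM_PIXELS, EMBED_DIM)(pixel_ids)
image_embedding = tf.keras.layers.GlobalAveragePooling1D()(pixel_vecs)

x = tf.keras.layers.Dense(32, activation="relu")(image_embedding)  # extra hidden layers
logits = tf.keras.layers.Dense(10)(x)                              # one logit per digit

model = tf.keras.Model(inputs=pixel_ids, outputs=logits)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```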
And now let's look at the example we've been studying of collaborative filtering, the movie recommendation problem.
This is actually interesting, it brings up an aspect we haven't seen yet which is where is my training data here, right?
I just know for each user there is a set of movies, so how do I know what the right movie to recommend is? What am I going to use as the label?
What we do is use a simple trick: suppose the user has watched 10 movies.
We'll randomly pick 3 movies and hold those out, take them away, and those are the labels; those are the movies I'd like to recommend, and they're good recommendations because you watched them. And I'll take the other 7 movies and use them as my training data.
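That hold-out trick might look something like this (a minimal sketch; the 3-out-of-10 split just follows the example above):

```python
import random

def split_watch_history(watched_movie_ids, num_held_out=3):
    """Randomly hold out a few watched movies as labels; the rest become the input."""
    held_out = random.sample(watched_movie_ids, num_held_out)
    inputs = [m for m in watched_movie_ids if m not in held_out]
    return inputs, held_out

# e.g. a user who watched 10 movies -> 7 input movies, 3 label movies
inputs, labels = split_watch_history(list(range(10)))
```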
Once I've done that, this is very similar to what we just talked about with the character recognition.
I'll take the 7 movies that are my training data and we know how we can get the sparse representation, we'll bring them into the embedding layer.
We can take whatever other features we want, maybe the genre, maybe the director, whatever else we want to take about the movie or the user and then we can bring those into additional hidden layers and we'll have a logit layer.
And note this logit layer is big: instead of 10 different nodes like in the digit prediction, if I have a half million movies there are going to be a half million of these.
There are issues with that, but they're out of the scope of this discussion.
But we will get a distribution over those half million movies of what movies we think you'd like, and we will then optimize the softmax loss with the held-out movies that we know you do like.
And in doing this in the back propagation and just the standard training, we will learn the embeddings of the movies like we talked about.
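Putting the pieces together, a sketch of such a network in TensorFlow Keras might look like this; the sizes are assumptions, and in practice a half-million-way softmax needs tricks such as sampled softmax, which as noted is out of scope here:

```python
import tensorflow as tf

NUM_MOVIES = 500_000
EMBED_DIM = 3

# Sparse input: the indices of the movies kept as training input (e.g. the 7 of 10).
watched = tf.keras.Input(shape=(None,), dtype=tf.int32, name="watched_movies")
movie_vecs = tf.keras.layers.Embedding(NUM_MOVIES, EMBED_DIM)(watched)
user_embedding = tf.keras.layers.GlobalAveragePooling1D()(movie_vecs)

x = tf.keras.layers.Dense(64, activation="relu")(user_embedding)  # optional extra hidden layers
logits = tf.keras.layers.Dense(NUM_MOVIES)(x)                     # one logit per movie: very wide

model = tf.keras.Model(inputs=watched, outputs=logits)
# Each held-out movie can be paired with the same input as its own training example,
# so an ordinary sparse categorical cross-entropy (softmax) loss still applies.
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```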
So I do want to come back now and just make sure it's clear how what we learned in the deep neural network ties to the geometric view I gave at the beginning.
Let's look at the deep network on the left and let's take a single movie.
Right, if you think of the input layer, each of those nodes at the bottom represents one of these half million movies; I've picked one movie and just made it black.
In this example I said I had 3 hidden units, so I was going with a 3-dimensional embedding.
So that black node will have an edge connecting it to each of those units; I used red for the first one, magenta for the second and brown for the third one.
When you're done training your neural network, those edges are weights, each edge has a real value associated with it, that's my embedding.
The red is my X value, the magenta is my Y value and the brown is the Z.
So this particular movie would be embedded in a 3 dimensional space as 0.9, 0.2 and 0.4.
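In code, reading that vector back out of a trained network is just indexing into the embedding layer's weight matrix; here's a minimal sketch with assumed sizes (and an untrained layer, so the actual numbers would only look like 0.9, 0.2, 0.4 after training):

```python
import tensorflow as tf

embedding_layer = tf.keras.layers.Embedding(500_000, 3)  # half a million movies, 3 dimensions
_ = embedding_layer(tf.constant([[0]]))                  # calling the layer once creates its weights

weight_matrix = embedding_layer.get_weights()[0]         # shape (500000, 3): one row per movie
movie_id = 1234                                          # hypothetical index from the movie dictionary
movie_vector = weight_matrix[movie_id]                   # the red, magenta, and brown edge weights
print(movie_vector)
```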
As with any deep neural network there are hyperparameters, and one of the hyperparameters we have in the embedding layer is how many embedding dimensions, how many hidden units do you want in that layer?
Higher dimensions are good because it allows us to tease apart more distinctions and therefore we can learn better relationships.
On the downside, as I increase the number of dimensions there is also a chance of overfitting and it's going to lead to slower training and the need for more data.
So a good empirical rule of thumb is for the number of dimensions to be roughly the fourth root of the size of my vocabulary, the number of possible values.
But this is just a rule of thumb and with all hyperparameters you really need to go use validation data and try it out for your problem and see what gives the best results.
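As a quick worked example of that rule of thumb, treating it only as a starting point to tune:

```python
vocab_size = 500_000                        # e.g. half a million movies
suggested_dims = round(vocab_size ** 0.25)  # fourth root of the vocabulary size
print(suggested_dims)                       # about 27; then tune with validation data
```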
An embedding can also just be thought of as a tool.
One of the things we get from these embeddings is we map items - movies, or text such as the words in the housing description - to these low-dimensional real vectors in a way that similar items are nearby.
It creates structure among these items where we really didn't have any structure, and the structure is in fact geared toward what you're trying to do with it.
We can also apply embeddings to dense data; for example, if I look at the way audio or soundtracks are represented, it's already dense.
But we don't have any meaningful metric, I don't know how to say this audio is similar to that.
And so we can use embeddings just to learn a similarity metric among already dense data, and even further we can embed diverse types of data - texts, images, audio - jointly and learn a similarity metric across them.