Friday, October 12, 2018

YouTube daily Oct 12 2018

Hi, my name is D. Sculley.

I'm one of the people who is coming to you from Google in order to present this

Machine Learning Crash Course with TensorFlow APIs.

Now before we dive in, let's take a second to remind ourselves

of the basic framework that we are talking about in this class.

And that basic framework is supervised machine learning.

In supervised machine learning, we are learning to create models that combine

inputs, to produce useful predictions even on previously unseen data.

Now, when we're training that model, we're providing it with labels.

And in the case of, say, email spam filtering,

that label might be something like 'spam or not spam'.

It's the target that we're trying to predict.

The features are the way that we represent our data.

So features might be drawn from an email as, say, words in the email

or "to and from addresses", various pieces of routing or header information,

any piece of information that we might extract from that email to represent it

for our machine learning system.

An example is one piece of data.

For example, one email.

Now that could be a labeled example, in which we have both feature information,

represented in that email, and the label value, of 'spam or not spam'.

Maybe that's come from a user who has provided that to us.

Or we could have an unlabeled example, such as a piece of email

for which we have feature information, but we don't yet know

whether it is spam or not spam.

And likely what we are going to do is classify that

to put it in the user's inbox or spam folder.
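
To make those terms concrete, here is a tiny illustrative Python sketch; the feature names and values are invented for this example, not taken from the course.

    # A labeled example: features extracted from an email plus the target label.
    labeled_example = {
        "features": {
            "subject_words": ["win", "a", "free", "prize"],
            "sender_domain": "example.com",   # hypothetical feature
            "num_links": 7,
        },
        "label": "spam",  # provided by a user or other ground truth
    }

    # An unlabeled example: same feature representation, no label yet.
    # The trained model's job is to predict the missing label.
    unlabeled_example = {
        "features": {
            "subject_words": ["meeting", "agenda", "attached"],
            "sender_domain": "corp.example.org",
            "num_links": 1,
        },
    }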

Finally, we have a model and that model is the thing that is doing the predicting.

It's something that we're going to try and create

through a process of learning from data.

For more information >> Framing - Duration: 1:44.

-------------------------------------------

Course Overview - Duration: 0:56.

>> CHRISTINE: Welcome to Google's Machine Learning Crash Course with TensorFlow APIs.

My name's Christine, and I'm one of the many Googlers working on machine learning.

This course will introduce you to key machine learning concepts.

We've aimed the course at programmers with little to no background in machine learning.

Non-programmers who are comfortable with math

will also find plenty of opportunities in this course to learn about machine learning.

This course concentrates on practical machine learning.

We introduce machine learning concepts and show you how to solve real-world problems.

The course has a relatively narrow focus -

aiming to cover the key algorithms in supervised learning.

As you'll discover, machine learning requires a different mindset than other programming problems.

For example, real-world machine learning focuses far more on data analysis than on coding.

We hope you enjoy this course, learn a lot, and have fun doing machine learning.

For more information >> Course Overview - Duration: 0:56.

-------------------------------------------

Descending into ML - Duration: 2:54.

So as we said before, our model is something that we learned from data.

And there are lots of complicated model types

and lots of interesting ways we can learn from data.

But we're gonna start with something very simple and familiar.

This will open the gateway to more sophisticated methods.

Let's train a first little model from data.

So here we've got a small data set.

On the X axis, we've got our input feature,

which is showing housing square footage.

On our Y axis, we've got the target value

that we're trying to predict of housing price.

So we're gonna try and create a model that takes in

housing square footage as an input feature

and predicts housing price as its output.

Here we've got lots of little labeled examples in our data set.

And I'm going to go ahead and channel our inner ninth grader to fit a line.

We can maybe take a look at our data set and

fit a line that looks about right here. Maybe something like this.

And this line is now a model that predicts housing price given an input.

We can recall from algebra one that we can define this thing

as Y = WX + B.

Now in high school algebra we would have said MX,

here we say W because it's machine learning.

And this is referring to our weight vectors.

Now you'll notice that we've got a little subscript here

because we might be in more than one dimension.

This B is a bias, and the W gives us our slope.

How do we know if we have a good line?

Well, we might wanna think of some notion of loss here.

Loss is showing basically how well our line

is doing at predicting any given example.

So we can define this loss

by looking at the difference between the prediction for a given X value

and the true value for that example.

So this guy has some moderate size loss.

This guy has near-zero loss.

Here we've got exactly zero loss.

Here we probably have some positive loss.

Loss is always on a zero through positive scale.

How might we define loss?

Well, that's something that we'll need to think about in a slightly more formal way.

So let's think about one convenient way to define loss for regression problems.

Not the only loss function, but one useful one to start out with.

We call this L2 loss, which is also known as squared error.

And it's a loss that's defined for an individual example

by taking the square of the difference between our model's prediction and the true value.

Now obviously as we get further and further away from the true value,

the loss that we suffer increases with a square.

Now, when we're training a model we don't care about minimizing loss on just one example,

we care about minimizing loss across our entire data set.
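
As a rough sketch of that idea (the numbers and the hand-picked weights below are made up for illustration), the squared error for each example and the average over the whole data set might be computed like this in Python:

    import numpy as np

    def predict(x, w, b):
        """Linear model: y' = w * x + b."""
        return w * x + b

    def l2_loss(y_true, y_pred):
        """Squared error for individual examples."""
        return (y_true - y_pred) ** 2

    # Toy data: square footage vs. price (illustrative values only).
    x = np.array([1000.0, 1500.0, 2000.0, 2500.0])
    y = np.array([200000.0, 280000.0, 370000.0, 450000.0])

    w, b = 170.0, 20000.0            # a hand-picked "ninth grader" fit
    y_pred = predict(x, w, b)

    per_example_loss = l2_loss(y, y_pred)
    mse = per_example_loss.mean()    # average loss across the whole data set
    print(per_example_loss, mse)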

For more information >> Descending into ML - Duration: 2:54.

-------------------------------------------

Embeddings - Duration: 14:44.

Hi I'm Sally Goldman and I'm a research scientist at Google and one of the main things I work on is recommendation systems.

And one thing really fundamental to doing these recommendation systems is embeddings and I'm going to talk about those today.

As a motivating example I'm going to look at the problem of collaborative filtering.

So let's say I have a million movies and I have a half million users, and for each user I know which movies that user has watched.

The task is simple: I'd like to recommend movies to users.

To solve this problem I'm really going to have to learn some structure, something that let's me say these movies are similar to each other, so if you've watched these 3 movies then this is a good movie to recommend.

So as a simple starting point, let's try to take these movies and just put them along a line of one dimensional embedding.

So I will say I have maybe to the left I'll put animated movies and as I move to the right, I'll have more adult-like movies.

This starts to do nice things.

I have Shrek and The Incredibles, those are both animated movies for kids and if you watch one the other one is a good recommendation.

But then I have The Triplets of Belleville, which is an animated movie, but really Harry Potter, though not an animated movie, I think is a much closer movie to The Incredibles.

The Triplets of Belleville is not really oriented for kids as much, it's not sort of a blockbuster movie that a lot of people go to see.

And on the other side for example I'd say Blue and Memento are probably better recommendations for each other than The Dark Knight Rises.

So just having a single line, as much as I try, it's going to be really hard to capture all the intricacies in movies that make people like one versus another.

So what if we add another dimension and now I have 2 dimensions?

So what if I bring the blockbuster movies up towards the top and the more art house movies down?

Now I've achieved some of the things I've wanted.

I've got Shrek and The Incredibles and Harry Potter kinda nearby and they're all pretty similar movies and in the bottom right I have Blue and Memento.

And you can imagine that there's a lot of other aspects you'd want to capture and you'd want more than 2 dimensions, and we would.

In reality we could imagine 20, 50, even 100 dimensions to sort of do these embeddings.

But let's stick with 2 dimensions because I can draw it.

So let's add a few more movies to this and I went ahead and added some axes.

I have the X axis which is sort of more children oriented movies to the left and more adult movies to the right.

And the Y axis, more blockbuster movies to the top and more art house films on the bottom.

And you can see a lot of nice structure here and you can see that movies nearby each other are kind of similar and that's really the goal of what we want.

Now I'm drawing this geometrically but I do want to make sure everyone understands that there's a very simple way to represent these embeddings and that's what's going to happen when I learn them in a deep neural network.

So just using Shrek and Blue as an example, each of these is just a single point in this two dimensional space and the way we write down a point is just a value on the X axis and a value on the Y axis.

So for example Shrek is just the point (-1.0, 0.95) and Blue is (0.65, -0.2).

So each movie here can just be represented as two real numbers, and the similarity between movies is now captured by how close these points are.

And although I'm only going to draw 2 dimensions, in reality you do want to do this in D dimensions, 2 isn't going to be enough to capture everything.

Implicitly as you think about what you're doing, this is really assuming that interest in movies can be captured by D dimensions.

I'm allowing D different aspects to be selected and then I can move the movies independently among these D aspects and use that to now bring similar movies nearby to each other.

Each movie now is just a D dimensional point, I can write it down as D real values and the cool thing is we can actually learn these embeddings from data and we can do this with a deep neural network without adding a lot of new things to what you've already seen.

There's no separate training process needed, we're just going to use back propagation exactly as before and the embedding layer is just a hidden layer and we'll have one unit for every dimension you want in your embedding.

Supervised information is going to allow us to tailor these embeddings for whatever task you're after.

If you want to do movie recommendation, then we want these embeddings to be geared towards recommending movies.

We will need some sort of training signal, we'll look at some concrete examples but in this example if a user has watched a set of movies then to some extent those movies are similar to each other and should be nearby and we'll aggregate this of course over lots of data.

Intuitively these hidden units are learning how to organize the data in a way to optimize whatever metric we've decided to put as the final objective of the network.

So now let's go back and look at how would this actually be input to the neural network.

The matrix I show on the right is sort of the classic way we think of collaborative filtering input.

I have one row for every user and one column for every movie and a check in this simple case indicates the user has watched the movie.

So now let's think about how we do this within TensorFlow.

Each example is really just going to be one row of this matrix, so let's focus on the bottom row that I've highlighted in yellow.

If there's a half million movies I don't really want to list all the movies you haven't watched, it's so much more efficient to just write down the movies you have watched.

And when I do back propagation I'll be computing dot products, and I'd like the time for that to also depend only on the movies you have watched.

So to achieve this we're going to use the following input representation and to do this we're going to have 2 phases.

The first pre-processing phase we're going to build what we call a dictionary.

A dictionary is just a mapping from each feature, in this case each movie, to an integer from 0 to the number of movies -1.

So I'll just do this in the order I've shown them in the columns.

So column 0 I'll call movie 0, column 1 movie 1 and so on, and this is a one time thing we do as pre-processing.

Now I can efficiently represent that bottom example as just the 3 movies that user did watch, I don't need to worry about all the other ones.

I do it kind of as a pictorial view but in reality it's just 3 integers - 1, 3, 999,999 - because those are the indices for the 3 movies that user has watched.
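
A minimal Python sketch of those two phases (the movie titles are placeholders; a real dictionary would cover all half million titles):

    # Phase 1 (pre-processing): map each movie to an integer 0 .. num_movies - 1.
    movies = ["Shrek", "Harry Potter", "The Incredibles", "Memento"]  # ... up to ~1M titles
    movie_to_index = {title: i for i, title in enumerate(movies)}

    # Phase 2: represent one user (one row of the matrix) sparsely,
    # listing only the movies that user actually watched.
    watched = ["Shrek", "The Incredibles"]
    sparse_example = sorted(movie_to_index[title] for title in watched)
    print(sparse_example)  # [0, 2] -- just the indices of the watched movies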

Okay so now that we have the input representation we can now look at how this fits into the full network and I'm going to use 3 different examples to help illustrate it.

The first example I want to look at is the problem of predicting a home sales price.

So this would traditionally be done as a regression problem.

I'd like to optimize the square loss between the predicted price and the true sale price.

So the thing that I really would like to create an embedding layer for here is the words in the sale ad, the house description.

Because although there are a set of words, I really need to understand what words are similar in terms of figuring out the size of the house so I may say this is a spacious house or I may say it's roomy.

Those are words that are used that kind of capture the same thing and so I want to begin understanding how these words that real estate agents put in ads helps us understand something about the home.

So we have lots and lots of words that might be in an ad, and any given ad has 100 words or so, and so again we really do want the sparse embedding just like we talked about but my vocabulary is over words versus movies.

I'm going to learn a 3 dimensional embedding in this little toy example just so I can draw it, again in reality you'd probably want a lot more than 3 dimensions.

And I'm always in these examples going to draw my embedding layer as green, it's really a hidden layer, in this case 3 units because I want a 3 dimensional embedding.

I also may have other input data like the latitude, longitude, number of rooms and you can add all that, I just used latitude and longitude as an example.

And then in pink I'm showing the fact that we can have whatever other hidden layers we want, these are just your standard hidden layers, you can have as many as you want.

You can decide how many units, and then at the end they'll go into a single output unit that, since this is a regression problem, will give us a real value, and we'll optimize the L2 loss against the sale price.

In the process of doing back propagation just like you've seen, the embedding layer will be learned.
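
One way to sketch that architecture in code is with the Keras API (not necessarily the APIs used in the course's own exercises); the vocabulary size, hidden-layer width, and optimizer below are arbitrary assumptions for illustration.

    import tensorflow as tf

    vocab_size = 10000      # assumed size of the ad-word vocabulary
    embedding_dims = 3      # the 3-dimensional embedding from the example

    word_ids = tf.keras.Input(shape=(None,), dtype="int32", name="ad_words")  # sparse word indices
    lat_lng = tf.keras.Input(shape=(2,), name="lat_lng")                      # dense features

    embedded = tf.keras.layers.Embedding(vocab_size, embedding_dims, name="word_embedding")(word_ids)
    ad_vector = tf.keras.layers.GlobalAveragePooling1D()(embedded)            # one 3-D vector per ad

    merged = tf.keras.layers.Concatenate()([ad_vector, lat_lng])
    hidden = tf.keras.layers.Dense(16, activation="relu")(merged)             # extra hidden layer(s)
    price = tf.keras.layers.Dense(1)(hidden)                                  # real-valued sale price

    model = tf.keras.Model(inputs=[word_ids, lat_lng], outputs=price)
    model.compile(optimizer="adam", loss="mse")                               # L2 loss vs. the sale price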

As another example, suppose I want to learn to classify handwritten digits.

So I have the digits 0 to 9 and I have some training data where there's actually a label of the correct digit.

So here the sparse thing I want to create an embedding of is just the raw bitmap of the drawing, whether each pixel is white or black, so 0 or 1.

I can introduce whatever other features I'd like, and again I have an embedding layer, which I'll keep at 3 dimensions, so the representation of the digit will go into that.

In pink I show we can have whatever additional hidden layers, and in this case we'll have a logit layer.

We're gonna have the 10 digits and basically learn a probability distribution over the digits of how probable we think it is that this is each of the digits.

I can take the one-hot target probability distribution from what I know the right answer is and optimize a softmax loss.

In the process of doing this, in training with back propagation, I will learn to embed the images.

And now let's look at the example we've been studying of collaborative filtering, the movie recommendation problem.

This is actually interesting, it brings up an aspect we haven't seen yet which is where is my training data here, right?

I just know for each user there is a set of movies, so how do I know what the right movie to recommend is? What am I going to use as the label?

What we do is use a simple trick: suppose the user has watched 10 movies.

We'll randomly pick 3 movies and hold those out, take them away and those are the labels, so those are the movies I'd like to recommend, they're good recommendations because you watched them, and I'll take the other 7 movies and use them as my training data.
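
A sketch of that hold-out trick in plain Python (the movie indices are made up):

    import random

    watched = [12, 87, 503, 999, 1021, 40, 7, 345, 2, 66]   # indices of 10 watched movies

    random.shuffle(watched)
    held_out_labels = watched[:3]    # the movies we will ask the model to predict
    training_inputs = watched[3:]    # the other 7 movies become the input features
    print(training_inputs, held_out_labels)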

Once I've done that, this is very similar to what we just talked about with the character recognition.

I'll take the 7 movies that are my training data and we know how we can get the sparse representation, we'll bring them into the embedding layer.

We can take whatever other features we want, maybe the genre, maybe the director, whatever else we want to take about the movie or the user and then we can bring those into additional hidden layers and we'll have a logit layer.

And note this logit layer is big, instead of 10 different nodes like in the digit prediction, if I had a half million movies there's gonna be a half million of these.

There are issues with that, but they're out of the scope of this discussion.

But we will get a distribution over those half million movies of what movies we think you'd like, and we will then optimize the softmax loss with the held-out movies that we know you do like.

And in doing this in the back propagation and just the standard training, we will learn the embeddings of the movies like we talked about.

So I do want to come back now and just make sure it's clear how what we learned in the deep neural network ties to the geometric view I gave at the beginning.

Let's look at the deep network on the left and let's take a single movie.

Right, if you think of the input layer, each of those nodes at the bottom represents one of these half million movies; I've picked one movie and just made it black.

In this example I said I had 3 hidden units so I was going with 3 dimensional embedding.

So that black node will have an edge connecting it to each of those units; I used red for the first one, magenta for the second and brown for the third one.

When you're done training your neural network, those edges are weights, each edge has a real value associated with it, that's my embedding.

The red is my X value, the magenta is my Y value and the brown is the Z.

So this particular movie would be embedded in a 3 dimensional space as 0.9, 0.2 and 0.4.
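
In code, those edge weights are just the embedding layer's weight matrix. A minimal sketch, assuming a Keras Embedding layer like the one sketched earlier (the index and values here are placeholders; in a trained model the rows hold the learned embeddings):

    import tensorflow as tf

    num_movies, embedding_dims = 500000, 3
    movie_embedding = tf.keras.layers.Embedding(num_movies, embedding_dims, name="movie_embedding")
    _ = movie_embedding(tf.constant([[0]]))          # calling the layer once creates its weight matrix

    # The weight matrix has shape (num_movies, embedding_dims); after training,
    # row i is the learned vector for movie i (e.g. [0.9, 0.2, 0.4] for our black node).
    embedding_matrix = movie_embedding.get_weights()[0]
    some_movie_index = 42                            # whatever index the dictionary assigned
    print(embedding_matrix[some_movie_index])        # that movie's embedding (untrained here)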

As with any deep neural network there are hyperparameters, and one of the hyperparameters we have in the embedding layer is how many embedding dimensions, how many hidden units do you want in that layer?

Higher dimensions are good because it allows us to tease apart more distinctions and therefore we can learn better relationships.

On the downside, as I increase the number of dimensions there is also a chance of overfitting and it's going to lead to slower training and the need for more data.

So a good empirical rule of thumb is for the number of dimensions to be roughly the fourth root of the size of my vocabulary, the number of possible values.

But this is just a rule of thumb and with all hyperparameters you really need to go use validation data and try it out for your problem and see what gives the best results.
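
As a quick worked example of that rule of thumb in Python:

    vocab_size = 500000                         # e.g. half a million movies
    embedding_dims = round(vocab_size ** 0.25)  # rule-of-thumb starting point
    print(embedding_dims)                       # 27 -- then tune this on validation data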

An embedding can also just be thought of as a tool.

One of the things we get from these embeddings is we map items - movies, texts for example the words in the housing description - to these low dimensional real vectors in a way that similar items are nearby.

It creates structure among items that really didn't have any structure before, and that structure is in fact geared towards what you're trying to do with it.

We can also apply embeddings to dense data; for example if I look at the way audio or soundtracks are represented, it's already dense.

But we don't have any meaningful metric, I don't know how to say this audio is similar to that.

And so we can use embeddings just to learn a similarity metric among already dense data, and even further we can embed diverse types of data - texts, images, audio - jointly and learn a similarity metric across them.

For more information >> Embeddings - Duration: 14:44.

-------------------------------------------

Annoying Orange - Storytime: The Legend of Sleepy Hollow! #Shocktober - Duration: 3:37.

For more information >> Annoying Orange - Storytime: The Legend of Sleepy Hollow! #Shocktober - Duration: 3:37.

-------------------------------------------

Jersey Shore: Family Vacation (Season 2) | Official Midseason Supertease | MTV - Duration: 1:40.

♪ Let me take a selfie ♪

- [Group] Ohhhh

(laughing)

- [Snooki] Fight me, bitch!

- What is happening right now?

- [Male's Voice] I don't know

- [Woman's Voice] Let's go take some shots.

- [Computer Voice] Get ready

to Party.

(crowd cheering)

- This is going to be out of control.

It's going to be a (bleep) show.

(playful shouting)

(all exclaiming)

- [Male Voice] You're kidding.

(police sirens)

- [Police Officer] Mike, we have you surrounded.

- Can I get everyone's attention for a second?

- Jen is in town, and I would love

for her to be with us this weekend.

- [Pauly D] Run, mother (bleep), run.

- Your mugshot, you look hot.

- [Pauly D in High Pitched Voice through megaphone ] Awkward

(police sirens)

- Ron and Jen, they're not the only ones

that love a toxic relationship.

- [Male Voice] He just put his feet in her face.

- [Girl's Voice] You should bang him.

Again.

(noisemaker) - Huh?

- [Vinny] Stop talking to me. - Bad, bad, bad.

♪ All my friends are heathens, take it slow ♪

- (multiple people shouting) - Stop talking to me.

- (beep) you.

- We're allowed to (beep) with each other

all day every day, but don't ever (beep) with one of us.

(inaudible yelling)

- Really? - You're a bitch.

- Do you wanna get whopped by a (beep) bitch?

(group yells whoa)

(screaming)

- [male yelling] Oh she punched him.

She punched him.

- Between all the fights,

(yelling)

and the drama.

- [male Voice] - Yo, they're going into the jacuzzi.

- [second male Voice] Oh my God, she's naked.

Oh my God, he's naked.

- I'm living with a bunch

(screaming and laughing)

(cheering)

- of savages.

- (all three men speak at once)

We're coming for you.

(menacing laughter)

(yelling)

(laughter)

For more information >> Jersey Shore: Family Vacation (Season 2) | Official Midseason Supertease | MTV - Duration: 1:40.

-------------------------------------------

Real-World Guidelines - Duration: 0:59.

So to finish up this session, we want to think about some effective machine learning guidelines.

When you're going off to create a machine learning system, please, keep the very first model extremely simple.

A simple linear model is absolutely the place to start, so you can verify the pipeline correctness.

You want to make sure that the data pipeline is in fact fully correct, end to end, before doing any iteration on model quality, because bugs in the data pipeline are extremely hard to track down later on.

You wanna use a simple, observable metric as your first thing to use for training and evaluation, so that you can verify that the model is behaving as you expect.

You want to own and monitor your input features as much as possible.

You want to treat your model configuration as code and make sure that any time you configure a model, that that is reviewed by a teammate and checked in.

Make sure to write down the results of all your experiments, even failures: documentation is incredibly important for later debugging.

For more information >> Real-World Guidelines - Duration: 0:59.

-------------------------------------------

Static vs. Dynamic Training - Duration: 2:18.

One key consideration when designing a machine learning system is whether we do training in a static way, offline, or in a dynamic way in a continuously updating online fashion.

Now when we say static offline training, what we mean is that we essentially have a big store of data and we train our model exactly once, before it's used for a long period of time.

When we talk about dynamic online training, we mean that data is continually coming into our system and we're incorporating that data into the model through small updates.

These both have different strengths and weaknesses.

Now for a static model that's trained offline, there are some definite pluses.

It's easy to build and test, we can just use batch quota, it's nice and cheap, and we can iterate on that model until we know it's good, it's nicely verified, and we can make sure that everything is working well before we ship it off to serving.

The downside here is that it still requires monitoring at the input level at inference time.

If our distribution of inputs changes and our model hasn't adapted, well, we may end up with screwy predictions.

In a similar vein, it's easier for this model to grow stale.

You might imagine if a model is trained on pre-iPhone data and suddenly the iPhone comes out, there's no good way for us to get that information into the model without accounting for some kind of retraining.

A dynamic model that's trained in a continuously updating fashion, of course, requires much more heavyweight monitoring.

We need to have more system complexity and larger monitoring capabilities in order to make sure that this thing isn't going off the rails.

The plus side for all of this extra work is that we have a model that's able to adapt over time as new data comes in, and so we don't have this staleness issue.

So a good place to use offline training is when we don't think that our data's going to change very much over time.

An example for this might be a large image recognition model, where there just aren't that many new kinds of objects that come into the world.

A model that's trained online might be more appropriate for situations where in fact there are trends and seasonalities that change quite often over time and we want to make sure that we're as up to date as possible.

For more information >> Static vs. Dynamic Training - Duration: 2:18.

-------------------------------------------

Logistic Regression - Duration: 3:41.

Let's imagine for a second that we've got the problem of predicting the probability of heads for some coin flips where maybe the coin is slightly bent.

You might use features like angle of bend, coin mass, all kinds of stuff in there.

What's the simplest model you could think of using?

Well we could maybe use linear regression like we've used before.

But there might be some weird things there.

For example, what if we have a new coin that we're predicting on that has a very heavy mass, we've never seen before?

Or what if we have an extremely large bend of the angle?

We might end up with predictions that are outside the range 0 to 1.

That will be really weird for probabilities because probabilities are a special thing.

Probabilities are bounded between 0 and 1 and if we have a prediction model that gives us a value outside that range we're gonna be in trouble.

Especially if we say try and multiply predicted probabilities or use them to create expected values, stuff like that.

Well as a first hack, you could try and cap that prediction to ignore any outlier values.

But now we've introduced bias into our model and we're not going to be very good there.

So really the right thing to do is to come up with a slightly different loss function and prediction method,

one that allows our values to be interpreted naturally as probabilities between 0 and 1 and never exceeds that range.

So we call this idea logistic regression.

It's a prediction method that gives us well calibrated probabilities and these are fantastically useful.

We can use these as real probabilities and multiply them with things to get expected values and use them also for things like classification tasks where we really want to know the real probability that an email is spam or not spam.

So let's take a look at how this might work.

We take our familiar linear model and we stick it into a sigmoid.

And a sigmoid is basically something that gives us a bounded value between 0 and 1.

There are asymptotes, so it never quite hits 0 and it never quite hits 1.

At training time we're training using a different loss function, as we've said square loss is not going to cut it here.

We use something called log loss, which, if you squint closely at the formula, looks very similar to Shannon's entropy measure from information theory.
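
For reference, the prediction is y' = 1 / (1 + e^-(wx + b)), and the log loss for a label y in {0, 1} is -y log(y') - (1 - y) log(1 - y'). Here is a small numpy sketch of both; the weights and features are made-up values for a bent-coin example.

    import numpy as np

    def sigmoid(z):
        """Squashes a linear score into a (0, 1) probability."""
        return 1.0 / (1.0 + np.exp(-z))

    def log_loss(y_true, y_prob):
        """Log loss (binary cross-entropy) for labels in {0, 1}."""
        return -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

    # Linear model score for one bent coin, then its calibrated probability of heads.
    w, b = np.array([0.8, -1.2]), 0.3          # weights for [bend_angle, coin_mass]
    x = np.array([0.5, 1.1])                   # features of one coin (made-up values)
    p_heads = sigmoid(np.dot(w, x) + b)

    print(p_heads, log_loss(1.0, p_heads))     # probability and loss if the flip was heads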

Now you don't need to understand the math to be able to take a look at the graphical interpretation.

Where you'll notice that as you get closer and closer to one of those asymptotes, the loss gets very, very high quite quickly.

Again we see those asymptotes coming into play.

So those asymptotes are actually quite important to think about in terms of learning.

Because of those asymptotes we will need to really incorporate regularization quite explicitly into our learning.

If we don't, then on a given dataset, the model may try and fit our data ever more closely trying to drive those losses near 0.

So L2 regularization can be extremely helpful here to make sure that our weights don't go crazy out of balance.

Why do we like linear logistic regression?

Well one thing is that it's really fast, it's extremely efficient to train and it's efficient to make predictions.

So when we need a method that scales well to massive data or that we need to use for extremely low latency predictions, linear logistic regression can be a great choice.

If we need non-linearities we can get them by adding in feature cross products.

For more information >> Logistic Regression - Duration: 3:41.

-------------------------------------------

Validation - Duration: 1:46.

So now we have this powerful test and training set methodology.

So let's imagine we're using it in practice.

We've got our test set, we've got our training set, we did a good job of separating them out.

And we're gonna now do some iterations.

I'm going to train a model on my training data, I'm gonna test it on my test data, and I'm gonna observe its metrics.

And I'm gonna tweak some setting, maybe I'm gonna tweak the learning rate or something like that.

I'm gonna try again and see if I can improve my test set accuracy.

Maybe I'm gonna add some features in, maybe I'm gonna take those features out, and keep iterating and iterating until I find the best possible model that I can, based on my test set metrics.

Are there any problems here?

Well, one thing I could imagine is that maybe I'm starting now to over-fit to the peculiarities of my test data.

That's too bad.

So, here's another way to handle that.

I can create a third data set out of my partitions, I'm gonna call this my "validation data".

And I'm gonna use a new, slightly augmented, iterative approach, where I'm going to do my iterations by training on my training data, and then evaluating only on my validation data.

Keeping my test data way off to the side and completely unused.

I'm gonna iterate and iterate, tweaking whatever parameters or making whatever changes I want to my model until I get very good results on my validation data.

I'm then, and only then, going to test my model on the final test data.

And I'm gonna make sure that the results that I'm getting on the test data basically match what I'm getting on my validation data.

If they don't, that's a pretty good signal that maybe I was over-fitting to the validation set.
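
A minimal sketch of that three-way partition (the 60/20/20 proportions are an arbitrary illustration, not a prescription from the course):

    import numpy as np

    rng = np.random.default_rng(seed=42)
    examples = np.arange(1000)        # stand-ins for 1,000 labeled examples
    rng.shuffle(examples)

    train = examples[:600]            # used for training in every iteration
    validation = examples[600:800]    # used to evaluate each tweak of the model
    test = examples[800:]             # kept off to the side until the very end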

For more information >> Validation - Duration: 1:46.

-------------------------------------------

Ryan Gosling And Claire Foy Open Up About Filming 'First Man' | TODAY - Duration: 4:15.

For more information >> Ryan Gosling And Claire Foy Open Up About Filming 'First Man' | TODAY - Duration: 4:15.

-------------------------------------------

Intro to Neural Nets - Duration: 2:50.

At this point, we should recognize this problem as a simple, non-linear problem.

Something that we can solve easily with feature cross products.

But what happens if we get a slightly more complicated problem?

Maybe something that looks like this.

At some level we've got maybe a set of spirals interacting with each other.

Now we can probably sit around and do some math and think of the right feature cross products to add.

But it's easy to think that our data sets might be more and more complicated.

And eventually we would like some way for our models to learn the non-linearities themselves without us having to specify them manually.

This is the promise of deep neural nets, that do an especially good job at complex data, including image data, audio data, and video data.

We'll learn more about neural nets in this section.

So we'd like to have models that learn the non-linearities themselves, without us having to specify them manually.

How are we gonna do that?

Well we probably need a model with some additional structure to it.

Let's take a look at our linear model.

Where we have a number of inputs, each with a weight that's combined linearly, to produce an output.

Well, if we wanna get a non-linearity in there, maybe we need to have an additional layer in there.

So now we can add those guys up in a nice linear combination, into a second layer.

That second layer gets linearly combined, and we haven't yet achieved any non-linearity.

Because a linear combination of linear functions is still linear.

Well, that's not good enough, so clearly what we need is another layer, right?

So we put another layer in there and ... we're still linear.

Because even if we add as many layers as we want, any linear combinations of linear functions is still gonna be linear.

Okay, we need to do something else.

And that something else is we need to stick in a non-linearity.

That non-linearity can go at the output of any of our little hidden nodes in there.

One common non-linearity that we use is called ReLU, the rectified linear unit.

And this takes a linear function, and chops it off at zero.

So if you're above zero, you're a linear function; if your function returns a value below zero, we cap that at zero.

Simplest possible non-linear function, and this allows us to create non-linear models.
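
In code, ReLU is just max(0, x). Here is a tiny numpy sketch (with random, untrained weights purely for illustration) of how placing it between two linear layers makes the overall model non-linear:

    import numpy as np

    def relu(x):
        """Rectified linear unit: linear above zero, capped at zero below."""
        return np.maximum(0.0, x)

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)   # first layer: 2 inputs -> 3 hidden units
    W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # second layer: 3 hidden units -> 1 output

    def forward(x):
        hidden = relu(x @ W1 + b1)   # without relu() this would collapse to a linear model
        return hidden @ W2 + b2

    print(forward(np.array([0.5, -1.0])))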

Now we could use any non-linearity in here, and a lot of folks also use other non-linearities, but it turns out that ReLU gives state-of-the-art results for a wide number of problems, and it's very simple.

Once we had this, we can stack these layers up and we can create arbitrarily complicated neural networks.

Now when we train these neural nets, obviously we are in a non convex optimization, so initialization may matter.

The method that we use for training these, is a variant of gradient descent, called back propagation.

And back propagation essentially allows us to do gradient descent in this non convex optimization in a reasonably efficient manner.

For more information >> Intro to Neural Nets - Duration: 2:50.

-------------------------------------------

Feature Crosses - Duration: 4:04.

Let's start off by looking at a linear problem.

You'll recall that a linear problem is something where we can fit a line to separate, say, this spam from the not spam depending on a couple of input variables.

Here we've got two input variables, X1 and X2, these are input features.

Then we've got a linear model here of W1X1 + W2X2, plus a bias term.

Now this is easy to fit right?

However, what if our model needed to model something more complicated?

In particular, what if our data in fact looks more like this?

In this world, there's really no way for us to fit a simple linear model that's going to get anything more than about 50% accuracy.

Now what can we do?

Well one clever idea is that we could define an additional feature.

We're going to call this a synthetic feature, or a feature cross; we're going to do it like this, we're going to call it X3.

We're going to define X3 as the product of X1 and X2.

Now can I use X3 in my linear model?

Yes indeed I can; I'm going to go ahead and add a coefficient, W3 for X3.

And you'll notice that the product of X1 and X2 is always positive if they're both positive or if X1 and X2 are both negative.

So I get this very nice ability to pull out my blue dots if the product of X1 and X2 is positive.

Similarly, the product will always be negative if exactly one of these coordinates is negative.

So this allows us to learn a non-linearity within a linear model, using a simple synthetic feature that's called a cross product.
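
A minimal numpy sketch of that synthetic feature (the four points are a made-up version of the picture described above):

    import numpy as np

    # Four points that no single line can separate by label.
    X = np.array([[ 1.0,  1.0],
                  [-1.0, -1.0],
                  [ 1.0, -1.0],
                  [-1.0,  1.0]])
    y = np.array([1, 1, 0, 0])     # blue dots vs. orange dots

    # The feature cross: x3 = x1 * x2.
    x3 = X[:, 0] * X[:, 1]
    X_crossed = np.column_stack([X, x3])   # augmented features a linear model could train on

    # With the cross, a linear rule works: predict blue whenever x3 > 0.
    print((x3 > 0).astype(int))    # matches y exactly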

So the general name of this process of creating synthetic features as products of other features we call feature crosses, feature cross products.

We can think of these as maybe templates of the form A cross B, they can be complex, A cross B cross C cross D.

And when A and B are boolean features, things that we might get from a one hot encoding, like various strings or bins, the resulting crosses might be quite sparse.

That one hot encoding may have an awful lot of zeros in it.

Let's think of some additional examples here of where feature crosses might be useful.

So if we're predicting housing prices in California, we might want to say cross latitude, or binned latitude, with the number of bedrooms.

And this would allow us to learn that maybe having 3 bedrooms in San Francisco is a very different thing from having 3 bedrooms in Sacramento.

You could also think about a tic-tac-toe predictor and you might think of the cross products that you'd be able to use there by crossing the various coordinates in the tic-tac-toe grid.

Because obviously an X or an O in any particular segment of that tic-tac-toe grid by itself isn't very interesting.

But to be able to say your left corner, center, right, bottom all together is a very different thing.

So why would we want feature crosses?

Well the main thing is that this allows us to incorporate non-linear learning into a linear learner.

Now linear learners are very interesting to us, because they scale very naturally to massive scale data sets.

In fact, for many years linear learners were the only method that we really had that would scale to billions or hundreds of billions sized data sets.

Now we also have deep neural networks that can scale well, so that's another option that we'll look at later on in this class.

Finally, there's been some interesting research showing that combining the effects of linear learning through feature cross products and deep networks can be extremely powerful for modeling.

For more information >> Feature Crosses - Duration: 4:04.

-------------------------------------------

Static vs. Dynamic Inference - Duration: 2:16.

Another important design consideration when we're designing machine learning systems, is whether we do inference in an offline or an online fashion.

And by inference, of course, we mean making predictions.

Are we gonna do this once offline and write those predictions out to some table or some static place that we can then read from, or are we gonna do it continuously on demand with our model stored in some server and predicting on new data as it comes in with new requests?

Again these have different strengths and weaknesses.

If we're doing offline scoring then we get a really nice opportunity to do post-prediction validation.

We write out all our scores and we can sanity check that these scores make sense before they are used to influence the real world.

We also don't have to worry quite so much about computational costs.

If our model is expensive, if it takes a lot of cost to compute our predictions, well we can just throw more machines at it.

Maybe using batch quota or some giant MapReduce, and not have to worry about things too much.

The drawback to offline scoring is that we need to have all of our examples in hand at the time that we're doing those predictions.

So for areas where we might have a long tail or some crazy distribution that's changing on us, we might not be in such luck there.

But if we know in advance what all of our examples are going to be or if we only care about head queries, something like that, then we can stick those all in a nice offline scoring lookup table.

For online scoring, remember that we're putting that model into a server and then querying that server on demand.

Now this is great because it means that for long tail situations any new example that comes in could be crazy, whatever, we can still get a prediction for it on demand.

However, the latency issues are something that we will need to think about.

A lot of situations are latency sensitive, so we can't take more than five or ten or fifteen milliseconds to make our predictions.

Which means that if our model is expensive to compute we might have to throw a large amount of production-level resources at our jobs; this can be expensive, so we'll need to budget for that.

The other thing to think about is that we may need to increase our game in terms of monitoring.

Not only do we need to monitor the serving jobs themselves, but we also need to monitor the output distributions of our predictions to make sure that nothing is going haywire.

So again there's plusses and minuses to online serving.

For more information >> Static vs. Dynamic Inference - Duration: 2:16.

-------------------------------------------

Literature Example - Duration: 2:43.

Let's take a look at another interesting machine learning system that had a little quirk in it.

So, we were working with a professor of 18th century English literature.

Who had developed a database of metaphors of the mind.

Basically, he had gone through all of 18th century literature and every time an author had said: 'The mind is a garden.' or 'The mind is a blank slate', he'd written those down in a database.

And he was interested to know, whether the metaphors to the mind that an author used, were indicative of their political affiliation.

Well it seemed like this is something we can test using machine learning; we'll build a model, see if we can predict these things.

So, we did this.

We created a set of training and test data, where each example was one metaphor of the mind that an author had used, and the label of the author's political affiliation; [unintelligible] or whatever they did in the 18th century.

We divided those up, sentence by sentence, into test, training, and validation splits, trained up a model, and found that our test and validation accuracy was incredibly good.

It looked like we could indeed predict an author's political affiliation, based only on their metaphors of the mind.

Total success.

Except, then we looked a little deeper.

And realized that maybe our accuracy was actually suspiciously high.

Was there anything here that was going wrong?

Well, one of the things that we noticed was that when we created our test, training, and validation splits, we did so by dividing up our examples, sentence by sentence.

And so, if we looked at, say, Samuel Richardson, some of his sentences would end up in the training data, some in the validation data, and some in the test data.

This meant that the model had the ability to learn specific qualities about Richardson's use of language beyond just the metaphors that he used.

And, in a sense, get to memorize a little extra stuff about him when it came time to be applied at test time.

We ran another experiment, where we divided things up at the author level, so that given authors were only in the training data, or only in the test data, or only in the validation data.

And when we ran the experiment this time, we found that it was much more difficult to get good accuracy on test data and that it's much more difficult to predict the political affiliation based only on the metaphorical data.

So this was kind of interesting.

One thing was we got to write two papers, one with each viewpoint.

The second was that we learned an important lesson about how critical it is to think about the ways that our training data and test data are randomized for these splits.

As a machine learning person, it's easy to think; 'Well, we'll just randomize the data!'.

We actually have to know what the data represents in order to create these good splits.
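
A sketch of what splitting at the author level looks like in Python (the authors and sentences are placeholders, not the professor's actual database):

    import random

    examples = [
        ("Richardson", "The mind is a blank slate ..."),
        ("Richardson", "The mind is a garden ..."),
        ("Sterne",     "The mind is a clockwork ..."),
        ("Burney",     "The mind is a mirror ..."),
    ]  # (author, metaphor sentence) pairs -- placeholder data

    authors = sorted({author for author, _ in examples})
    random.shuffle(authors)
    test_authors = set(authors[: len(authors) // 3])   # hold out roughly a third of the authors

    train = [ex for ex in examples if ex[0] not in test_authors]
    test = [ex for ex in examples if ex[0] in test_authors]
    # Every sentence by a given author is now entirely in train or entirely in test.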

For more information >> Literature Example - Duration: 2:43.

-------------------------------------------

Reducing Loss - Duration: 4:23.

Hi, my name is Cassandra Xia and I'm a programmer at Google

who helps other groups within Google use TensorFlow.

In this section we're gonna talk about reducing loss.

Previously we learned how to compute the loss, but how do we choose the set of

model parameters that minimizes it?

Well what would be nice is if we had a direction to go in within parameter space.

Some sort of guide such that each new set of model parameters that we took on

had a lower loss than the one before it.

One way to get a direction is to compute the gradient.

The derivative of the loss function with respect to the model parameters.

For simple loss functions like the square loss the derivative is easy to compute.

And it gives us an efficient way to update model parameters.

Think of it as an iterated approach.

Data comes in, we compute the gradient of the loss function on that data.

The negative gradient tells us in which direction to update model parameters

in order to reduce loss. We take a step in that direction,

get a new version of the model, and now we can recompute the gradient and repeat.

Pretend in one dimension, this is our loss function.

It maps our single model parameter theta to the loss.

If we start off at a random value or initialization for theta

then we achieve a corresponding loss.

We can then compute the negative gradient which tells us

in which direction we should go in order to minimize the loss.

If we take a gradient step in that direction we get a new loss.

We can keep taking gradient steps in that direction until we've reached a point

at which we have passed the local minimum, where the

negative gradient will tell us to go back in the direction that we came from.

How large of a step should we take in the direction of negative gradient?

Well, that is dictated by the learning rate.

A hyper-parameter that you can twiddle.

If the learning rate is really small then we'll take a bunch of teeny tiny gradient steps,

requiring a lot of computation in order to reach the minimum.

However, if the learning rate is very large then we'll take a large step

in the direction of the negative gradient, potentially overshooting the local minimum

and even reaching a point at which the loss is even bigger than before.

In more dimensions, this would cause your model to diverge.

In which case, you should try

decreasing the learning rate by an order of magnitude or so.

We just described an algorithm called gradient descent.

We start somewhere and we continuously take steps that hopefully get us

closer and closer to some minimum. However, does it matter where we start?

Well let's think for a minute.

If we put ourselves back in calculus class, we learned that

some problems are convex, meaning that they're shaped like a giant bowl.

So as long as we start somewhere on the bowl and we take reasonable step sizes

and follow the gradients, eventually we'll find our way to the bottom of the bowl.

However, many machine learning problems are not convex.

Neural networks are notoriously not convex,

meaning that rather than being shaped like a bowl,

they are shaped more like an egg crate.

Where there are many possible minimum values,

some of which are better than others.

So there, initialization does matter.

More on that later.

Let's think for a moment about efficiency.

When we're computing the gradient of the loss function,

math suggests that we should compute the gradient

over all examples in our data set.

This is the only way to guarantee that our gradient steps

are in exactly the right direction.

For large data sets with a million or billion examples,

that would be a lot of computation

in order to perform each step.

Empirically, people have found that rather than using the entire

data set, if they compute the gradient of the loss function over a single example

that mostly works too. Even though they'd have to take more overall steps,

the amount of total computation in order to reach a good solution

is often much smaller.

This is called stochastic gradient descent.

In practice, we adopt an intermediate solution.

Rather than use one example or the entire data set,

we use a small batch, somewhere between ten and a thousand examples

to perform our steps. This is called mini-batch gradient descent.
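
Putting those pieces together, here is a small numpy sketch of mini-batch gradient descent for the linear model with squared loss (the batch size, learning rate, and synthetic data are illustrative choices, not recommendations):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=1000)
    y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=1000)   # synthetic "true" relationship

    w, b = 0.0, 0.0          # starting values for the model parameters
    learning_rate = 0.01     # how large a step to take along the negative gradient
    batch_size = 32

    for step in range(2000):
        idx = rng.integers(0, len(x), size=batch_size)   # sample a mini-batch
        xb, yb = x[idx], y[idx]
        error = (w * xb + b) - yb                        # prediction minus true value
        grad_w = 2 * np.mean(error * xb)                 # d(loss)/dw for squared loss
        grad_b = 2 * np.mean(error)                      # d(loss)/db
        w -= learning_rate * grad_w                      # step in the negative gradient direction
        b -= learning_rate * grad_b

    print(w, b)   # should land near 3.0 and 2.0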

For more information >> Reducing Loss - Duration: 4:23.

-------------------------------------------

Training Neural Nets - Duration: 2:54.

So when we think about how to train neural networks, what do you need to know about, say, back propagation?

One thing you don't need to know about back prop is how to implement it.

That's one of the brilliant things that TensorFlow does for us, is it takes the internals of back propagation and does that all for us underneath the hood.

But there are some important things to know.

The first is that back prop really does rely on this idea of gradients; things need to be differentiable for us to be able to learn on them.

One or two small discontinuities in our various functions are fine, but in general we need differentiable functions to be able to learn with neural nets.

Another thing to know is that gradients can vanish.

If our networks get too deep, the signal-to-noise ratio gets bad as you go further and further down the model, and learning can become quite slow.

ReLUs can be useful there; there are also some other strategies that we won't talk about in this class.

But in general you do want to think about limiting the depth of your model to sort of the minimum effective depth if you can.

It's also important to know that gradients can explode; if our learning rates are too high, we get these sort of crazy instabilities, we can get NaNs in our model.

The thing to do there is to try again with a lower learning rate.

The last thing to know is that ReLUs can die.

It's possible that because we have this hard cap at zero, if we end up with everything below that value of zero, there's no way for gradients to get propagated back through, and we'll never be able to pull ourselves back up into the land of living ReLU layers.

So keep an eye out for those and again try again with a different initialization or a lower learning rate.

At training time, it's often very useful for us to have normalized feature values when they come in.

If things are on roughly the same scale, this can help speed the convergence of neural nets.

So the exact value of the scale doesn't really matter; we often recommend negative one to plus one as an approximate range.

It could be minus five to plus five, or zero to one; it doesn't really matter so long as all of our inputs are on roughly the same scale.

Finally, one last trick that's useful in training deep networks is the idea of an additional form of regularization that is called dropout.

And dropout is kind of a funny idea.

When we apply dropout, what we're saying is that with probability P we take a node and we essentially remove it from the network for a single gradient step.

On different gradient steps, we repeat and we'll take different nodes to drop out randomly.

So the more you drop out, the stronger regularization you have.

And you can kind of see this clearly where if you drop everything out you have an extremely simple model that is essentially useless.

If you drop out nothing, you allow the model to have its full complexity and if you have dropout somewhere in the middle, you're applying some sort of useful regularization there.
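
A sketch of how dropout is typically implemented during training (so-called "inverted" dropout; the drop probability and layer size here are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(activations, p_drop, training=True):
        """Randomly zero each unit with probability p_drop for this gradient step."""
        if not training or p_drop == 0.0:
            return activations
        keep_mask = rng.random(activations.shape) >= p_drop
        # Scale the survivors so the expected activation stays the same ("inverted" dropout).
        return activations * keep_mask / (1.0 - p_drop)

    hidden = rng.normal(size=(4, 8))        # activations for a batch of 4 examples, 8 units
    print(dropout(hidden, p_drop=0.5))      # roughly half the units zeroed out this step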

Dropout is one of the key advances that has enabled a number of the strong results that we've gotten recently that has pushed deep learning to the forefront.

For more information >> Training Neural Nets - Duration: 2:54.

-------------------------------------------

Production ML Systems - Duration: 1:04.

So far we've spent a lot of time talking about how to build models that do a good job of predicting on new, previously unseen data.

And this is of course the heart of any machine learning system.

But a machine learning system needs to be able to make predictions that influence the real world, and to do so it needs to be part of a much larger ecosystem than just a little black box that does machine learning.

If we look at a full machine learning system, it includes many components that are not about training.

For example we need to have data collection, we need to have feature extraction, data verification, various forms of monitoring and data analysis.

We also need to get these things out to serving so that they can make predictions that are used in the real world.

Now these are a lot of different pieces.

Fortunately you don't have to build all these pieces yourself, there's a large number of components that can be pulled off the shelf and used in your application setting.

But how do you know which of these components are the right ones for your particular task?

Turns out that there are a couple of key design choices we can look at that will help guide these choices.

We'll take a look at these now.

For more information >> Production ML Systems - Duration: 1:04.

-------------------------------------------

Cancer Example - Duration: 2:02.

So let's look at a real world example of a machine learning system that had an interesting issue.

So this was in the case of cancer prediction, there was a model a couple of years ago that was trained up to predict the probability that a patient had cancer from medical records.

So in those records we extracted features, things like patient age, gender, medical conditions, hospital name, vital signs, test results, you know, all that kind of stuff.

And the folks who trained this model felt like geniuses, because once they trained that model it gave outstanding results on their held-out test data, and they were extremely careful not to mix any test and training data.

However when they applied the model to new patients, the model did terribly.

They couldn't figure out what was going wrong; what could have happened there?

Well, here's what happened: one of the features that was included, as we said, was the name of the hospital.

Some of those names were things like "Beth Israel Cancer Center".

Turns out that that's an extremely useful indicator to know if the patient in consideration in fact is suffering from cancer.

Now, even if that string didn't contain the word "cancer", it's the case that many hospitals have different specialties.

So some hospitals specialize in cancer treatment, some do not.

Even if you were to take that "Beth Israel Cancer Center" and turn that into an anonymized integer, there would still be many patients that were strongly correlated with that particular hospital, because that was a hospital that specialized in cancer treatment.

But for new patients that haven't been assigned to a hospital yet, we don't know that information.

Turns out that showing the model the name of the hospital, is a subtle form of cheating.

It's a subtle way of exposing a doctor's diagnosis to the model, information that wouldn't be available if the model were acting in place of the doctor.

So we call this "label leakage", where a little bit of the training labels leak through into the features and allow the model to cheat.

This is a classic failure mode that's important to avoid.
