- Hello everyone, welcome to CS231.
I'm Song Han. Today I'm going to give a guest lecture
on the efficient methods and hardware for deep learning.
So I'm a fifth year PhD candidate here at Stanford,
advised by Professor Bill Dally.
So, in this course we have seen a lot of convolutional neural
networks, recurrent neural networks, and even,
since last time, reinforcement learning.
They span a lot of applications,
for example self-driving cars, machine translation,
AlphaGo, and smart robots.
And it's changing our lives, but there is a recent
trend that in order to achieve such high accuracy,
the models are getting larger and larger.
For example for ImageNet recognition, the winner from
2012 to 2015, the model size increased by 16X.
And for Baidu's Deep Speech, in just one year,
the number of training operations increased by 10x.
Such large models create lots of problems.
For example, the model size becomes larger and larger,
so it's difficult to deploy them, for example, on mobile phones.
If the app is larger than 100 megabytes,
you cannot download it until you connect to Wi-Fi.
So the product managers at, say, Baidu and Facebook
are very sensitive to the binary size of their models.
And also for self-driving cars, you can only
update the model over the air;
if the model is too large, that's also difficult.
And the second challenge for those large models is
that the training speed is extremely slow.
For example, ResNet-152, which is actually less than 1%
more accurate than ResNet-101,
takes 1.5 weeks to train on four Maxwell M40 GPUs.
That greatly limits how fast we can do homework,
and it makes designing new models pretty slow for researchers.
And the third challenge for those bulky models is energy efficiency.
For example, AlphaGo beating Lee Sedol last year
took about 2,000 CPUs and 300 GPUs, which cost $3,000
just to pay the electric bill, which is insane.
So on embedded devices, those models drain your battery,
and in the data center they increase the total cost
of ownership of maintaining a large data center.
For example, Google mentioned in their blog that
if all users used Google Voice Search for
just three minutes a day, they would have to double their data centers.
So that's a large cost,
and reducing it is very important.
So let's see where the energy is actually consumed.
A large model means lots of memory access:
you have to load the model from memory,
and that means more energy.
If you look at how much energy is consumed by memory access
versus by the multiply and add arithmetic operations,
memory access is two to three orders of magnitude
more energy consuming than the arithmetic.
So how do we make deep learning more efficient?
We have to improve energy efficiency through
algorithm and hardware co-design.
This is the previous way of designing hardware:
we take some benchmarks, say SPEC 2006,
run those benchmarks, and tune the CPU architecture for them.
Now what we should do is open up the box, see what
we can do from the algorithm side first, and ask what
the optimal "?-PU", the question-mark processing unit, should be.
That breaks the boundary between algorithm
and hardware to improve the overall efficiency.
So today's talk, I'm going to have the following agenda.
We are going to cover four aspects: algorithm and hardware,
inference and training.
They form a small two-by-two matrix, which includes the
algorithm for efficient inference,
hardware for efficient inference
and the algorithm for efficient training,
and lastly, the hardware for efficient training.
For example, I'm going to cover the TPU, I'm
going to cover the Volta.
But before I cover those things, let's have three
slides for Hardware 101.
A brief introduction of the families of hardware
in such a tree.
So in general, we can have roughly two branches.
One is general purpose hardware.
It can run any application, versus specialized
hardware, which is tuned for a specific kind of
application, a domain of applications.
The general purpose hardware includes the CPU
and the GPU, and the difference is that the CPU is
latency oriented and single threaded;
it's like a big elephant,
while the GPU is throughput oriented:
it has many small, weak threads, thousands of such
small weak cores,
like a group of ants, where there are so many ants.
And for specialized hardware, roughly there are FPGAs and ASICs.
FPGA stands for Field Programmable Gate Array.
It is hardware programmable, so its logic can be changed.
That makes it cheaper to try new ideas and do prototypes,
but it's less efficient;
it's in the middle between general purpose hardware and a pure ASIC.
ASIC stands for Application Specific Integrated Circuit.
It has fixed logic, designed just for a certain application,
for example deep learning.
Google's TPU is a kind of ASIC, and the GPUs we train
neural networks on sit on the general purpose side.
And another slide for Hardware 101 is number representation.
In this slide I want to convey the idea that
the numbers in a computer are not real numbers;
they are actually discrete.
Even for 32-bit floating point numbers,
the resolution is not perfect:
it's not continuous, it's discrete.
So for example FP32 means using 32 bits to represent
a floating point number.
There are three components in the representation:
the sign bit S, the exponent bits E, and the mantissa M,
and the number it represents is (-1)^S x 1.M x 2^(E-127),
where 127 is the exponent bias.
Similarly, there is FP16, which uses 16 bits to represent
a floating point number.
In particular, I'm going to introduce INT8, which the
Google TPU uses: an integer representing a fixed point number.
We have a certain number of bits for the integer part,
followed by a radix point, which can be placed differently
for different layers,
and lastly the fractional bits.
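To make those two formats concrete, here is a small Python sketch (added for illustration, not from the slides) that decodes the sign, exponent, and mantissa fields of a normal FP32 value and round-trips a number through an 8-bit-style fixed point representation with an assumed 4-bit fractional part:

```python
import struct

def decode_fp32(x):
    """Decode a normal FP32 value into sign, exponent, mantissa: (-1)^S * 1.M * 2^(E-127)."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF          # 8 exponent bits, bias 127
    mantissa = bits & 0x7FFFFF              # 23 mantissa bits
    value = (-1) ** sign * (1 + mantissa / 2**23) * 2 ** (exponent - 127)
    return sign, exponent, mantissa, value

print(decode_fp32(2.5))                     # (0, 128, 2097152, 2.5)

# A toy 8-bit fixed point format with 4 fractional bits: store value * 2^4 as an integer.
def to_fixed(x, frac_bits=4):
    return int(round(x * (1 << frac_bits)))

def from_fixed(q, frac_bits=4):
    return q / (1 << frac_bits)

print(to_fixed(2.5), from_fixed(40))        # 40 2.5
```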
So why do we prefer eight bits or 16 bits
rather than the traditional 32-bit floating point?
The reason is cost.
I generated this figure from 45 nanometer technology,
showing the energy cost versus the area cost for different
operations.
In particular, going from 32 bit to 16 bit,
we get about a four times reduction in energy
and also about a four times reduction in area.
Area means money:
every square millimeter costs money when you tape out a chip.
So it's very beneficial for hardware design to go from
32 bit to 16 bit.
That's why you hear that NVIDIA, from the Pascal architecture,
said they are starting to support FP16.
That's the reason; it's so beneficial.
For example, if the battery previously lasted four hours,
now it lasts 16 hours.
That's what it means to reduce
the energy cost by four times.
But there is still the problem of the large energy cost
of reading memory.
Let's see how we can deal with these expensive memory references.
So let's switch gears and come to our topic directly.
Let's first introduce algorithms for efficient inference.
I'm going to cover six topics; this is a really long section,
so I'm going to go relatively fast.
So the first idea I'm going to talk about is pruning.
Pruning the neural networks.
For example, this is the original neural network.
What I'm trying to do is: can we remove some of the
weights and still have the same accuracy?
It's like pruning a tree, getting rid
of the redundant connections.
This was first proposed by Professor Yann LeCun back in 1989,
and I revisited the problem 26 years later on
modern deep neural nets to see how it works.
So not all parameters are useful, actually.
For example, in this case, if you want to fit a straight line
but you're using a quadratic term, then apparently the
0.01 coefficient is a redundant parameter.
So I'm going to train the connectivity first, then
prune some of the connections,
then retrain the remaining weights,
and iterate this process.
As a result, on AlexNet I can reduce the number of connections
from about 60 million parameters to only
six million, which is roughly 10 times fewer.
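As a rough illustration of the prune-and-retrain loop just described, here is a minimal numpy sketch; the `train_step` callback, the learning rate, and the sparsity schedule are placeholder assumptions, not the actual training setup:

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude weights; return pruned weights and a mask."""
    threshold = np.percentile(np.abs(weights), sparsity * 100)
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

def iterative_prune(weights, train_step, sparsity_schedule=(0.5, 0.7, 0.9), steps=1000):
    """Alternate pruning and retraining; pruned weights are forced to stay at zero."""
    mask = np.ones_like(weights)
    for sparsity in sparsity_schedule:
        weights, mask = prune_by_magnitude(weights, sparsity)
        for _ in range(steps):
            grad = train_step(weights)   # placeholder: returns the gradient of the loss
            weights -= 0.01 * grad
            weights *= mask              # keep pruned connections at zero every iteration
    return weights, mask
```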
So this is the accuracy curve.
The x-axis is how many parameters we prune away,
and the y-axis is the accuracy.
We want fewer parameters, but we also
want the same accuracy as before;
we don't want to sacrifice accuracy.
For example, at 80%, if we simply zero away 80%
of the parameters, the accuracy drops by about 4%.
That's intolerable.
But the good thing is that if we retrain the remaining
weights, the accuracy fully recovers here.
And if we do this process iteratively,
pruning and retraining, pruning and retraining,
the accuracy doesn't drop until we have pruned away
about 90% of the parameters.
So if you go home and try it, say in your notebook
on your homework, just zero away 50% of the parameters,
and you will be astonished to find
that the accuracy actually doesn't get hurt.
We just mentioned convolutional neural nets;
how about RNNs and LSTMs? So I tried it with NeuralTalk.
Again, pruning away 90% of the weights doesn't hurt the
BLEU score.
And here are some visualizations.
For example, for the original picture, NeuralTalk says
"a basketball player in a white uniform is playing
with a ball."
After pruning away 90%, it says "a basketball player
in a white uniform is playing with a basketball."
And so on.
But if you're too aggressive, say you prune away
95% of the weights, the network is going to get drunk.
It says, a man in a red shirt and white and black shirt
is running through a field.
So there's really a limit, a threshold, you have to
take care of during the pruning.
Interestingly, after I did this work, I did some
research and found that the same pruning procedure
actually happens in the human brain as well.
When we are born, there are about 50 trillion synapses
in the brain.
At one year old, this number surges to 1,000 trillion.
And as we become adolescents it actually becomes smaller,
about 500 trillion in the end, according to a study in Nature.
So this is very interesting.
Also, pruning changes the weight distribution:
we remove those small connections around zero,
and then we retrain the remaining weights,
which is why the distribution looks like this in the end.
Yeah, question.
- [Student] Do you mean that the weights you pruned
during training are just set to zero,
and then training continues from there,
with those weights staying at zero?
- Yeah. So the question is, how do we deal with those
zero connections?
So we force them to be zero in all the other iterations.
Question?
- [Student] How do you pick which weights to drop?
- Yeah, so it's very simple: sort the weights by magnitude
and drop the small ones. If it's small, just--
- [Student] So there's a threshold that you decide?
- Exactly, yeah.
So the next idea: weight sharing.
Remember, our end goal is to remove connections
so that we have a smaller memory footprint
and a more energy efficient deployment.
Now we have fewer parameters thanks to pruning.
Next we want fewer bits per parameter,
so that multiplied together we get a small model.
The idea is this:
not all the weights have to be exact numbers.
For example 2.09, 2.12, and so on: for all these four weights,
you can just use 2.0 to represent them.
That's enough;
an overly precise number just leads to overfitting.
So the idea is I can cluster the weights if they
are similar, just using a centroid to represent
the number instead of using the full precision weight.
So that every time I do the inference, I just do inference
on this single number.
For example, this is a four by four weight matrix
in a certain layer.
And what I'm going to do is k-means clustering,
having similar weights share the same centroid.
For example, for 2.09 and 2.12, I store an index of
three pointing to this centroid.
The good thing is that we only need to store the
two-bit index rather than the 32-bit floating point number.
That's a 16 times saving.
And how do we train such a neural network?
The weights are tied together, so after we get the gradients,
we color them in the same pattern as the weights,
then do a group-by operation, grouping all
the gradients whose weights share the same index.
Then we do a reduction by summing them up,
multiply by the learning rate,
and subtract from the original centroid.
That's one iteration of SGD for such a weight-shared
neural network.
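Here is a small sketch of that weight-sharing scheme using scikit-learn's k-means; the 2-bit codebook size and the gradient group-by update follow the description above, but the function names and learning rate are just illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def weight_share(weights, bits=2):
    """Cluster weights into 2^bits centroids; store small indices plus a codebook."""
    k = 2 ** bits
    km = KMeans(n_clusters=k, n_init=10).fit(weights.reshape(-1, 1))
    codebook = km.cluster_centers_.flatten()        # k shared full-precision values
    indices = km.labels_.reshape(weights.shape)     # per-weight small integer index
    return codebook, indices

def sgd_step(codebook, indices, grad, lr=0.01):
    """Group gradients by index, sum each group, and update the shared centroids."""
    for i in range(len(codebook)):
        codebook[i] -= lr * grad[indices == i].sum()
    return codebook

# At inference time the decoded weights are simply: weights_hat = codebook[indices]
```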
So remember, previously, after pruning, this is
what the weight distribution looks like, and after
weight sharing the weights become discrete.
There are only 16 different values here, meaning
we can use four bits to represent each number.
And by training such a weight-shared neural network,
these shared values can adjust;
it is these subtle changes that compensate for the
loss of accuracy.
So here, this is the number of bits we give it,
and this is the accuracy, for the convolution layers.
Not until four bits does the accuracy begin to drop,
and for the fully connected layers, quite astonishingly,
not until two bits, only four distinct values,
does the accuracy begin to drop.
And this result is per layer.
So we have covered two methods, pruning and weight sharing.
What if we combine these two methods together;
do they work well?
By combining them, this is the compression ratio,
with smaller models on the left, and this is the accuracy.
We can combine them and make the model
about 3% of its original size without hurting the
accuracy at all.
Compared with each method working individually,
where accuracy begins to drop at around 10% of the original size,
and compared with the SVD method, which is cheap,
this has a much better compression ratio.
And the final idea is that we can apply Huffman coding:
use more bits for the infrequently appearing weights
and fewer bits for the frequently appearing weights.
So by combining these three methods, pruning, weight
sharing, and Huffman coding, we can compress
state-of-the-art neural networks
by 10x to 49x without hurting the prediction accuracy.
Sometimes it's even a little bit better,
but maybe that is noise.
So the next question is: these are just pre-trained
models from, say, Google or Microsoft.
Can we make a compact model to begin with,
even before such compression?
So, SqueezeNet; you may have already worked with this
neural network model in a homework.
The idea is to have a squeeze layer of 1x1 convolutions here,
so that the 3x3 convolutions see fewer input channels.
That's where "squeeze" comes from.
And here we have two branches, rather than four branches
as in the Inception module.
So as a result, the model is extremely compact.
It doesn't have any fully connected layers.
Everything is fully convolutional.
The last layer is a global pooling.
So what if we apply the deep compression algorithm
to such an already compact model; will it get even smaller?
This is AlexNet after compression, and this is SqueezeNet.
Even before compression, it's 50x smaller than AlexNet,
but has the same accuracy.
After compression it's 510x smaller, with the same accuracy,
at less than half a megabyte.
This means it's very easy to fit such a small model
in the on-chip cache, which is literally
tens of megabytes of SRAM.
So what does this mean?
It's possible to achieve a speedup.
This is the speedup I measured, on the fully
connected layers only for now, on a CPU, a GPU, and
a mobile GPU, before and after pruning the weights.
On average, I observed about a 3x speedup on the CPU,
about 3x on the GPU,
and roughly 5x on the mobile GPU, which is a TK1.
And the same for energy efficiency:
an average improvement of 3x to 6x on the CPU, GPU,
and mobile GPU.
And these ideas are used in these companies.
Having talked about pruning and weight sharing,
which is a non-linear quantization method,
we're now going to talk about quantization,
which is what's used in the TPU design.
The TPU uses only eight bits for inference,
and the way it can do that is because of quantization.
Let's see how it works.
So quantization has this complicated figure, but
the intuition is very simple.
You train the neural network with the normal
floating point numbers.
Then you quantize the weights and activations by gathering
statistics for each layer:
for example, what is the maximum number, the minimum number,
and how many bits are enough
to represent this dynamic range.
Then you use that number of bits for the integer part,
and the remaining bits of the 8-bit representation
for the fractional part.
You can also fine-tune in floating point format,
or do the feed-forward pass in fixed point and the
back propagation and weight update in floating point.
There are lots of different ways to get better accuracy.
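One common flavor of this is symmetric linear quantization from the observed per-layer range; here is a minimal sketch (the exact per-layer bit-allocation scheme used by TPU-style hardware differs in its details):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization: pick a scale from the observed dynamic range."""
    max_abs = np.abs(x).max()                 # per-layer statistic
    scale = max_abs / 127.0                   # map [-max_abs, max_abs] to [-127, 127]
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())     # quantization error is bounded by ~scale/2
```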
And this is the result: the number of bits
versus the accuracy.
For example, using fixed-point 8 bit, the accuracy for
GoogleNet doesn't drop significantly,
and for VGG-16 the accuracy also holds up pretty well.
But going down to six bits, the accuracy
begins to drop pretty dramatically.
Next idea: low rank approximation.
It turns out that you can break one convolution layer
into two: one convolution here,
followed by a one-by-one convolution.
It's like breaking a complicated problem
into two smaller problems.
This is for convolution layers.
As we can see, at about a 2x speedup there is
almost no loss of accuracy,
and at a speedup of 5x, roughly a 6%
loss of accuracy.
And this also works for fully connected layers.
The simplest idea is using SVD to break
one matrix into two matrices.
And following this idea, this paper proposes to use the
Tensor Train decomposition to break one fully connected layer
into a chain of many small tensor factors.
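For the SVD case, here is a tiny sketch of how one fully connected layer becomes two thin ones (the rank and layer sizes are chosen arbitrarily for illustration):

```python
import numpy as np

def low_rank_fc(W, rank):
    """Approximate an m x n fully connected weight matrix by two thin matrices."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]        # m x rank
    B = Vt[:rank, :]                  # rank x n
    return A, B

m, n, r = 1024, 1024, 64
W = np.random.randn(m, n)
A, B = low_rank_fc(W, r)
# One big layer becomes two small ones: y = (x @ A) @ B.
# Parameter count drops from m*n to (m + n) * r.
print(W.size, A.size + B.size)        # 1048576 vs 131072
```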
Going even more crazy: can we use only two or three
distinct values to represent the weights of a neural network,
binary or ternary weights?
We've already seen this distribution before, after pruning:
there are some positive weights and some negative weights.
Can we just use three numbers, one, minus one, and zero,
to represent the network?
This is our recent paper, Trained Ternary Quantization,
where we maintain full precision weights during training time,
but at inference time we keep only the scaling factors
and the ternary weights.
So during inference we only need three distinct values,
which is very efficient and makes the model very small.
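A minimal sketch of the ternarization step follows; in the actual Trained Ternary Quantization method the positive and negative scaling factors are learned during training, whereas here they are fixed constants and the threshold ratio is an assumed hyper-parameter:

```python
import numpy as np

def ternarize(w_full, threshold_ratio=0.05, wp=1.0, wn=1.0):
    """Quantize full-precision weights to {+wp, 0, -wn}.
    wp/wn stand in for the learned per-layer scaling factors."""
    delta = threshold_ratio * np.abs(w_full).max()
    w_t = np.zeros_like(w_full)
    w_t[w_full > delta] = wp
    w_t[w_full < -delta] = -wn
    return w_t

# Training keeps w_full in fp32 and updates it with gradients computed through
# the ternary forward pass; at inference time only w_t and (wp, wn) are kept.
```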
This shows the proportion of positive, zero,
and negative weights; they can change during training,
and so can their magnitudes, the scaling factors.
And this is a visualization of the kernels learned
by trained ternary quantization.
We can see some of them are corner detectors,
like here and here,
and some of them are maybe edge detectors,
like this filter.
So actually we don't need such fine-grained resolution;
just three values are enough.
This is the validation accuracy on ImageNet with AlexNet.
The dashed line is the baseline accuracy
with 32-bit floating point,
and the red line is our result:
it converges to pretty much the same accuracy
as the full precision weights.
Last idea: the Winograd transformation.
This is about how we implement deep neural nets,
how we implement the convolutions.
This is the conventional direct convolution implementation;
the slide is credited to Julien, a friend from Nvidia.
Originally, we just do a dot product of the nine elements
in the filter with nine elements in the image and sum it up,
so for every output we need nine times C
multiplications and adds, where C is the number of input channels.
Winograd convolution is another, equivalent method.
It's not lossy; it's an equivalent method first proposed
in this paper, Fast Algorithms
for Convolutional Neural Networks.
Instead of directly doing the convolution, sliding
element by element, it first transforms the input feature
map into another tile,
using transform matrices that contain only values like
1, 0.5, and 2, which can be implemented efficiently with shifts.
It also transforms the filter into a four by four tile.
Then what we do is sum over C and take an
element-wise product,
so there are only 16 multiplications happening here.
And then we do an inverse transform to get four outputs.
The transform and the inverse transform can be
amortized, so their cost can roughly be ignored.
So in order to get four outputs, direct convolution needs
nine times C times four, which is 36 times C multiplications,
but now we need only 16 times C.
That is 2.25x fewer multiplications to
compute the exact same convolution.
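Here is a single-tile, single-channel sketch of F(2x2, 3x3) Winograd using the standard transform matrices from the Lavin & Gray paper; in a real layer you would also sum over input channels in the transform domain:

```python
import numpy as np
from scipy.signal import correlate2d

# Transform matrices for F(2x2, 3x3), from "Fast Algorithms for Convolutional Neural Networks".
Bt = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=np.float32)
G  = np.array([[1.0, 0.0, 0.0],
               [0.5, 0.5, 0.5],
               [0.5,-0.5, 0.5],
               [0.0, 0.0, 1.0]], dtype=np.float32)
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

def winograd_f2x2_3x3(d, g):
    """One 2x2 output tile of a 3x3 filter over a 4x4 input tile,
    using only 16 element-wise multiplications."""
    U = G @ g @ G.T        # transformed 4x4 filter
    V = Bt @ d @ Bt.T      # transformed 4x4 input tile
    M = U * V              # 16 multiplications: this is where the savings come from
    return At @ M @ At.T   # inverse transform -> 2x2 output

d = np.random.randn(4, 4).astype(np.float32)
g = np.random.randn(3, 3).astype(np.float32)
direct = correlate2d(d, g, mode='valid')   # 2x2 output, 9 mults per output element
print(np.allclose(winograd_f2x2_3x3(d, g), direct, atol=1e-4))   # True
```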
And here is the speedup: theoretically 2.25x,
and in practice, starting from cuDNN 5, they incorporated
the Winograd convolution algorithm.
This is on VGG, I believe; the speedup is
roughly 1.7x to 2x,
which is pretty significant.
Okay, so far we have covered the efficient algorithms
for inference:
pruning, weight sharing, quantization,
Winograd, and binary and ternary weights.
Now let's see what the optimal hardware for such
efficient inference is,
and what the Google TPU is.
There is a wide range of domain-specific
architectures, or ASICs, for deep neural networks.
They have a common goal: minimize memory
access to save power.
For example, Eyeriss from MIT uses the row-stationary dataflow
to minimize off-chip DRAM access.
And DaDianNao from the Chinese Academy of Sciences
buffers all the weights in on-chip eDRAM instead of having
to go to off-chip DRAM.
The TPU from Google uses 8-bit integers
to represent the numbers.
And at Stanford I proposed the EIE architecture,
which supports inference directly on compressed,
sparse deep neural networks.
So this is what the TPU looks like.
It can actually be slotted smartly into a disk drive bay,
up to four cards per server.
And this is the high-level architecture
of the Google TPU.
Don't be overwhelmed; the core part
here is this giant matrix multiplication unit,
a 256 by 256 array,
so in one single cycle it can perform 64K
multiply-accumulate operations.
Running at 700 MHz, the throughput is 92
tera-ops per second,
because these are integer operations.
That's roughly 25x a GPU and more than 100x a CPU.
And notice, the TPU has a really large software-managed
on-chip buffer: 24 megabytes.
The L3 cache of a server CPU is around
16 megabytes,
so 24 megabytes is pretty large.
And it is fed by two DDR3 DRAM channels.
This is a little weak, because the bandwidth is
only about 30 gigabytes per second, compared with the most
recent GPUs with HBM at 900 gigabytes per second.
DDR4 was released in 2014, so it makes sense that
a design of that era used DDR3.
But with DDR4, or even high-bandwidth memory,
the performance could be boosted even further.
This is a comparison of Google's TPU
with a CPU and a GPU, the K80 GPU by the way.
The TPU die area is much smaller, about half the size of the
CPU and GPU dies, and the power consumption is roughly 75 watts.
And look at this number: the peak tera-ops per second
is much higher than the CPU's and GPU's, about 90
tera-ops per second, which is pretty high.
Thanks to David sharing the slide.
This is the workload at Google.
They did a benchmark on these TPUs.
So it's a little interesting that convolution neural nets
only account for 5% of data-center workload.
Most of them is multilayer perception,
those fully connected layers.
About 61% maybe for ads, I'm not sure.
And about 29% of the workload in data-center is the
Long Short Term Memory.
For example, speech recognition,
or machine translation, I suspect.
Remember, we have just seen there are
90 tera-ops per second of peak performance.
But what number of tera-ops per second
can actually be achieved?
The roofline model is a basic tool to measure the bottleneck
of a computer system:
whether you are bottlenecked by the arithmetic or
by the memory bandwidth.
It's like a bucket:
the lowest part of the bucket determines how much
water it can hold.
In this region, you are bottlenecked
by the memory bandwidth.
The x-axis is the arithmetic intensity,
the number of operations per byte,
i.e., the ratio between computation and memory traffic.
The y-axis is the actual attainable performance,
and here is the peak performance.
When you fetch a single piece of data and can do a lot
of operations on top of it,
then you are bottlenecked by the arithmetic.
But when you fetch a lot of data from memory
and do only a tiny bit of arithmetic on it,
then you are bottlenecked by the memory bandwidth:
how much you can fetch from memory determines
how much real performance you get.
And remember the ratio: when the arithmetic intensity is one,
right here, the attainable performance is exactly the
memory bandwidth of your system.
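The roofline itself is just the minimum of two bounds; here is a tiny sketch using the rough TPU numbers quoted in the talk (92 TOPS peak, ~30 GB/s DDR3) purely as an illustration:

```python
def attainable_tops(arithmetic_intensity_ops_per_byte,
                    peak_tops=92.0, bandwidth_gb_s=30.0):
    """Roofline: attainable perf = min(peak compute, bandwidth * arithmetic intensity)."""
    memory_bound = bandwidth_gb_s * 1e9 * arithmetic_intensity_ops_per_byte / 1e12  # TOPS
    return min(peak_tops, memory_bound)

for intensity in [10, 100, 1000, 3000]:       # ops per byte fetched from memory
    print(intensity, attainable_tops(intensity))
# At low intensity (small-batch MLPs/LSTMs) you sit on the bandwidth-limited slope;
# only workloads with intensity above roughly 3000 ops/byte reach the 92 TOPS roof.
```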
So let's see what the roofline looks like for the TPU.
The TPU's peak performance is really high,
about 90 tera-ops per second.
The convolutional nets pretty much saturate
the peak performance,
but there are a lot of neural networks with a utilization
of less than 10%,
meaning the 90 tera-ops per second actually
achieves about three to 12 tera-ops per second in real cases.
But why is it like that?
The reason is that in order to give a real-time guarantee,
so that the user doesn't wait too long, you cannot batch
a lot of users' images or speech data together
at the same time.
As a result, the fully connected layers
have very little reuse, so they are bottlenecked
by the memory bandwidth.
The convolutional neural nets, for example this blue one,
CNN0, which achieves 86 tera-ops per second,
have the highest ratio between operations and memory bytes,
more than 2,000,
while for the multilayer perceptrons and the
long short-term memory networks the ratio is pretty low.
This figure compares the TPU, the CPU, and the GPU.
Here, at an intensity ratio of one, you can read off
each platform's peak memory bandwidth.
The TPU has the highest roofline.
And here is where these neural networks
lie on the curve:
the asterisks are for the TPU.
They are still higher than the other dots,
but if you're not comfortable with this log-scale figure,
this is what it looks like on a linear roofline:
pretty much everything disappears except
for the TPU results.
Still, although all these points are higher
than the CPU and GPU, they are way below the
theoretical peak operations per second.
As I mentioned before, it is really bottlenecked
by the low-latency requirement, which prevents
a large batch size.
That's why the operations per byte are low.
And how do you solve this problem?
You want a smaller memory footprint
so that you can reduce the memory bandwidth requirement.
One solution is to compress the model, and the challenge then
is: how do we build hardware that can do inference
directly on the compressed model?
So I'm going to introduce my design, EIE, the Efficient
Inference Engine, which operates on the sparse,
compressed model to save memory bandwidth.
The rule of thumb, as we mentioned before, is to take
advantage of sparsity first:
anything times zero is zero,
so don't store it, don't compute on it.
The second idea is that you don't need full precision;
you can approximate.
By taking advantage of the sparse weights, we
get about a 10x saving in computation and 5x less
memory footprint;
the 2x difference is due to the index overhead.
By taking advantage of the sparse activations,
meaning that after ReLU, if an activation is zero, we
ignore it,
we save another 3x of computation.
And with the weight sharing mechanism,
we can use four bits per weight rather
than 32 bits.
That's another eight times saving in the memory footprint.
This is how the weights are stored logically:
a four by eight matrix.
And this is how they are physically stored:
only the non-zero weights.
You don't need to store the zeros,
so you save the bandwidth of fetching them.
I'm also using a relative index to further reduce
the memory overhead.
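Here is a small sketch of that relative-index storage idea; the 4-bit cap and padding-zero trick follow the EIE description, but this is illustrative Python, not the hardware format:

```python
import numpy as np

def compress_relative(weights_col, max_jump=15):
    """Store only non-zeros as (relative_index, value) pairs: the relative index is
    the number of zeros since the previous stored entry, capped at 15 so it fits in
    4 bits; a padding zero entry is emitted when a run of zeros is longer than that."""
    rel_idx, vals, last = [], [], -1
    for i, w in enumerate(weights_col):
        if w == 0:
            continue
        jump = i - last - 1
        while jump > max_jump:                         # emit a padding zero entry
            rel_idx.append(max_jump); vals.append(0.0)
            jump -= max_jump + 1
        rel_idx.append(jump); vals.append(w)
        last = i
    return rel_idx, vals

def decompress(rel_idx, vals, length):
    out, pos = np.zeros(length), -1
    for j, v in zip(rel_idx, vals):
        pos += j + 1
        out[pos] = v
    return out

col = np.array([0, 0, 3.5, 0, 0, 0, -1.2, 0])
print(compress_relative(col))                          # ([2, 3], [3.5, -1.2])
print(decompress(*compress_relative(col), len(col)))   # recovers the original column
```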
In the computation, as this figure shows,
we run the multiplications only on non-zeros.
We broadcast only the non-zero activations;
if an activation is zero, we skip it,
and if a weight is zero, we skip it too.
If both are non-zero, we do the multiplication in that cycle.
The idea is simply that anything multiplied by zero is zero.
This part is a little complicated,
so I'm going to go very quickly:
there is a lookup table that decodes the four-bit
weight index into a 16-bit weight, and the four-bit
relative index is passed through an address accumulator
to get the absolute address.
And this is what the hardware architecture
looks like at a high level;
feel free to refer to my paper for the details.
Okay, speedup.
Using such an efficient hardware architecture
together with model compression, this is the original
result we saw for the CPU, GPU, and mobile GPU.
Now EIE is here:
189 times faster than the CPU and about 13 times faster
than the GPU.
And this is the energy efficiency, on a log scale:
about 24,000x more energy efficient than a CPU
and about 3,000x more energy efficient than a GPU.
It means, for example, that if your battery previously
lasted one hour, now it could last
3,000 hours.
You might say an ASIC is always better than CPUs and GPUs
because it's customized hardware,
so here is a comparison of EIE with peer ASICs, for example
DaDianNao and TrueNorth.
It has better throughput and better energy efficiency
by an order of magnitude compared with those ASICs,
not to mention CPUs, GPUs, and FPGAs.
So we have covered half of the journey:
we've pretty much covered everything for inference.
Now we're going to switch gears and talk about training:
how do we train neural networks efficiently,
how do we train them faster?
Again, we start with the algorithms first,
efficient training algorithms, followed by the hardware
for efficient training.
For efficient training algorithms, I'm going to mention
four topics.
The first one is parallelization; then mixed precision
training, which was just presented about a month ago
at NVIDIA GTC, so it's fresh knowledge;
then model distillation; and finally my work on
Dense-Sparse-Dense training, a better regularization
technique.
So let's start with parallelization.
Anyone in the hardware community is very familiar
with this figure.
As time goes by, what is the trend?
The number of transistors keeps increasing,
but single-threaded performance has plateaued
in recent years,
and the clock frequency has also plateaued,
because of the power constraint.
The interesting thing is that the number of cores keeps increasing.
So what we really need to do is parallelization:
how do we parallelize the problem to take advantage
of parallel processing?
Actually, there are a lot of opportunities for parallelism
in deep neural networks.
For example, we can do data parallelism:
feeding two images into the same model
and running them at the same time.
This doesn't reduce the latency for a single input,
but it makes the effective batch size larger;
basically, if you have four machines, the effective batch
size becomes four times what it was before.
It requires a coordinated weight update.
For example, in this paper from Google,
there is a parameter server acting as the master, and a number
of workers each run on their own shard of training data,
send their gradients to the parameter server, and get the
updated weights back individually.
That's how data parallelism is handled,
and the sketch below shows the same idea in miniature.
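This is a toy parameter-server-style data-parallel step on a linear least-squares problem; the model, loss, and learning rate are placeholders chosen only so the example is self-contained:

```python
import numpy as np

def worker_gradient(weights, data_shard):
    """Each worker computes the gradient of a linear least-squares loss on its own shard."""
    X, y = data_shard
    return X.T @ (X @ weights - y) / len(y)

def parameter_server_step(weights, shards, lr=0.1):
    """The 'server' averages the workers' gradients and broadcasts updated weights."""
    grads = [worker_gradient(weights, shard) for shard in shards]  # in parallel in practice
    weights -= lr * np.mean(grads, axis=0)                         # coordinated update
    return weights

rng = np.random.default_rng(0)
w_true = rng.normal(size=8)
shards = []
for _ in range(4):                        # 4 "machines" -> 4x effective batch size
    X = rng.normal(size=(32, 8))
    shards.append((X, X @ w_true))

w = np.zeros(8)
for _ in range(200):
    w = parameter_server_step(w, shards)
print(np.allclose(w, w_true, atol=1e-3))  # True
```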
Another idea is model parallelism:
you can split your model and hand the pieces
to different processors or different threads.
For example, you want to run a convolution over this image,
which is a deep nest of for loops.
What you can do is cut the input image into
two-by-two blocks so that each thread, or each processor,
handles one quarter of the image,
although there is a small halo region in between
that you have to take care of.
You can also parallelize over the
output or input feature maps.
And for the fully connected layers,
how do we parallelize the model?
It's even simpler:
you can cut the weight matrix in half
and hand each half to a different thread.
And the third idea is that you can even do
hyper-parameter parallelism:
for example, tuning the learning rate and the
weight decay on different machines,
which is very coarse-grained parallelism.
So there are many alternatives you can tune.
A small summary of parallelism:
there is a lot of parallelism in deep neural networks.
With data parallelism you can run multiple
training images at once, but you cannot use an unlimited number
of processors, because you are limited by the batch size.
If it's too large, stochastic gradient descent
becomes plain gradient descent, and that's not good.
You can also use model parallelism:
split the model, either by cutting the image
for the convolution layers,
or by cutting the weight matrices
of the fully connected layers.
So it's fairly easy to get 16 to 64 GPUs training one model
in parallel with very good,
almost linear, speedup.
Okay, the next interesting thing: mixed precision with
FP16 and FP32.
Remember, at the beginning of this lecture
I had a chart showing the energy and area overhead of
16 bit versus 32 bit:
going from 32 bit to 16 bit, you save about 4x the energy
and 4x the area.
So can we train a deep neural network with such low
precision, 16-bit floating point rather than 32 bit?
It turns out we can, partially.
By partially, I mean we need FP32 in some places.
And where are those places?
We can do the multiplications with 16-bit inputs,
but then we do the accumulation in 32 bit,
and we keep a 32-bit copy of the weights for the update.
That's where "mixed precision" comes from.
For example, we keep a master copy of the weights stored in
floating point 32 and down-convert it to floating
point 16; then we do the feed-forward pass with 16-bit
weights and 16-bit activations, and we get 16-bit activations
at the end; the back propagation computation
is also done in floating point 16 bit.
The interesting part is here: we get a floating
point 16-bit gradient for the weights,
but when we do the update, W minus the learning
rate times the gradient, that operation has
to be done in 32 bit on the FP32 master weights.
You can see there are two colors in the figure:
this is 16 bit, this is 32 bit.
That's the mix.
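Here is a minimal sketch of that recipe on a toy linear model (not NVIDIA's actual implementation; loss scaling, which real mixed-precision training also uses, is omitted):

```python
import numpy as np

def mixed_precision_step(master_w_fp32, X, y, lr=1e-2):
    """FP32 master weights, FP16 forward/backward, FP32 weight update."""
    w16 = master_w_fp32.astype(np.float16)             # down-convert master weights
    x16 = X.astype(np.float16)
    pred = x16 @ w16                                    # FP16 forward pass
    err = pred - y.astype(np.float16)
    grad16 = (x16.T @ err) / np.float16(len(y))         # FP16 backward pass
    # The update W <- W - lr * grad must be done in FP32 on the master copy,
    # otherwise small updates would be lost to FP16 rounding.
    master_w_fp32 -= lr * grad16.astype(np.float32)
    return master_w_fp32
```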
So does such low precision sacrifice the prediction
accuracy of your model?
This is a figure from NVIDIA, released just a couple
of weeks ago, actually;
thanks to Paulius for giving me the slide.
The convergence of floating point 32 versus
the mixed precision training is actually pretty much
the same;
if you zoom in a little bit,
they are pretty much identical.
And for ResNet, the mixed precision sometimes behaves
a little better than full precision,
maybe because of noise.
But in the end, after you train the model, this is
the result of AlexNet, Inception V3, and ResNet-50
with FP32 versus FP16 mixed precision training.
The accuracy is pretty much the same
for these two methods.
A little bit worse, but not by too much.
Having talked about mixed precision training,
the next idea is training with model distillation.
For example, you have multiple neural networks:
GoogleNet, VGGNet, ResNet.
The question is, can we take advantage of these
different models?
Of course we can do a model ensemble, but can we also use them
as teachers, to teach a small junior neural network and have
it perform as well as the senior neural networks?
So this is the idea:
you have multiple large, powerful senior neural networks
teaching this student model,
and hopefully it gets better results.
The idea is that instead of using a hard label,
for example for car, dog, cat the probability
of dog is 100%,
you use the output of the geometric ensemble of those
large teacher networks, where maybe the dog has 90%
and the cat about 10%,
and the magic happens here:
you soften the label further,
for example dog 30%, cat 20%.
The dog is still higher than the cat,
so the prediction is still correct, but you use
this soft label to train the student neural network
rather than the hard label.
Mathematically, you control how soft the labels are
with a temperature in the softmax.
And the result is that, starting with a trained model
that classifies 58.9% of the test frames correctly,
the new model converges to 57%
while being trained on only 3% of the data.
So that's the magic for model distillation
using this soft label.
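Here is a small sketch of a temperature-softened distillation loss; the temperature, the blending weight alpha, and the toy logits are assumed values for illustration:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.9):
    """Cross-entropy against the teacher's temperature-softened probabilities,
    blended with the usual hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, T)              # soft targets, e.g. dog 30%, cat 20%
    p_student_T = softmax(student_logits, T)
    soft_ce = -(p_teacher * np.log(p_student_T + 1e-12)).sum()
    hard_ce = -np.log(softmax(student_logits)[hard_label] + 1e-12)
    return alpha * soft_ce + (1 - alpha) * hard_ce

teacher = np.array([5.0, 3.0, -1.0])    # ensemble logits for [dog, cat, car]
student = np.array([1.0, 0.5, -0.5])
print(distillation_loss(student, teacher, hard_label=0))
```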
And the last idea is my recent paper on using
better regularization to train deep neural nets.
We have seen these two figures before:
we pruned the neural network, which has fewer weights
but the same accuracy.
Now what I do is recover and retrain those pruned
weights, shown in red, and train everything
together to increase the model capacity, after
the network has first been trained in a low-dimensional space.
It's like you learn the trunk first and then gradually
add the leaves and learn everything together.
It turns out that on ImageNet this gives roughly a 1% to
4% absolute improvement in accuracy.
It is also general purpose: it works on long short-term memory
and recurrent neural nets as well, in a collaboration with Baidu.
I also open-sourced these specially trained models
in the DSD Model Zoo, which has trained versions of all
these models: GoogleNet, VGG, ResNet, SqueezeNet,
and AlexNet.
So if you are interested, feel free to check out this
Model Zoo and compare it with the Caffe Model Zoo.
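Here is a minimal sketch of the Dense-Sparse-Dense schedule; `train_step` is a placeholder gradient callback, and the step counts, learning rate, and sparsity are arbitrary assumptions:

```python
import numpy as np

def dsd_train(weights, train_step, sparsity=0.5, steps=1000, lr=0.01):
    """Dense -> Sparse -> Dense: dense training, constrained training under a
    pruning mask, then remove the mask and re-densify."""
    # Dense phase
    for _ in range(steps):
        weights -= lr * train_step(weights)
    # Sparse phase: prune small weights and keep them at zero while retraining
    thr = np.percentile(np.abs(weights), sparsity * 100)
    mask = (np.abs(weights) > thr).astype(weights.dtype)
    for _ in range(steps):
        weights -= lr * train_step(weights)
        weights *= mask
    # Re-dense phase: recover the pruned weights (from zero) and train everything together
    for _ in range(steps):
        weights -= lr * train_step(weights)
    return weights
```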
Here are some examples of how Dense-Sparse-Dense training helps
with image captioning.
For example, this is a very challenging figure.
The original NeuralTalk baseline says "a boy in
a red shirt is climbing a rock wall."
The sparse model says "a young girl is jumping
off a tree," probably mistaking the hair for either
the rock or the tree.
But after the final sparse-to-dense training, using this kind of
regularization in a low-dimensional space first, it says
"a young girl in a pink shirt is swinging on a swing."
There are a lot more examples; due to the limited time,
I will not go over them one by one.
For example, "a group of people are standing in front
of a building," when there's no building,
versus "a group of people are walking in the park."
Feel free to check out the paper and see more interesting
results.
Okay, finally we come to hardware for efficient training.
How do we take advantage of the algorithms
we just mentioned,
for example parallelism and mixed precision; how is
the hardware designed to actually
take advantage of such features?
First, GPUs. This is the Nvidia Pascal GPU, GP100,
which was released last year.
It supports up to 20 teraflops of FP16,
and it has 16 gigabytes of high-bandwidth memory
at about 750 gigabytes per second.
Remember, computation and memory bandwidth are
the two factors that determine your overall performance;
whichever is lower becomes the bottleneck.
So this is really high bandwidth, roughly 750 gigabytes
per second, compared with DDR3 at just 10 to 30 gigabytes per second.
It consumes 300 watts,
it's built in a 16 nanometer process,
and it has 160 gigabytes per second of NVLink.
So remember, we have computation, we have memory,
and the third thing is communication.
All three factors have to be balanced in order to
achieve good performance.
This is very powerful, but even more exciting,
just about a month ago Jensen announced the newest
architecture, the Volta GPU.
Let's see what is inside Volta.
It has 15 teraflops of FP32 and, what is new here,
120 tensor-TFLOPS, specifically designed for deep learning;
we'll cover later what the tensor core is
and where that 120 comes from.
And rather than 750 gigabytes per second, this
year, with HBM2, they are using 900 gigabytes per second
of memory bandwidth.
Very exciting.
It's a 12 nanometer process with a die size of more than 800
square millimeters,
a really large chip, supported by 300 gigabytes per
second of NVLink.
So what's new in Volta? The most interesting thing for us
in deep learning is the Tensor Core.
What is a Tensor Core?
It is essentially an instruction that can
do a four-by-four matrix times a four-by-four matrix
fused multiply-add, FMA,
in this mixed precision operation,
in one single clock cycle.
Let's unpack for a moment what that means.
Mixed precision is exactly as we mentioned in the last
section: we have FP16 for the multiplication,
but the accumulation is done in FP32.
That's where the mixed precision comes from.
And how many operations is that? Four
by four by four is 64 multiply-accumulates
in one single cycle.
That's the claimed 12x increase in deep learning throughput
of Volta compared with Pascal, which was released just last year.
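To be concrete about what one such instruction computes, here is a numpy emulation of the mixed-precision 4x4 fused multiply-add (just the math, obviously not the hardware):

```python
import numpy as np

# One tensor-core-style operation: D = A @ B + C, where A and B are 4x4 FP16
# matrices and the products are accumulated in FP32 (4*4*4 = 64 multiply-accumulates).
A = np.random.randn(4, 4).astype(np.float16)
B = np.random.randn(4, 4).astype(np.float16)
C = np.random.randn(4, 4).astype(np.float32)

D = A.astype(np.float32) @ B.astype(np.float32) + C   # FP16 inputs, FP32 accumulate
print(D.shape)   # (4, 4)
```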
This is the result for matrix multiplication at
different sizes:
the speedup of Volta over Pascal is roughly 3x
for these matrix multiplications.
What we care about more is not just matrix multiplication
but actually running deep neural nets,
both for training and for inference.
For training ResNet-50, by taking advantage
of the Tensor Cores in the V100,
it is 2.4x faster than the P100 using FP32.
On the right-hand side it compares the inference
throughput, given a 7 millisecond latency requirement:
how many images per second can it process?
That's a measurement of throughput.
Again, the V100 over the P100, by taking advantage
of the Tensor Cores, is 3.7x faster.
This figure gives a rough idea of what a Tensor Core is
versus an integer unit or a floating point unit.
This whole figure
is a single SM,
a streaming multiprocessor.
The SM is partitioned into four processing blocks:
one, two, three, four, right?
In each block there are eight FP64 units,
16 FP32 units and 16 INT32 units,
and then there are two of the new mixed precision
Tensor Cores, specifically designed for deep learning.
There is also a warp scheduler, a dispatch unit,
and the register file, as before.
What is new here is the Tensor Core unit.
Here is a figure comparing the recent generations of
Nvidia GPUs, from Kepler
to Maxwell to Pascal to Volta.
We can see everything keeps improving.
For example, the boost clock has increased from
about 800 MHz to 1.4 GHz.
And starting with the Volta generation there are
Tensor Core units, which never existed before.
Up through Maxwell the GPUs used GDDR5,
and starting from Pascal,
HBM, the high-bandwidth memory, came into place:
750 gigabytes per second here,
then 900 gigabytes per second, compared with DDR3 at around
30 gigabytes per second.
Memory size actually didn't increase by much,
and the power consumption also remains
roughly the same;
but given the increase in computation, fitting it
into a fixed power envelope is still an exciting thing.
The manufacturing process improved from
28 nanometer to 16 nanometer, all the way to 12 nanometer,
and the chip area increased to over
800 square millimeters, which is really huge.
So, you may be interested in how the GPU
compares with the TPU, right?
In the original TPU paper,
the TPU was designed roughly in 2015,
and here it is compared with the Pascal P40 GPU
released in 2016.
The TPU's power consumption is lower,
and it has a larger on-chip memory of 24 megabytes,
a really large on-chip SRAM managed by software.
Both of them support INT8 operations,
and for inferences per second under a roughly
10 millisecond latency budget,
if the TPU is 1x,
the P40 is about 2x.
Then, just last week,
at Google I/O,
a new nuclear bomb landed on the Earth:
the Google Cloud TPU.
Now the TPU supports not only inference
but also training.
There is very limited information we can get
beyond the Google blog post:
the Cloud TPU delivers up to 180 teraflops
to train and run machine learning models.
And multiple Cloud TPUs are assembled
into a TPU pod,
which is built with 64 of the second-generation TPUs
and delivers up to 11.5 petaflops
of machine learning acceleration.
In the Google blog, they mentioned that
one of their large-scale translation models
used to take a full day to train
on 32 of the best commercially available GPUs, probably P40s
or P100s, maybe.
Now it trains to the same accuracy
within one afternoon, on just one eighth of a TPU pod,
which is pretty exciting.
Okay, so as a little wrap-up.
We covered a lot of stuff across the two-by-two space
of algorithm and hardware, inference and training.
We covered the algorithms for efficient inference:
for example pruning, quantization,
Winograd convolution, binary and ternary weights,
and weight sharing.
Then the hardware for efficient inference:
for example the TPU,
which takes advantage of INT8,
and also my EIE accelerator design, which takes advantage
of sparsity: anything multiplied by zero is zero,
so don't store it, don't compute on it.
Then the efficient algorithms for training: for example,
how we do parallelization, and the most recent research on
mixed precision training, taking advantage
of FP16 rather than FP32,
which saves about four times the energy
and four times the area,
without really sacrificing the accuracy you get from
the training.
And also Dense-Sparse-Dense training, which uses sparsity
as a better regularization, and the teacher-student model:
you have multiple teacher neural networks and a small
student network, and you distill the knowledge
from the teacher networks using a temperature.
And finally we covered the hardware for efficient training
and introduced two nuclear bombs:
one is the Volta GPU, with the amazing Tensor Cores
in the newest generation of Nvidia GPUs,
and the other is the TPU version two, the Cloud TPU.
We also walked through the progression of recent
Nvidia GPUs, from the Kepler K40,
which is actually what we used when I started my research,
to the M40,
then Pascal, and finally the exciting Volta GPU.
So every year there is a nuclear bomb in the spring.
Okay, a little look ahead at the future.
In the city of the future we can imagine a lot
of AI applications: smart society, smart care,
IoT devices, smart retail, for example Amazon Go,
and also the smart home, a lot of scenarios.
They pose a lot of challenges for hardware design,
requiring low latency, privacy, mobility,
and energy efficiency;
you don't want your battery to drain very quickly.
So it's both a challenging and very exciting era
for the co-design of machine learning and
deep neural network model architectures
together with the hardware architecture.
So we have moved from the PC era to the mobile era,
and now we are in the AI-first era.
I hope you are as excited as I am about this kind of
brain-inspired cognitive computing research.
Thank you for your attention; I'm glad to take questions.
[applause]
We have five minutes.
Of course.
- [Student] Can you commercialize the deep architecture?
- The architecture, yeah, some of the ideas are pretty good.
I think there's opportunity.
Yeah.
Yeah.
The question is, what can we do to make the hardware better?
Oh, right, the question is about the challenges
and the opportunities for running deep neural networks,
or AI algorithms in general, on small embedded devices.
Yeah, so those are the algorithms I discussed
at the beginning, about inference.
Here.
These are the techniques that can enable such
inference, or AI, running on embedded devices:
having fewer weights, fewer bits per weight,
quantization, low rank approximation,
small matrices with the same accuracy, even going to binary
or ternary weights, which use just two bits
for the computation rather than 16 or even 32 bits,
and also the Winograd transformation.
Those are the enabling algorithms for
low-power embedded devices.
Okay, the question is: with binary weights, can software
developers still take advantage of it?
There is a way to take advantage of binary weights.
In one register there are 32 bits,
and you can think of that as 32-way parallelism:
each bit is a single operation.
So say previously we did 10 ops per second;
now you get 320 ops per second,
by doing bitwise operations,
for example XNOR operations.
One register,
one instruction, becomes 32 operations.
There is a paper called XNOR-Net;
they implemented it, quite amazingly,
on a Raspberry Pi, using this feature
to do real-time detection. Very cool stuff.
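Here is a small sketch of that bit-packing trick: a 32-element {+1, -1} dot product done with one XNOR and a popcount (pure Python, purely for illustration):

```python
import random

def pack_bits(values):
    """Pack 32 values from {+1, -1} into one 32-bit word: bit 1 for +1, bit 0 for -1."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word

def binary_dot(a_word, w_word, n=32):
    """dot(a, w) = 2 * popcount(XNOR(a, w)) - n, so one XNOR + popcount
    replaces n multiply-adds."""
    xnor = ~(a_word ^ w_word) & ((1 << n) - 1)
    return 2 * bin(xnor).count('1') - n

a = [random.choice([1, -1]) for _ in range(32)]
w = [random.choice([1, -1]) for _ in range(32)]
print(binary_dot(pack_bits(a), pack_bits(w)),
      sum(x * y for x, y in zip(a, w)))   # the two values are equal
```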
Yeah.
Yeah, so the trade-offs are always power, area,
and performance: in general, all hardware designs
have to take into account the performance, the power,
and also the area.
When machine learning comes in, there's a fourth
figure of merit, which is accuracy.
And there is a fifth one, which is programmability:
how general is your hardware?
For example, if Google just wants to use it for AI
and deep learning, it's totally fine
to have a very specialized architecture
just for deep learning, supporting convolutions,
multilayer perceptrons, and long short-term memory;
but for GPUs, you also want to support
scientific computing, graphics, AR and VR.
So that's the first difference.
And the TPU is basically an ASIC, right?
It's very fixed-function, but you can still program it
with coarse instructions; the people at Google
designed those coarse-granularity instructions.
For example, one instruction loads a matrix,
stores a matrix, does a convolution,
or does a matrix multiplication.
Along with those coarse-grained instructions
they have a software-managed memory,
also called a scratchpad.
It's different from a cache, where the hardware decides
what to evict; here,
since you know the computation pattern,
there's no need for out-of-order execution
or branch prediction, no such things.
Everything is deterministic, so you can take advantage of
that and maintain a fully software-managed scratchpad
to reduce the data movement; and remember, data movement
is the key to reducing the memory traffic
and energy consumption.
So, yeah.
The Movidius and Nervana architectures I'm actually not
very familiar with, and I didn't prepare slides on them,
so I'll hold off on commenting.
Oh, yeah, of course.
These techniques can certainly be applied
to low-power embedded devices.
If you're interested, I can show you...
whoops...
some examples, oops,
where is that,
of my previous projects running deep neural nets.
For example, on a drone, this is using an Nvidia TK1
mobile GPU to do real-time tracking and detection.
This is me playing with my nunchaku,
filmed by a drone doing the detection and tracking.
And here is an FPGA running a deep neural network.
It's pretty small,
about this large, doing face alignment and
detecting the eyes, the nose, and the mouth
at a pretty high framerate,
consuming only three watts.
And this is a project I did at Facebook, running
deep neural nets on the mobile phone to do
image classification; for example, it says it's a laptop,
or you can feed it an image and it says
it's a selfie, with a person and a face, et cetera.
So there are lots of opportunities for
embedded and mobile deployment of deep neural nets.
No, there is a team doing that,
but I probably cannot comment too much;
there is a team at Google doing that sort of stuff, yeah.
Okay, thanks, everyone.
If you have any questions, feel free to drop me an e-mail.
Đăng nhận xét