>> OK.
Hi everybody.
This is Bowerbirds of Technology.
I am Sam.
That is my Twitter handle.
If you'd care to send questions, comments, complaints, thoughts on anything.
This is a bowerbird.
Bowerbirds live in Australia, where they build structures to attract mates.
Please enjoy the nice photos of birds that I found on Flickr.
Going to open with a quote.
Speaking of Flickr, Cal Henderson worked there, and I think this quote was from DjangoCon 2008.
It turns out that all of -- I really can't do math today.
I'm going to reset.
Most websites aren't in the top 100 websites.
Just read the slide, Sam.
It turns out that all but 100 are not in the top 100 websites.
There we go.
OK, this is a Zipf distribution.
It states that the frequency of whatever you pick is inversely proportional to its rank.
It looks like a straight line, but that's because the axes are log-log.
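If you want the actual shape of it, Zipf's law is just a power law in rank -- this is my sketch of the math, not something from the slide:

    f(r) \propto 1 / r^s                      % Zipf's law; classically s is about 1
    \log f(r) = \log C - s \log r             % so on log-log axes it's a straight line with slope -s
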
Empirical studies have shown that this holds true pretty much everywhere, so the point I'm trying to make here is that most of us -- except for the people who work at Google, hi, Liz! -- are not Google, and that is OK.
Because tons of products do just great at not-Google scale.
I'm also not Google.
This is a different one of the other 7 hair colors from the last two years.
I currently work at a company called Nuna, where we do data processing for health insurance companies and self-insured employers. And before that I was at Twilio.
So, roughly a decade on large and fast-growing, but not Google-scale, web services.
So I'm going to start today by talking about what Google's problems are, what Google, Amazon, and Facebook are thinking of, and that's this:
extremely high throughput, tens of thousands of servers in dozens of data centers across the world, thousands of engineers and the ability to specialize some of those engineers into extremely small niches, and unlimited resources ...
Wow, I was supposed to click through those, sorry.
OK.
Case studies: Uber. And this is not an endorsement of any of Uber's behavior towards human beings.
[laughter]
But Uber ran into some scaling issues, and I thought this was interesting:
they built themselves a new data store, called Schemaless.
And the things they wanted were things like being able to linearly add capacity by adding new
servers, and the decision to favor write availability over read-your-writes semantics.
They wanted event notifications or triggers.
They said in the blog post on this stuff, we had a system, Kafka ... Have you tried
updating?
So what is this then?
What is this Schemaless thing that they built?
Quote it is an append-only sparse -- I have only one reply to this.
This comic is famous enough that you hopefully recognize this.
But here it is.
How do I query the database?
It's not a database, it's a key-value store.
OK, it's not a database, how do I query it?
You write a distributed function in Erlang.
Did you just tell me to go ... Myself?
I believe I did, Bob.
>> So doing things like this has a cost, and the cost of that is in new abstractions.
The boundary between the app and the database changes, because the app has to know about
schemas and know about persistence levels.
You can't read things you just wrote, right?
Eventual consistency is really hard to reason about.
Neither can other processes read things you just wrote.
And you definitely don't get joins.
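To make that abstraction cost concrete, here's a toy sketch in Python of the kind of interface the application ends up coding against -- my illustration, not Uber's actual Schemaless API:

    # A hypothetical key-value client, NOT Uber's Schemaless API: just a sketch of
    # what the application layer takes on when the datastore won't do joins for you.
    store = {}  # stand-in for a distributed key-value store

    def put(key, value):
        store[key] = value

    def get(key):
        return store.get(key)

    # The "schema" now lives in application code: we decide how keys are laid out.
    put("trip:42", {"rider_id": "user:7", "fare": 23.50})
    put("user:7", {"name": "Ada"})

    # What would be "SELECT ... JOIN ..." in SQL becomes two reads plus glue code,
    # and under eventual consistency either read might return stale or missing data.
    trip = get("trip:42")
    rider = get(trip["rider_id"]) if trip else None
    print(rider["name"] if rider else "rider not visible yet")
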
And Uber also gave up developer familiarity, right?
They can't hire quickly for this thing and they can't ramp quickly.
People aren't going to walk through the door knowing how to use schemaless, and good luck
with contractors.
Next stop, Amazon and their service architecture.
I'm guessing a lot of people are familiar with Steve Yegge. He accidentally went public
with a long rant he wrote at Google about how Amazon Web Services was going to eat Google's
lunch, and the main focus of this rant was how Amazon ended up with a service-oriented architecture.
Amazon has, I think, famously profited from going all in on service-oriented architecture super early.
Jeff Bezos put out the following mandate.
Teams must communicate with each other through these interfaces.
There will be no other form of interprocess communication allowed.
No database calls, no library linking, the only communication allowed is via service
interface calls over the network.
It doesn't matter what technology you use, Bezos doesn't care.
All service interfaces, without exception, must be designed from the ground up to be externalizable.
And anybody who doesn't do this will be fired.
[laughter]
So this is how Amazon decided they were going to make their systems and developer teams scale.
They were, and continue to be extremely serious about this.
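To make the mandate concrete, here's a minimal sketch -- the service name and endpoint are made up, this is not Amazon's code -- of what "only service interface calls over the network" looks like, versus reaching into another team's database:

    # A sketch of the Bezos mandate in practice, with invented names: instead of
    # importing the billing team's code or querying their database, you call
    # their service interface over the network.
    import json
    import urllib.request

    BILLING_SERVICE = "http://billing.internal.example.com"  # hypothetical endpoint

    def get_invoice(invoice_id):
        # The only thing we know about the billing team is their HTTP interface.
        with urllib.request.urlopen(f"{BILLING_SERVICE}/invoices/{invoice_id}") as resp:
            return json.loads(resp.read())

    # Forbidden under the mandate: connecting straight to billing's database, e.g.
    #   psycopg2.connect("dbname=billing ...")
    # because that couples you to their internal schema instead of their interface.
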
But they also learned some things, which Steve was kind enough to tell us in his accidental
blog post.
They learned that pager escalation gets way harder, because a ticket might bounce through
20 service calls before the real owner of the problem was identified.
Amazon learned that every single one of your peer teams suddenly becomes a potential denial-of-service attacker.
So you can't do anything until you have rate limiting and queuing in place.
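Here's a toy sketch of the kind of rate limiting they mean -- a token bucket, with my own names, not anything from Amazon:

    # A toy token-bucket rate limiter, the kind of thing every service needs
    # once its peer teams can accidentally DoS it.
    import time

    class TokenBucket:
        def __init__(self, rate, capacity):
            self.rate = rate            # tokens added per second
            self.capacity = capacity    # maximum burst size
            self.tokens = capacity
            self.last = time.monotonic()

        def allow(self):
            now = time.monotonic()
            # Refill tokens based on elapsed time, capped at capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False   # caller should queue, retry later, or reject the request

    limiter = TokenBucket(rate=100, capacity=200)  # ~100 requests/sec, bursts of 200
    if not limiter.allow():
        print("429 Too Many Requests")
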
They learned that monitoring and QA are the same thing, because sometimes the only thing still functioning
in the service is the little component ... and they learned that once you have all these
services, you won't be able to find any of them without a service discovery mechanism,
which is itself another service.
So what I'm trying to say here really is that having massively scalable infrastructure costs
developer time.
Right?
That's the crux of this.
And you, as somebody who is not Google, don't have a lot of that.
I have one more story here.
Let's talk about users.
Talking about #newtwitter.
Hands up if you were on Twitter in 2010.
OK, do you recall when the URLs had hashbangs in them?
So Twitter went to single-page apps really early.
Oh, that's a lot better.
Get that in place.
They were building rich applications, like we heard about this morning, and this was
before everybody got to the universal apps that we were talking about.
You had to pull down all of the JavaScript before content could render, right?
And we also didn't have the history API yet.
HTML5 gave us this nice new API where you could push states onto the history stack.
It wasn't there yet, so hashbangs happened.
And this had some problems.
The part of the URL after the hash, the fragment that identifies the specific content, doesn't even go in the HTTP request.
Dan Webb, who was at Twitter at the time, told us that URLs are important, that URLs are forever, and that cool URLs don't change.
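Just to make the fragment problem concrete -- this is my couple of lines of Python, not something from the talk:

    # The fragment never leaves the browser, so the server only ever sees
    # the part of the URL before the "#".
    from urllib.parse import urlsplit

    url = "https://twitter.com/#!/jack/status/20"   # old hashbang-style URL
    parts = urlsplit(url)
    print(parts.path)       # "/"  <- this is all the HTTP request carries
    print(parts.fragment)   # "!/jack/status/20"  <- only client-side JavaScript sees this
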
Another thing is that you're stuck running some JavaScript on the root document of your domain
forever, because people will have saved links with the hashbangs in them.
So you can rearchitect, you can rebuild, you can go to a universal app, but you're stuck:
something has to be there waiting to parse those fragments forever.
And finally they had performance issues, right?
This is still more from Twitter themselves, from when they confessed to this as they undid
it a year and a half later: again, you don't see anything until all of the JavaScript
is downloaded on the page.
This is worse for people who don't have access to the latest and greatest technology, and
so this has a social impact.
If you don't have a brand-new iPhone or a brand-new MacBook, websites can run
more slowly for you.
And I think this has been mitigated a little bit because computers have gotten faster.
But they learned a lesson: when it comes to your users, picking the latest and greatest
technology can leave some of them behind, too.
Not all users are alike.
The deployed base is so widely varied because of affordability, and because a lot of times we
don't explain our technology well enough for people to understand what's going
on.
Accessibility concerns: it's really great that we have live captioning here.
Some conferences don't.
A lot of websites have massive accessibility issues.
Corporate IT policies, right? If you're stuck with a SaaS application that requires
ActiveX controls, well, good luck.
Things don't have long-term support.
So those are some points, and some of you probably want to put your hand up and say, well, I want to be Google.
But even Google wasn't Google overnight.
And more quotes.
Ben Gomes got interviewed by ReadWrite.
And he said that in 1999, Google was serving 10,000 searches per day.
OK, so by 2006, seven years later, Google was serving the same 10,000 queries every second.
And by 2012, Google could build an index of those pages, out of billions more, in about one minute.
But sometimes even Google is still not Google.
I have it on good authority that there are things internal to Google -- it's not Liz, sorry, I looked at
you, but she is not the authority here -- that don't use Bigtable, because they don't have to.
And this just goes to show you that quote-unquote boring technology can go really far.
Last year at PyCon ... when I worked at Twilio ... this was stuff
that people had been doing for 15 years, that does 20,000 writes per second and, most importantly,
still gets full ACID compliance, so the application doesn't have to think about things like integrity
constraints.
And more on this: if you and your team and product are lucky enough to experience
the joys of exponential growth, the first part of that curve is gentle enough to give
you warning, right?
There's no single point where the system is going to all at once say this is too many
requests.
I was fine at 9,999, but 10,000 requests per second, I can't do that, I'm shutting down.
Systems just slowly start degrading.
So as you're scaling, find the thing that's most on fire.
Evolve it to better cope with new scale, replace it if you have to, and repeat.
OK, so hopefully those of you who are saying, but I want to be Google, you can still say
I am OK, I'm not Google yet.
What does that mean?
And I'm going to say that you should worry about user trust above all else.
Maintain your users' trust and meet their needs.
Through things like fast safe iteration.
So move fast without breaking things.
Your developer team's time is one of your best resources so make the most of it.
And you also want healthy teams.
I'm going to talk about on call in a little bit but there are plenty of things to consider
here, like inclusivity.
So here's the metaphor.
Let's be Bowerbirds.
We don't have to reinvent the wheel.
We should build bowers instead.
Let's say that our environment, our found environment, is the modern software ecosystem:
open source, vendors, off-the-shelf software solutions, anything you can find
that's already constructed. How can we put that together?
We want to have healthy relationships with our users and our team, so how can we find
what we need and combine it to build a beautiful bower of technology that makes both our users
and other developers happy?
So with this tortured metaphor, we want to talk about technical decisions in this framework
and how to run a team and how to run your business with this stuff in mind.
So first, technology.
Let's talk about picking technology.
We need a bottlecap, it looks like.
This bottlecap could be a database, it could be a web server, it doesn't really matter.
But what do we want to think about when we're looking for bottlecaps?
We want to think about how mature the project is, and right now I'm talking about open source
for the most part.
Something you can adapt without paying for it.
And it means that you should stay away from things that are brand spanking new, right, that only
have three commits on GitHub, but that at the same time aren't in the Apache Attic, where things
go to die.
You want to look at the maintainership.
If it came from inside a large company, you don't want the only maintainers to be the company
that threw it out over the wall.
If the project isn't big enough to have a software foundation of its own,
what does the release philosophy of the maintaining team look like?
Security is a big one.
Quickly search for CVEs, right? Look for exploits that are known for this thing.
Count them.
Figure out how quickly they showed up.
Are people looking at this and actually reporting new ones?
When they do show up, how quickly are they resolved?
Are they resolved at all?
How hard is it going to be to deploy this thing?
To go with security, we have stability, there's two types here, right?
I think about API stability, so is Version 2.0 going to come out and break all of your
library calls?
And system stability.
So does the database actually database?
Also, within open source software, we want to talk about the project ecosystem, right?
So this could mean a lot of things.
This could be the library support for your languages, are developers familiar with this
thing?
How fast are people able to ramp up on this thing?
Can you find consultants if you really need people in a pinch?
If you pick tech that everybody knows, that means you're not going to have to wait three
months for your new dev to be productive on the stack.
A friend of mine, Josh, refers to this as out-of-the-boxiness.
Or you can also think of it as friction, right?
Are there Dockerfiles if you want to deploy with Docker? Is there a Chef cookbook if you're
using Chef?
Right?
How easy is it to get off the ground with?
So a bunch of questions here.
So, do the docs exist?
Are they up to date?
Can you actually get what you need, or are you going to have to write it yourself?
And support and consultants: can you get a support contract from somebody?
Because when your main database dies at 1 a.m. and your backup turns out to be corrupt,
you probably want help.
And finally, there's licensing land mines to be aware of.
If you're writing mobile apps, be careful, because GPL code can't go in the Apple App Store.
And the Apache Foundation declared that Facebook's license was no good for anybody who wants
to use that software in Apache projects; Facebook ended up having to relicense React.
So that's open source, mostly.
Obviously, our only two choices are not open source software or writing it ourselves.
We can pay money for things.
What's it going to cost to build the thing?
And how long is it going to take?
What do you lose in the meantime by not having this thing tomorrow?
And keep in mind that there's two forms of cost and time here, right?
It's not only that you don't get the shiny thing tomorrow.
You have to choose something else not to build because you're using up some of your developer
time.
How hard is it to replace a vendor, right?
What happens if spontaneous massive vendor existence failure occurs, the link between
your network and their data center goes down, or they go out of business?
These are all things you should keep in mind before you sign a contract. OK.
Now on to teams.
How do we run our services?
How do we run our projects and businesses?
What do our relationships with our customers look like?
We care about our customers and our users, because we want to have a billion of them, so we can be Google, right?
So how do we show that we care about our customers?
How do we tell them that we want them to trust us, and what does a healthy team look
like?
So, teams first. I'm going to talk about a few things here, and I think there will be
echoes of things that have been said in greater detail earlier today, because this is a wonderful
conference for that.
We're going to talk about on call first.
The industry has a problem with on call and pager rotations.
There's an extreme here that's far away from what we do as software people: if you think
about industries where life safety is critical, right, there's a completely different on-call
model; we're talking about hospitals, people who work at nuclear power plants.
They do it like this.
There are 168 hours in a week.
A standard work week is 40 hours, so that's 4.2 people.
Well, unfortunately, I haven't seen 0.2 of a person, so you round up to five.
Well, crap, if you do the math backwards, that's only about 34 hours per week each.
Put your hand up if you actually staff five developers just to keep your site up.
For the video: there are exactly zero hands in the audience.
I don't think I would do this myself.
What is a happy medium?
How do we make on-call less awful?
We need to empower people to not be robots.
Employing humans to be robots is bad, so we should do less of it.
On-call's job, and I think we heard about this earlier: on-call should hopefully be paged
at 2 a.m. maybe once a week or ideally once a month, and then their job is to
find the thing that broke and make sure it doesn't do it again.
And your on-call, and anybody participating in it, needs time and space and authority
to do this, so give it to them.
We should also pick appropriate levels of availability and scalability.
You're not going to go from 10,000 requests per day to 10,000 requests per second overnight
or even in a year or even in five years.
You should know what you're able to do capacity-planning-wise.
And I like to say you should have a plan for 10X scale, and ideally an idea of how you're going to
get to 100X.
In telecom, where people make 911 calls, most contracts you'll find have an SLA where the customer
gets money back if you go below three and a half nines.
Here's the math again, in case we didn't get enough of it earlier this morning.
Two nines of availability is 3.65 days of downtime per year.
Three nines is just under 9 hours per year, about 1.5 minutes per day.
Four nines is about 8 seconds per day, and five nines, if you really want to get extreme ... is 5.26
minutes per year.
Under ten seconds a day: that's not enough time for a human to do anything.
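If you want to check those numbers yourself, a few lines of Python will do it; this is just my back-of-the-envelope, not a slide:

    # Quick sanity check of the nines math above.
    SECONDS_PER_YEAR = 365 * 24 * 60 * 60

    for nines, availability in [(2, 0.99), (3, 0.999), (4, 0.9999), (5, 0.99999)]:
        downtime = SECONDS_PER_YEAR * (1 - availability)
        print(f"{nines} nines: {downtime / 3600:8.2f} hours/year "
              f"({downtime / 365:6.2f} seconds/day)")
    # 2 nines ~ 87.6 hours/year (3.65 days), 3 nines ~ 8.76 hours/year,
    # 4 nines ~ 52.6 minutes/year (~8.6 s/day), 5 nines ~ 5.26 minutes/year.
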
So one, do you need that SLA?
And two, yeah, just don't overcommit yourself.
Odds are your users won't even notice if you take it offline over the weekend to do a database migration.
OK, so yeah: with a humane on-call schedule and sensible alerting to avoid fatigue,
you get this culture where people who might otherwise not have joined can show up and participate in our teams.
Which is great: people with children, people who are disabled, et cetera.
This goes hand in hand with the next thing which is building a safe and inclusive work
environment.
So let's talk about psychological safety.
This is key.
You need an environment where people can be comfortable and safe disagreeing with each
other and having discussions, without feeling like they're going to suffer negative consequences.
And that means inclusivity.
It's not just diversity.
Diversity is not enough.
People with differing backgrounds need to be comfortable being themselves.
And that means things like setting ground rules; make a team charter.
This list is not all-inclusive: code reviews, right?
Have rules about what you can say in them; don't feign surprise at people.
Meeting etiquette: don't interrupt anybody.
Give credit to people, right? So if somebody
speaks up and her idea gets ignored, and then somebody else brings up the same idea later,
say, excuse me, I believe that was Jane's idea.
Pass that forward.
People need space to learn, so don't feign surprise.
Never say, oh, my God, I can't believe you don't know how to do that.
That's the worst thing you can do to somebody who's trying to learn.
Don't tell people to read the fine manual.
And you need to have rules around how you resolve conflict.
Be aware of and take steps to address conscious and unconscious bias in your organization.
OK.
Click?
Oh, and it's not enough to just hire women and people of color and disabled people; you
have to ensure that all of these populations who are underrepresented in our industry have
equal access to growth opportunities, right?
So make sure you're counting how many people you're promoting and look at the ratios there.
Look at your upper management.
If you have all white men in your directorate, you need to fix that.
Please, you can find consulting firms to help with this.
Don't do it all yourself.
You can engage one of these firms and pay them.
Because it's wrong to just make the few underrepresented people that you do have in your company do
this work as an unpaid side gig on top of what they came to your company to do, which is write
code.
So make your on call reasonable, make your teams inclusive and safe for everybody, because
at the end of the day, we work with humans, first.
And that brings me to other humans, which are your users. So how do you keep your users
happy?
You have to have empathy for your users.
That means knowing your user base, right?
So there are tradeoffs in terms of how detailed you can get in terms of capturing the stories
of people who are using your product.
Get the user data, get the stories about how people are using it, and know your impact when you
make changes or when you have outages, especially when you have outages, right?
So understand that if you're a telecom, when you go down, you may be preventing people
from calling 911, whereas if you're Instagram, that may be OK.
And use that empathy.
Set expectations ahead of time.
You can also degrade gracefully, right? So coming back to new Twitter for a second, these
days it's really more of a universal application, right?
You get a skeleton with the tweet you asked for, and the JavaScript enhances it.
Netflix falls back to default recommendations in case the recommendation engine happens
to be down when you open the app.
So you can still watch movies, maybe just not the movies you were looking for right away.
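The pattern looks roughly like this -- a sketch with invented names, not Netflix's actual code:

    # Graceful degradation: if the personalized path fails, serve something generic
    # instead of an error page. Names and data here are made up for illustration.
    DEFAULT_RECOMMENDATIONS = ["Popular right now", "Top picks", "New releases"]

    def fetch_personalized_recommendations(user_id):
        raise TimeoutError("recommendation engine is down")  # simulate an outage

    def get_recommendations(user_id):
        try:
            return fetch_personalized_recommendations(user_id)
        except Exception:
            # The page still renders; the user just gets generic rows instead.
            return DEFAULT_RECOMMENDATIONS

    print(get_recommendations("user-123"))  # falls back to the defaults
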
You need to overcommunicate.
Particularly when there's a problem.
So update your status page when you think there might even be an incident.
And speaking of which, I love this story: remember last February when S3 died?
This happened for real: we were unable to update the individual services' status
on the AWS Service Health Dashboard. Take that in for a second.
So if you're running on Amazon, put your status page on Google Cloud Platform.
If you're on Azure, I don't know, go to IBM.
They put the Google Doc with their incident notes up and made it public,
so everybody who was using them could watch the engineering team work in real time as
they were working to solve this thing.
A really great use of transparency.
Still on overcommunicating: listen to your users.
Put their tickets in your Zendesk or whatever you're using, and listen to them when they're telling
you about their problems.
You can measure support performance.
You can look at the time to first response on your support tickets, you can think about
the time to resolve your support tickets, and you can also send out surveys where you get a
satisfaction score.
You can measure support as well as your services.
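A rough sketch of what that measurement could look like, with made-up ticket data:

    # Measuring support the way you measure services; the ticket fields are invented.
    from datetime import datetime
    from statistics import mean

    tickets = [
        {"opened": datetime(2018, 3, 1, 9, 0), "first_reply": datetime(2018, 3, 1, 9, 40),
         "resolved": datetime(2018, 3, 1, 15, 0), "satisfaction": 4},
        {"opened": datetime(2018, 3, 2, 14, 0), "first_reply": datetime(2018, 3, 2, 16, 30),
         "resolved": datetime(2018, 3, 3, 10, 0), "satisfaction": 5},
    ]

    ttfr = mean((t["first_reply"] - t["opened"]).total_seconds() / 3600 for t in tickets)
    ttr = mean((t["resolved"] - t["opened"]).total_seconds() / 3600 for t in tickets)
    csat = mean(t["satisfaction"] for t in tickets)

    print(f"time to first response: {ttfr:.1f} h, time to resolve: {ttr:.1f} h, CSAT: {csat:.1f}/5")
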
I'm running out of time for all these slides.
Disaster recovery, because it will happen.
Really quickly: identify your fault domains. Fault tolerance has costs, right?
Practice your failover and backup recovery ahead of time, in controlled conditions; you'll
thank me later.
Security: very quickly, look at the OWASP guide, from the Open Web Application Security Project.
What are your assets that somebody might want to get at?
How would they get in to take over those assets, and how can you mitigate those vectors
of attack?
Don't do this.
Don't do this, either. And finally on security, communicate, because you can treat security problems
like any other incident.
The longer you keep a security problem a secret, the worse the backlash is going to be.
Think Experian.
No, we may not all be Google or Facebook, but we can all learn from their paths and
we can all adopt code and ideas from them and build amazing Bowers of technology for
our users.
Finally, I had to close with this: the metaphor I came up with before bowerbirds was dung beetles,
so aren't you glad you got pictures of bowerbirds instead?
Thank you! [Applause]