So, hello everyone. I'm Jennifer Petoff, and today I'll be giving an introduction to site reliability engineering as it's practiced at Google. A quick poll of the audience: who here is already familiar with SRE? OK, a fair number of folks. Who here considers themselves a practicing site reliability engineer? Fewer folks, OK. So for those of you who are less familiar with SRE, this talk will hopefully provide a good overview of the principles, practices, and cultural elements of SRE. For those of you who are already familiar: much like DevOps, SRE can mean different things to different people, so hopefully there's some stuff up here that resonates with you, and if there are things you do differently in your world, I'd love to have a dialogue about that so we can learn from each other.

All right, so why am I up here speaking today? Just a few facts about me: I've been at Google for 11 years now. Another Yank up here on stage, what are the odds, two in a row. I have, however, been based in Dublin for the past eight years. I'm a senior program manager on the SRE team and have been for the past five years or so. I lead the SRE education program at Google, and I'm one of the co-editors of the original SRE book that we published back in 2016. You can find me on Twitter, at Jen-ski, if you'd like to follow along. A couple of fun facts: my nickname is actually Dr. J, and that's because I have a PhD in chemistry. I also love to travel, and I'm a part-time travel blogger at Sidewalk Safari, so if anybody wants to chat about that in the hallway track, those are just a few fun hooks there.

The problem is the microphone is here, so I'm trying to make sure the captioning folks can hear me. Is it any better? Do you have the handheld mic? If I can use the handheld and aim my voice at you a little better... all right, we'll go karaoke style, we'll make it work.

All right, so in this talk we're going to cover what SRE is, some key principles of site reliability engineering, the practices of SRE, and, at the end, just a little bit on how to get started.

So, what is SRE? I'll actually start by talking about the core problem that SRE tries to solve, and we'll talk a little bit about the high-level organizational structures that facilitate the practice of SRE. It can be useful to talk for a moment about the operational cost of software and the long-term costs associated with developing it. In general, software engineering as a discipline really focuses on designing and building rather than on operating and maintaining software, despite the fact that perhaps up to 90% of the cost of a piece of software is incurred after launch. In a lot of organizations, this means the focus is on hiring great developers, and operations can be an afterthought: running the software is someone else's problem. This is not a great situation to be in. What this means is that the incentives really aren't aligned. You've got developers who want to run fast, they want to launch things, they want to be as agile as possible, and then, hey, throw it over the wall to the operators, who are tasked with running the software and who are responsible for scalability, reliability, and maintenance of that software. So again, this is a pretty terrible setup, not designed for harmony; it's pretty brittle and doesn't scale well.

So let's talk about some of the techniques we can use for dealing with this and reducing product lifecycle friction. In terms of breaking down the silos between the business and development, agile really solves for this to some extent, and if you look at the bridge between development and operations, the goal of DevOps is also to break down the silos between those two boxes. DevOps actually came around around the same time as SRE; site reliability engineering actually predates DevOps by about 10 years, but it was only in 2016 that we started talking about it externally. Both actually try to solve the problem of how you resolve that tension between development and operations to achieve better reliability and agility.

Depending on who you talk to, you'll get a different definition of what DevOps actually is. For the purposes of this discussion, we consider DevOps a set of practices, guidelines, and culture designed to break down silos in IT development, operations, architecture, networking, and security. The five key areas of focus of DevOps are to reduce organizational silos, accept failure as normal (things are going to go wrong, it's just a matter of when), implement gradual changes, leverage tooling and automation, and measure everything.

So now let's talk a little bit about the site reliability engineering approach to operations. One of the key elements here is to use data to guide decision-making and take some of the emotion out of it. We also treat operations as a software engineering problem, so we tend to hire people who are motivated and capable of writing automation. My boss was actually quoted in the Irish Times as saying that Google hires lazy engineers: we're looking for people who don't want to do the same thing over and over again, who want to automate themselves out of a job, using software to accomplish tasks that would normally be done by sysadmins. The other thing SRE focuses on is designing more reliable and operable service architectures from the very start.

In terms of what SRE teams do, there's really this focus on developing solutions to design, build, and run large-scale systems, and we want to do it scalably, reliably, and efficiently. We operate at the interface between software engineering and systems engineering, guiding system architecture. We consider SRE a job function, a mindset, and a set of engineering approaches to running better production systems, and we like to describe SREs as constructive pessimists: we hope for the best, but we recognize that hope is not a strategy, and thus we plan for the worst.

OK, so how does this all tie in with DevOps? We are at a DevOps Days conference, after all. We like to think about this as "class SRE implements DevOps": if DevOps is this set of practices, guidelines, and culture designed to break down silos and so on, site reliability engineering is a set of practices that we've found to work, some beliefs that animate those practices, and a job role. As one of my colleagues has quipped, it's sort of an opinionated implementation of DevOps, to a certain degree. And again, the key areas of focus are actually the same.

the same all right so what is this area at its core let's talk about what you
could argue is the one the one key principle of site reliability
engineering I thought the whole discipline is sort of built around and
that's this concept of error budgets so to talk about error budgets we need to
talk for a moment about reliability and specifically how you actually go about
measuring measuring reliability you could take a naive approach to this so
you know availability is simply when things are good the amount of time that
things are good divided by the total time that you're
measuring so so what a time what amount of time is yet the fraction of time that
your service is available in working its intuitive for humans and you know it's
it's easy enough to measure if you're looking at like is the system up or is
the system down like up sort of a binary measurement but if you're running a
distributed system it's a bit trickier so like what about the case where you
know is the server up or down if it's not currently getting requests and then
if one of three service servers are down is the service up up or down so there
can be some ambiguity here so we can talk about a slightly more sophisticated
approach to measuring reliability and this is simply looking at the number of
good interactions divided by some total number of interactions with your system
so which fraction of your actual users experience your system is being
available in working so this actually works well if you're running a
distributed system and enables there was sort of slightly more ambiguous cases
that we talked about on the last slide
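To make that concrete, here is a minimal sketch in Python of a request-based SLI. The specific definition of "good" (success status plus a 300 ms latency cutoff) and the sample data are illustrative assumptions, not anything prescribed in the talk.

# Minimal sketch of a request-based SLI: good events / total events.
# "Status < 500 and fast enough counts as good" is an illustrative rule;
# you would define "good" however your users experience success.
def request_sli(responses):
    """responses: iterable of (status_code, latency_ms) tuples."""
    total = 0
    good = 0
    for status, latency_ms in responses:
        total += 1
        if status < 500 and latency_ms <= 300:
            good += 1
    return good / total if total else 1.0  # no traffic: vacuously "good"

sample = [(200, 120), (200, 450), (503, 80), (200, 95)]
print(f"SLI = {request_sli(sample):.2%}")  # 2 of 4 requests were good -> 50.00%
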
This just gives a picture; it's important to realize that it's not just about your system being hard down, there's more nuance here, and we've got a list up here on the slide. It's really important to think about the amount of downtime that you're actually willing to tolerate for your system. We talk a lot about whether you're running a three-nines service, a four-nines service, or a five-nines service. If you're running a three-nines service, you're talking about less than nine hours of downtime per year, and if you're running a five-nines service, if that level of reliability is important to your users, you can only be down for about five minutes a year. So there are big differences here in terms of cost and in terms of how you're meeting your user expectations.

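As a quick back-of-the-envelope check on those numbers, this small sketch converts an availability target into the downtime it allows per year (using the time-based view of availability):

# Rough downtime-per-year allowed by a given availability target.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for target in (0.999, 0.9999, 0.99999):   # three, four, five nines
    downtime_min = (1 - target) * MINUTES_PER_YEAR
    print(f"{target:.3%} -> {downtime_min / 60:.1f} h "
          f"({downtime_min:.1f} min) of downtime per year")

Three nines works out to roughly 8.8 hours per year and five nines to roughly 5 minutes per year, which is where the figures on the slide come from.
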
You'll notice that on that last slide we talked about three nines, four nines, five nines; we never talked about 100% uptime as the target. Ben Treynor Sloss, our VP of 24/7 engineering at Google and the founder of site reliability engineering at Google, says that 100% is the wrong reliability target for basically everything. It's pretty much an impossible target to achieve, and that's where error budgets come in: if you're not targeting 100%, what are you actually targeting? What's going to be an acceptable level of uptime for your users?

Error budgets are basically an agreement between product management and site reliability engineering. You define an availability target based on what your users are expecting, you take 100%, you subtract out that availability target, and there you have your budget of unreliability: your error budget. Once you have this in place, and you've got monitoring in place to measure the actual uptime, you've got a control loop for understanding how you're utilizing that budget. In a lot of ways this takes the emotion out of working between devs and SREs and the tension that's built in there.

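Here is a hedged sketch of what that control loop looks like in code; the SLO target and the measured numbers are made up for illustration.

# Sketch of the error-budget control loop: the target comes from the SLO,
# the measured numbers come from monitoring. All values here are made up.
slo_target = 0.999                      # agreed with product management
total_requests = 10_000_000             # observed over the SLO window
failed_requests = 7_200                 # observed over the SLO window

error_budget = 1 - slo_target                        # fraction of requests allowed to fail
budget_in_requests = error_budget * total_requests   # 10,000 failures allowed this window
budget_spent = failed_requests / budget_in_requests

print(f"Error budget: {budget_in_requests:,.0f} bad requests this window")
print(f"Budget spent: {budget_spent:.0%}")
if budget_spent >= 1.0:
    print("Budget exhausted: freeze risky launches, focus on reliability.")
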
OK, so once again, what are some of the benefits of using this particular concept? Error budgets provide a common incentive for both devs and site reliability engineers: it's about finding that balance between innovation and reliability, and because you agree on it in advance, you take some of the emotion out of the conversation. Error budgets allow dev teams to manage risk for themselves; they can decide how they want to spend that error budget. Is it on launching new features? Is it on experimenting? How are they going to do this? It also makes unrealistic reliability goals unattractive, because the higher you set that bar, the lower your error budget, and then you're really dampening the velocity of innovation that's possible. Dev teams become self-policing: the error budget is this valuable resource for them, and they can move as fast as they can up until the point that the budget is exhausted, and then they really have to dial it back and focus on reliability. Error budgets also make it clear that there's shared responsibility for system uptime. It's not just "throw it over the fence to the operations team"; we're working together to keep the system up and running at an appropriate level.

All right, so talking about error budgets, it all boils down to three concepts: SLIs, SLOs, and SLAs. SLIs are service level indicators: how do you actually observe and measure whether your system is being successful enough? The SLO is your service level objective: your top-line target for the fraction of good interactions, the goal, whether that's a three-nines service or a four-nines service, whatever you're aiming for. And then the SLA, or service level agreement, is about contracts: it's what you're promising your users, and there can be consequences, so if you don't meet your SLA then maybe you have to pay penalties or something. Typically your SLA would be your SLO plus a bit of a buffer.

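To make the three terms concrete, here is a small illustrative spec for one service. The indicator wording, the numbers, the window, and the penalty clause are all invented for the example.

# Illustrative only: one way an SLI, SLO, and SLA for a service might be
# written down. Everything below is an invented example, not a real spec.
availability_spec = {
    "sli": "fraction of HTTP requests answered successfully in < 300 ms",
    "slo": 0.999,          # internal objective agreed with product management
    "sla": 0.995,          # external contract: the SLO plus a buffer
    "sla_penalty": "service credits owed to customers if the SLA is missed",
    "window": "rolling 30 days",
}

Note that the externally promised SLA (99.5%) is looser than the internal SLO (99.9%), which is what "SLO plus a bit of a buffer" means in practice.
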
OK, so let's talk a little bit about SLO definition and measurement. Again, SLOs are a target for SLIs aggregated over time, and it's important to try to exceed your SLO target, but not by too much, because if you exceed it by too much you're leaving velocity on the table. Choosing an appropriate SLO can be quite complex, so it's useful to keep it simple at the start: get something out there, get something documented to start the conversation, and then it can evolve over time. Setting SLOs is important because it sets priorities and constraints for SREs and devs, and it sets user expectations about the level of service they can expect. Other questions to think about when you're defining or considering your SLOs: Where is your SLO documented? How do you know that the SLO you've set actually matches your customer expectations? How often do you review your SLO? Do you consider your SLO in your system design process? And how do you actually measure SLO compliance? These are good things to keep in mind.

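On that last question, here is a minimal sketch of measuring SLO compliance by aggregating the SLI over a window; the daily good/total counts are made-up data.

# Sketch of SLO compliance: aggregate the SLI over a window, compare to the
# objective. The daily (good, total) request counts below are made up.
daily_counts = [
    (995_000, 1_000_000),
    (999_500, 1_000_000),
    (998_000, 1_000_000),
]
slo = 0.999

good = sum(g for g, _ in daily_counts)
total = sum(t for _, t in daily_counts)
sli = good / total
print(f"SLI over window: {sli:.4%}  (SLO {slo:.1%})")
print("compliant" if sli >= slo else "out of SLO -- error budget being spent")
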
OK, so the great thing is that service level objectives and SRE really help bridge the business, development, and operations, giving that whole set of stakeholders a common language, unlike agile, which operates at the interface between business and development, or DevOps, whose traditional domain is the interface between development and operations. And if we look here, just by talking about SLOs and error budgets, we've actually addressed three key areas of importance to DevOps: you've got the element of shared ownership, you've got error budgets, which acknowledge that failure is going to happen, and you have ways to measure reliability, so again you're making data-driven decisions and removing some of that emotion from the teams that are working together.

All right, so let's move on now and talk a little bit about the practices of SRE. SLOs and error budgets are the key principles, but how do we actually defend that SLO on a day-to-day basis? We need to work on a few key areas of practice. The key areas of practice for site reliability engineering are metrics and monitoring, capacity planning, change management, and emergency response, and don't ever underestimate the cultural component to this as well.

So, monitoring and alerting. Monitoring is really your primary means of determining and maintaining reliability: how do you know if you're meeting your SLO if you don't have appropriate measures in place? Alerting is about triggering notifications when certain threshold conditions are met. The person on call gets paged if there is a situation where immediate human response is required, for example if your SLO is at risk or you're about to burn your entire error budget. An alert could instead generate a ticket if a human needs to take action but immediacy is not critical in that particular case. The other thing to point out here is that you only want to involve humans when your SLO is threatened. You shouldn't have humans watching dashboards like a hawk or reading log files just to determine if the system is OK; having appropriate monitoring and alerting solves for this, and you'll be alerted when bad things happen.

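One common way to operationalize that page-versus-ticket decision is to look at how fast the error budget is burning. Here is a rough sketch; the burn-rate thresholds are illustrative assumptions, not numbers from the talk.

# Rough sketch of page-vs-ticket based on error-budget burn rate.
# Thresholds are illustrative, not prescriptive.
def alert_decision(error_rate, slo_target=0.999):
    """error_rate: observed fraction of bad requests over a recent window."""
    budget = 1 - slo_target
    burn_rate = error_rate / budget        # 1.0 = consuming the budget exactly on pace
    if burn_rate >= 10:                    # budget gone in days: wake a human
        return "PAGE the on-caller"
    if burn_rate >= 1:                     # budget being spent too fast, but not urgent
        return "file a TICKET for follow-up"
    return "no human needed"

print(alert_decision(error_rate=0.02))     # 20x burn  -> page
print(alert_decision(error_rate=0.002))    # 2x burn   -> ticket
print(alert_decision(error_rate=0.0005))   # in budget -> no alert
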
OK, so demand forecasting and capacity planning is an important element as well. The public cloud scales infinitely in theory, but there's a cost associated with adding more resources to your service; your CFO might not be happy if you just throw a bunch more machines at the problem. So it's important to actually understand how much capacity you need for your service, taking into account things like organic growth (are you seeing increased product adoption and usage by your customers?) and any inorganic growth you're expecting (seasonality, feature launches, marketing campaigns, and so on) and what that's going to do to your service. Then it's important to be able to correlate the raw resources you're using to run your service with actual service capacity: if you have X amount of resources, how much QPS will that serve? And you want to make sure you have enough spare capacity to meet your reliability goals.

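Here is a back-of-the-envelope capacity-planning sketch along those lines; every number (current load, growth rates, per-replica QPS, the "+2" headroom) is an assumption for illustration.

# Back-of-the-envelope capacity planning. All numbers are assumptions.
current_peak_qps = 4_000
organic_growth = 0.30        # expected product adoption over the planning horizon
inorganic_growth = 0.50      # e.g. a seasonal peak or a marketing campaign
qps_per_replica = 200        # measured capacity of one instance of the service

forecast_qps = current_peak_qps * (1 + organic_growth) * (1 + inorganic_growth)
replicas_needed = -(-forecast_qps // qps_per_replica)   # ceiling division
replicas_with_headroom = int(replicas_needed) + 2       # spare capacity for failures

print(f"Forecast peak: {forecast_qps:,.0f} QPS")
print(f"Provision {replicas_with_headroom} replicas "
      f"({replicas_needed:.0f} needed + 2 for failures and maintenance)")
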
We can also talk a little bit more about efficiency and performance. As I was mentioning, extra capacity can be quite expensive, so you really want to keep a close eye on utilization and optimize it. Resource use is typically a function of demand (the load on your system), your capacity, and your software's efficiency, so as a service owner you need good prediction, good provisioning, and SREs need to be able to modify the software as needed. SREs also monitor utilization and performance, so if there is a regression, the SREs can act on it. Less experienced teams might respond by throwing more resources at the problem; more mature, experienced teams respond by saying, hey, this feature we just launched is causing problems, so maybe let's roll back and figure out what's happening before we add more resources to the problem.

OK, so change management is also pretty critical to site reliability engineering practice. At Google we've found that about 70% of outages are due to changes made to a live system, whether it's a binary push or a configuration change, and change is basically a constant: where I work and where you work, change is unavoidable, and we're constantly making changes to our live systems. So how do we actually manage the risk associated with change? There are a few mitigations we think about: canary rollouts, so we don't just launch something with one big bang where terrible things can happen; quickly and accurately detecting problems, so having good monitoring to minimize the mean time to detect; and having the ability to roll back changes safely when problems do arise, to help with the mean time to resolve. The other important thing here is that if you remove humans from the loop with automation, it reduces errors, reduces fatigue, and can actually improve velocity in a lot of ways: machines can react a lot faster than a human can, and that really improves those time-to-resolve kinds of metrics.

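Here is a toy sketch of that canary-then-rollback idea. The stage sizes, the soak time, and the health check are stand-ins for whatever deployment tooling and SLI comparisons you actually use.

# Toy sketch of a progressive (canary) rollout with automatic rollback.
# Stage sizes, soak time, and health_check are stand-ins for real tooling.
import time

STAGES = [0.01, 0.05, 0.25, 1.00]       # fraction of traffic on the new version

def health_check() -> bool:
    """Placeholder: would compare the canary's SLIs against the baseline."""
    return True

def rollout(deploy, rollback, soak_seconds=600):
    for fraction in STAGES:
        deploy(fraction)                # shift this fraction of traffic to the canary
        time.sleep(soak_seconds)        # let monitoring observe the canary
        if not health_check():
            rollback()                  # a fast, safe rollback beats debugging live
            return False
    return True

The design choice being sketched is exactly the one in the talk: expose a small slice of traffic first, watch the monitoring, and prefer rolling back over pushing forward when the data looks bad.
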
OK, so we've talked a little bit already about how 100% is basically the wrong reliability target for anything. You want to determine the desired reliability for your product and then not try to provide better quality than what's desired or expected by your users, or than your users will even notice. And again, the goal here is not to go slow, but to go as fast as possible given the error budget that you have: you go as fast as possible until you exhaust that error budget, and then you turn inward and look at reliability. The goal is to increase development velocity, not to have zero outages, and to achieve maximum velocity within that particular budget. Again, devs can decide to use that error budget for releases, experiments, whatever they think will give them the most bang for the buck.

OK, so what about when things go wrong? The fourth key area of responsibility for site reliability engineers is the emergency response and incident management function. What happens when things break? Things break, that's life; it's just a matter of when, and of how serious it's going to be when they do. It's important to note that people don't necessarily react well to emergencies; it can be a pretty stressful time. So first of all: don't panic. You're not alone, and people are not going to die if your software goes down, at least most likely not; sure, in the healthcare industry or maybe aviation there are a few exceptions, but still. What you want to do is mitigate, troubleshoot, and fix the issue, and don't be afraid to ask for help; I think that's a key element here as well. And since incident response can be very stressful, it's important to put well-defined processes in place and to practice them before you hit a real situation. It adds confidence, it helps you stay calm in an emergency and focus on getting your service back up and running. And then once that's done, once you've solved the initial issue, take some time afterwards to make sure the problem doesn't happen again.

All right, so let's talk a little bit more about incident thresholds and postmortems. One of the issues that often comes up is that people don't pull in help when they need it: they don't ask for help, it's "I've got this, I'm going to be the hero, I'm going to solve this issue." One of the things you can do to guard against this is to define incident thresholds: at what point do you declare an incident, and therefore make additional resources available and bring in other responders to help with the problem? Some thresholds you might consider are user-visible downtime or degradation beyond some specified level, data loss of any kind, the on-call engineer having made any sort of significant intervention, or a resolution time above some threshold. Defining this in advance, before an incident starts, is really important for taking the guesswork out of the situation. Oftentimes these thresholds are tied to how much damage you're doing to your SLO and to your error budget: if your SLO isn't at risk, no big deal; if it is, you want to get in there and put more resources on the problem.

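A minimal sketch of what pre-agreed incident thresholds could look like in code; the specific limits are examples, and the point is only that they are written down before anything breaks.

# Sketch of pre-agreed incident-declaration thresholds. Limits are examples.
def should_declare_incident(user_visible_downtime_min, data_loss,
                            significant_intervention, est_resolution_min):
    return (user_visible_downtime_min > 5      # user-visible impact beyond a threshold
            or data_loss                       # data loss of any kind
            or significant_intervention        # on-caller had to intervene significantly
            or est_resolution_min > 60)        # resolution time above a threshold

# The on-caller plugs in what they know; True means open an incident and
# pull in more responders rather than playing the hero alone.
print(should_declare_incident(user_visible_downtime_min=0, data_loss=False,
                              significant_intervention=True,
                              est_resolution_min=30))   # True -> declare
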
OK, so now let's talk a little bit about postmortem philosophy. Whenever there's an incident, you're paying a price; something has gone wrong and you've paid that price. A postmortem is the gift you give to yourself to ensure that you learn from what happened and do what you can to make sure the incident doesn't happen again. You want to make sure the incident is documented, that the root causes are well understood, and that effective preventative measures are put in place to reduce the likelihood of recurrence. Writing a postmortem is not meant to be a punishment; it's just something that's expected, and it can be quite helpful after a bad event.

I can't overstate the importance of blamelessness in this process, as a key component of site reliability engineering culture. Postmortems need to focus on identifying contributing causes, not on pointing fingers at people; humans are never the root cause of incidents. It's all about what it was in the system that allowed the human to do the thing that went wrong. A blameless postmortem assumes everyone involved had good intentions; human errors are really system errors. You can't fix people, but you can fix systems to make it easier to do the right thing and harder to do the wrong thing. And if a culture of finger-pointing prevails, people are not going to bring issues forward; they're going to sweep things under the rug if they're worried about their co-workers coming after them with pitchforks, or about losing their job because they did something wrong, and that just makes your systems more brittle and is bad for everyone.

All right, I've got to crank through this because I'm running out of time, but I want to say a few words about toil management. Operational work is important for site reliability engineers because you get experience with real failures; it's the wisdom of operations. You can't automate everything, but if you do enough ops work, you know where the big wins are in terms of your automation. Toil is what we consider work that's manual, repetitive, automatable, tactical, without enduring value, or that grows with the size of your service; effectively, you're feeding blood to the machines. In site reliability engineering at Google, we cap toil and operational work at 50% of an engineer's time. Operational work is important, but if you're doing more than 50%, you're not an SRE.

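A trivial sketch of tracking that 50% cap: add up where an SRE's time went over a quarter and flag when operational work crowds out engineering. The categories and hours are made up.

# Trivial sketch of the 50% toil cap. Categories and hours are made up.
CAP = 0.50
time_spent = {"on-call": 120, "tickets": 90, "interrupts": 60,   # toil-ish work
              "automation": 100, "project work": 100}            # engineering work

toil_hours = time_spent["on-call"] + time_spent["tickets"] + time_spent["interrupts"]
total_hours = sum(time_spent.values())
toil_fraction = toil_hours / total_hours

print(f"Toil: {toil_fraction:.0%} of time")
if toil_fraction > CAP:
    print("Over the 50% cap: hand work back to the dev team or automate it away.")
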
I did want to mention that empowering SREs is super critical; there needs to be organizational buy-in or this stuff just doesn't work. SREs must be empowered to enforce the error budget and the toil limits. At Google, at least, SREs are about 10% of the overall engineering population, so only the most important services have SRE support, and you want to use those resources wisely. And we don't want SREs to take on too much of an operational burden: allow them to load-shed or hand the pager back to the developers if the system needs additional work.

All right, so to recap: metrics and monitoring, capacity planning, change management, emergency response, and these cultural elements are critical to SRE practice. We've talked about how SRE implements the five key areas of DevOps; when you factor in and layer on the automation and the blameless postmortems, we've now got those five key areas covered.

Super quickly, how to get started if you want to embark on this journey: the first thing you can do is start with service level objectives, so put an SLO and an error budget in place and iterate from there. Hire people who write software; like I said, we joke that we hire lazy engineers, people who quickly get bored doing the same things over and over again. You really need to work as an organization to ensure parity of respect with the rest of the development and engineering organization, and again, provide that feedback loop of self-regulation: SRE teams choose their work, and they must be able to load-shed if their ops load gets too high. Start small, canary it, and go from there, and once you have some success and some data you can point to, spread the love.

To conclude, if you're interested in learning more on the subject, I just wanted to highlight that we've written a couple of books about site reliability engineering. The first book was written back in 2016 and covers the foundational principles, practices, and culture of the discipline. Google also published The Site Reliability Workbook, which gives more practical examples of how to implement SRE concepts, and then Seeking SRE is more of an industry-wide review of site reliability engineering practices. google.com/sre has the full text of the original book, the second book will be coming out in January, I believe, and there are lots of other interesting resources there as well. And with that, I'm out. Thanks, folks!
