Flashcards for revising knowledge from Site Reliability Engineering: How Google Runs Production Systems
What are the direct and indirect costs of a traditional ‘Ops’ team
The direct costs are that the team scales linearly with the product scale due to manual work. The indirect cost is that separate ‘dev’ and ‘ops’ teams have different goals which leads to conflict between the teams
How much of an SRE’s time should be spent doing manual or ‘Ops’ work? Why?
No more than 50% of an SRE’s time should be spent on manual work. The cap ensures that at least 50% of their time goes to improving the system and reducing manual work.
What is the difference between an automatic and automated system?
An automated system is one that can resolve an issue but requires manual intervention to start this process, e.g. starting an automated script when a certain failure occurs. An automatic system is one that can trigger this process automatically when the failure conditions are met and doesn’t require any manual intervention
What actions should be taken if SREs are spending more than 50% of their time on manual work?
Either additional SREs should be added to the team without assigning additional operational responsibilities or manual work should be offloaded to the development team
What are the eight main responsibilities of SRE teams?
Availability
Latency
Performance
Efficiency
Change Management
Monitoring
Emergency Response
Capacity Planning
What is the benefit of offloading manual work to development teams if SREs go above the 50% cap?
This provides a feedback mechanism where developers are incentivised to build systems that don’t require manual intervention otherwise they may need to support it. It also reduces the amount of new code being deployed which could further increase the amount of manual work
How many events should an SRE receive during an 8-12 hour on-call shift?
A maximum of two
How is an error budget calculated?
One minus the service’s availability target, e.g. a system aiming for 99.9% availability has an error budget of 0.1%
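The calculation can be sketched in a few lines of Python. The function names and the 30-day period are assumptions chosen for illustration, not something the book prescribes.

```python
def error_budget(availability_target: float) -> float:
    """Return the error budget as a fraction, e.g. 0.999 -> roughly 0.001."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be a fraction between 0 and 1")
    return 1.0 - availability_target


def allowed_downtime_minutes(availability_target: float, period_days: int = 30) -> float:
    """Minutes of full outage the budget permits over the period (30 days is an assumption)."""
    return error_budget(availability_target) * period_days * 24 * 60


print(f"budget: {error_budget(0.999):.2%}")
print(f"downtime: {allowed_downtime_minutes(0.999):.1f} minutes per 30 days")
```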
When should a monitoring system alert a human?
When a human needs to take action
What are the three types of monitoring output?
Alerts, tickets and logging
What is MTTF and MTTR?
Mean Time to Failure and Mean Time to Repair
What is the main cause of outages in a live system?
Changes. These account for ~70% of system outages
Name three strategies for preventing outages when rolling out a change
Progressive rollouts
Quickly and accurately detecting problems
Tested rollback process for when problems arise
In capacity planning what is organic and inorganic growth?
Organic growth is natural growth through gradual product adoption. Inorganic growth is sudden growth through feature launches or marketing campaigns
Why can increasing service reliability be bad for users?
Increasing reliability can cost more to develop, run and maintain. It also means fewer features can be developed and these features take longer to develop
Why might a user not notice the difference between high and extreme reliability?
If the user experience is dominated by less reliable components. e.g. a user on a 99.9% reliable smartphone will not notice the difference between 99.99% and 99.999% service reliability
Why do you not want to drastically exceed your reliability target?
This can come at the cost of agility or operational cost and can set unrealistic user expectations
What two ways are there to calculate availability?
Time based availability (uptime/(uptime+downtime)) and aggregate availability (successful requests/total requests)
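A minimal sketch of the two formulas, with hypothetical function names and made-up numbers. The example shows how a short partial failure during peak traffic barely moves the time-based figure but is fully counted in the aggregate figure.

```python
def time_based_availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """uptime / (uptime + downtime)"""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("no time observed")
    return uptime_seconds / total


def aggregate_availability(successful_requests: int, total_requests: int) -> float:
    """successful requests / total requests"""
    if total_requests <= 0:
        raise ValueError("no requests observed")
    return successful_requests / total_requests


# Hypothetical day: the service was never fully down, but 2,000 of
# 1,000,000 requests failed during a partial outage at peak traffic.
print(time_based_availability(uptime_seconds=86_400, downtime_seconds=0))
print(aggregate_availability(successful_requests=998_000, total_requests=1_000_000))
```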
Why might aggregate availability be used instead of time based?
Aggregate availability more accurately captures partial failures, which are more common in globally distributed systems that may never have a total outage. It also gives more weight to periods of high traffic.
What four factors need considering when assessing the risk tolerance of a service?
What level of availability is required
Do different failure types have different effects on the service
Service cost and revenue
What other service metrics are important
Give five questions that should be asked when setting service availability targets
What level of service do users expect?
Does the service tie directly to revenue (ours or customers)?
Is the service paid for or free?
What availability do competitors provide?
Is the service targeted at consumers or enterprise?
Why are different types of failures important to consider when identifying risk tolerance of a service? Give some examples
Some failures may have larger impacts on the service or service users. For example some services may prefer frequent partial failures to infrequent total outages, certain data corruption may be more tolerable or certain time windows may be more critical (9-5)
How may lost revenue from an outage impact availability targets?
If lost revenue exceeds the cost to make the service more reliable the availability targets should be made higher
A $1M revenue service is moving from 99.9% to 99.99% availability. Calculate the increase in revenue
Increase in availability = 99.99% - 99.9% = 0.09% = 0.0009
Increase in revenue = $1M × 0.09% = 1,000,000 × 0.0009 = $900
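The same arithmetic as a small Python helper. The function name is hypothetical, and the model assumes revenue scales linearly with availability, as the worked answer does.

```python
def incremental_revenue(revenue: float, current: float, proposed: float) -> float:
    """Extra revenue from raising availability, assuming revenue scales linearly with it."""
    return revenue * (proposed - current)


gain = incremental_revenue(1_000_000, current=0.999, proposed=0.9999)
print(f"${gain:,.0f}")
```

If the improvement costs less than this figure it may be worth making; if it costs more, the higher target is hard to justify on revenue alone.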
Why is it important to identify other important metrics when defining service risk tolerance?
This helps guide decisions with tradeoffs or to take more considered risks. e.g. identifying latency targets which have high impact to the service when crossed
What is an error budget?
A metric that determines how unreliable a service is allowed to be over a period of time, usually downtime or failed queries allowed per quarter
What action should be taken when an error budget is breached?
Releases should be stopped (or at least slowed)
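One way to picture this policy is a release gate that compares failures so far against the budget for the period. This is only a sketch: the function names are hypothetical and the budget is counted in failed requests, which is one of several reasonable choices.

```python
def budget_remaining(target: float, total_requests: int, failed_requests: int) -> float:
    """Failed requests still allowed this period before the error budget is spent."""
    allowed_failures = (1.0 - target) * total_requests
    return allowed_failures - failed_requests


def can_release(target: float, total_requests: int, failed_requests: int) -> bool:
    """Allow releases only while some error budget is left."""
    return budget_remaining(target, total_requests, failed_requests) > 0


print(can_release(0.999, total_requests=1_000_000, failed_requests=500))
print(can_release(0.999, total_requests=1_000_000, failed_requests=1_500))
```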
What is the advantage of stopping releases when an error budget is breached?
It reduces the chance of further failure, encourages developers to write more stable code and allows them to spend time improving service reliability instead
What is an SLI?
A Service Level Indicator is a measure of a service that indicates the level of service provided. e.g. measuring latency or uptime
What is an SLO? Give some examples of SLOs for latency and availability
A Service Level Objective is a target value or range of values for an SLI. e.g. A service may have an SLO of latency < 100ms or availability target of > 99.9%
How should an SLI relate to an SLO in an ideal system?
A service SLI should be within the bounds of the SLO
lower SLO bound < SLI < upper SLO bound
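The bound check above can be sketched as a small Python function. The name `meets_slo` and the optional-bound design are assumptions for illustration; most SLOs only set one of the two bounds.

```python
from typing import Optional


def meets_slo(sli: float, lower: Optional[float] = None, upper: Optional[float] = None) -> bool:
    """True when lower < sli < upper, skipping any bound that is not set."""
    if lower is not None and not sli > lower:
        return False
    if upper is not None and not sli < upper:
        return False
    return True


print(meets_slo(80, upper=100))         # latency in ms against "latency < 100ms"
print(meets_slo(0.9995, lower=0.999))   # availability against "> 99.9%"
```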
What caused the GCP outage in October 2019?
Dependencies in Bigtable and GFS relied on Chubby always being available as it had gone so long without downtime. A temporary Chubby outage then caused errors in these components which caused a cascading failure
What is Google Chubby?
A lock service for distributed systems used by Bigtable and GFS
How did Google prevent Chubby causing another outage similar to the one in October 2019?
If Chubby drastically exceeds its availability SLO then it is deliberately taken offline to flush out any unreasonable dependencies
What is an SLA?
A Service Level Agreement is a contract with your users that includes the consequences of breaching the SLOs they contain (usually in the form of financial penalties)
What four SLIs are generally most important for user facing systems?
Availability
Latency
Throughput
Correctness
When measuring latency what is more useful, client or server latency?
Client latency as it more closely reflects the user experience (although it is more difficult to measure)
What issues may be hidden by using an average latency as an SLI?
Tail latency: a small fraction of requests may take significantly longer than the rest while the average still looks healthy
Why shouldn’t an SLO be set based on current performance?
Because current performance does not reflect what users need, and the team may be committed to heroic effort to keep meeting the target
What five principles should be used when choosing SLO targets?
Don’t pick a target based on current performance
Keep it simple
Avoid absolutes
Have as few SLOs as possible
Perfection can wait, start loose and tighten
What is an SLO safety margin?
Having a tighter internal SLO than the advertised SLO so you have room to respond to issues before they become visible externally
Why should you aim to have as few SLOs as possible?
So these SLOs accurately capture what is important to the system and can guide decision making
Name three strategies for avoiding over dependence on a system
Planned outages (Chubby)
Throttling requests
Designing the system so it isn’t faster under lighter loads
Define Mean Time to Failure
The average time between system outages
Define Mean Time to Repair
The average time to bring a system back online after an outage
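As a rough illustration, both averages can be computed from an incident history, and the standard steady-state relation availability = MTTF / (MTTF + MTTR) ties them back to availability. The durations below are invented and the function name is hypothetical.

```python
from statistics import mean


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, given mean time to failure and to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical history: hours of healthy running before each outage,
# and hours taken to restore service after each outage.
hours_until_failure = [700, 740]
hours_to_repair = [1, 3]

mttf = mean(hours_until_failure)
mttr = mean(hours_to_repair)
print(mttf, mttr, round(steady_state_availability(mttf, mttr), 5))
```

Availability improves either by failing less often (higher MTTF) or by recovering faster (lower MTTR).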
What can more than two events per 8-12 hour on call shift lead to?
Pager fatigue, and too little time to investigate problems properly or write post mortems
When should a monitoring system create a ticket?
When a human needs to take action but not immediately
What is the purpose of a monitoring system taking logs?
They may be useful in diagnosing issues in future alerts or tickets
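The three monitoring outputs (alerts, tickets and logging) amount to a simple decision rule, sketched here in Python. The function and parameter names are hypothetical.

```python
def monitoring_output(needs_human_action: bool, needs_action_immediately: bool) -> str:
    """Pick the monitoring output for an event."""
    if needs_human_action and needs_action_immediately:
        return "alert"   # page a human now
    if needs_human_action:
        return "ticket"  # a human must act, but not right away
    return "log"         # nobody needs to act; keep it for later diagnosis


print(monitoring_output(True, True))
print(monitoring_output(True, False))
print(monitoring_output(False, False))
```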
What are some absolutes that should be avoided when setting SLOs?
“scale infinitely” or “always available”
What does keep it simple mean when setting SLOs?
Avoiding complex aggregations
What is toil? Give six attributes that help to categorize work as toil
Toil is work related to running a production system that tends to be manual, repetitive, automatable, tactical, devoid of enduring value and O(n) with service growth
Is automatable work more or less likely to be toil?
More likely
Is work that requires human judgement more or less likely to be toil?
Less likely
What is tactical work? Is it more or less likely to be toil?
Interrupt driven, reactive work such as pager events. It is more likely to be toil
What is strategic work? Is it more or less likely to be toil?
Strategic work is proactive work and is less likely to be toil
What does work being O(n) with service growth mean?
As the service grows (service size, user count, traffic volume) the amount of work scales linearly
Is work being O(n) with service growth more or less likely to be toil?
More likely
What is overhead work? Give some examples
Administrative work not tied to running a service. e.g. hiring, HR paperwork, company meetings, training
When checking if a team is spending more than 50% of their time doing toil what time frame should be used? Why?
A few quarters or a year as toil work is often ‘spiky’
What are seven disadvantages of too much toil?
Career stagnation
Low morale
Creates confusion
Slows progress
Sets precedent
Promotes attrition
Causes breach of faith
What precedent can be set if an SRE team takes on too much toil
That toil work can be offloaded to the SRE team, either directly or indirectly through code that requires toil
What is white box monitoring? Give three examples of metrics
Monitoring based on the internals of a system. e.g. number of HTTP requests, resource usage, server side latency
What is black box monitoring? Give two examples of metrics
Monitoring based on externally visible behaviour as a user would see it. e.g. page availability, client side latency
What types of problems do black and white box monitoring detect?
Black box detects active problems and white box imminent problems
What balance of white box and black box monitoring should there be?
Heavy use of white box monitoring and modest use of black box
What are the four golden monitoring signals? (one word each)
Latency
Traffic
Errors
Saturation
Why is it important to track latency of successful and failed requests separately?
Failed requests may fail very quickly or very slowly, so including them in the overall latency can give misleading results
In the four golden monitoring signals what is traffic measuring? Give an example metric for a web service
How much demand is being placed on your systems. e.g. HTTP requests per second
In the four golden monitoring signals what are the three categories of errors? Give an example of each
Failed requests can be explicit (HTTP 500 responses), implicit (HTTP 200 responses containing incorrect content) or by policy (response time over the SLO)
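The three categories can be sketched as a classifier over a single response. The function name, the `content_ok` flag and the 1000 ms default threshold are all assumptions made for illustration.

```python
def classify_response(status: int, content_ok: bool, latency_ms: float,
                      slo_latency_ms: float = 1000.0) -> str:
    """Label a response as an explicit, implicit or policy error, or 'ok'."""
    if status >= 500:
        return "explicit"  # the server reported a failure
    if not content_ok:
        return "implicit"  # success status, but the content is wrong
    if latency_ms > slo_latency_ms:
        return "policy"    # correct response, but slower than the SLO allows
    return "ok"


print(classify_response(500, content_ok=False, latency_ms=20))
print(classify_response(200, content_ok=False, latency_ms=20))
print(classify_response(200, content_ok=True, latency_ms=1500))
```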
In the four golden monitoring signals what does saturation measure? What metrics should be emphasized?
Saturation measures how ‘full’ your service is, emphasizing the resources the service is most constrained by
What does ‘worrying about your tail’ mean in terms of monitoring?
It’s often better to look at the distribution of a metric rather than the mean as infrequent anomalies in the metric dominate the user experience (e.g. 1% of requests taking 50x the average latency)
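A small Python sketch of the effect, using invented numbers: 98% of requests take 100 ms and a hypothetical 2% take 50x longer. The percentile helper uses the nearest-rank method, which is one of several common definitions.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented data: 980 fast requests and 20 slow ones.
latencies_ms = [100] * 980 + [5000] * 20

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
print(mean_ms, p50_ms, p99_ms)
```

The mean and the median both look healthy here, while the 99th percentile exposes the slow tail that some users actually experience.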
Why is it important to use appropriate resolution when recording metrics?
If the resolution is too high then the cost to collect, store and analyze will be too high. If it is too low then it becomes impossible to detect issues.
What are metric buckets and why are they used?
Instead of recording absolute metric values, buckets that store approximate values are used to produce a histogram. For example, measure CPU utilization to the nearest 5% every second, increment the count in the corresponding bucket, and return the histogram at the end of the minute. This reduces the cost of collecting, storing and analyzing metrics.
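The CPU example can be sketched with a counter keyed by bucket. The helper name and the sample values are made up; a real collector would take one sample per second and emit the histogram once a minute.

```python
from collections import Counter


def bucket_for(value: float, width: int = 5) -> int:
    """Round a reading to the nearest multiple of width (halves round up)."""
    return int((value + width / 2) // width) * width


# Hypothetical per-second CPU utilization readings, in percent.
samples = [12, 13, 14, 41, 43, 97]

histogram = Counter(bucket_for(s) for s in samples)
print(sorted(histogram.items()))
```

Only the bucket counts are kept, so storage cost depends on the number of buckets, not on the number of samples.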
What five questions should be asked when creating a new monitoring rule or alert?
Is the alert urgent, actionable and active or imminently user-visible?
Can I ever ignore the alert?
Are there cases where the alert is triggered and users aren’t impacted, such as drained traffic or test deployments?
How urgent is action needed and is it a short term workaround or long term fix?
Are other people also getting paged for the same issue?
Why might you reduce the availability SLO of a service to improve availability?
It reduces the amount of effort spent on short term availability improvements (fire fighting) so more effort can be spent on long term improvements.