SRE: How Google Runs Production Systems

Description and Tags

Flashcards for revising knowledge from Site Reliability Engineering: How Google Runs Production Systems

75 Terms

New cards

What are the direct and indirect costs of a traditional ‘Ops’ team?

The direct cost is that the team scales linearly with the product due to manual work. The indirect cost is that separate ‘dev’ and ‘ops’ teams have different goals, which leads to conflict between the teams

New cards

How much of an SRE’s time should be spent doing manual or ‘Ops’ work? Why?

At most 50% of an SRE’s time should be spent on manual work. This ensures that at least 50% of their time is spent improving the system to reduce manual work.

New cards

What is the difference between an automatic and automated system?

An automated system is one that can resolve an issue but requires manual intervention to start this process, e.g. starting an automated script when a certain failure occurs. An automatic system is one that can trigger this process automatically when the failure conditions are met and doesn’t require any manual intervention

New cards

What actions should be taken if SREs are spending more than 50% of their time on manual work?

Either additional SREs should be added to the team without assigning additional operational responsibilities or manual work should be offloaded to the development team

New cards

What are the eight main responsibilities of SRE teams?

Availability
Latency
Performance
Efficiency
Change Management
Monitoring
Emergency Response
Capacity Planning

New cards

What is the benefit of offloading manual work to development teams if SREs go above the 50% cap?

This provides a feedback mechanism: developers are incentivised to build systems that don’t require manual intervention, since otherwise they may need to support them. It also reduces the amount of new code being deployed, which could otherwise further increase the amount of manual work

New cards

How many events should an SRE receive on an 8-12 hour on-call shift?

A maximum of two

New cards

How is an error budget calculated?

One minus the service’s availability target. e.g. a system aiming for 99.9% availability has an error budget of 0.1%
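
A minimal sketch of the calculation above, in Python. The function names and the 30-day window used to convert the budget into minutes are assumptions for illustration; they do not come from the book.

```python
def error_budget(availability_target: float) -> float:
    """Return the error budget as a fraction, e.g. 0.999 -> roughly 0.001."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    return 1.0 - availability_target


def allowed_downtime_minutes(availability_target: float, period_days: int = 30) -> float:
    """Express the budget as minutes of total downtime over an assumed period."""
    return error_budget(availability_target) * period_days * 24 * 60


print(f"{error_budget(0.999):.4%}")                   # 0.1000%
print(f"{allowed_downtime_minutes(0.999):.1f} min")   # 43.2 min
```

Under these assumptions, a 99.9% target leaves about 43 minutes of full downtime per 30 days.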

New cards

When should a monitoring system alert a human?

When a human needs to take action

New cards

What are the three types of monitoring output?

Alerts, tickets and logging

New cards

What are MTTF and MTTR?

Mean Time to Failure and Mean Time to Repair

New cards

What is the main cause of outages in a live system?

Changes. These account for ~70% of system outages

New cards

Name three strategies for preventing outages when rolling out a change

Progressive rollouts
Quickly and accurately detecting problems
Tested rollback process for when problems arise

New cards

In capacity planning what is organic and inorganic growth?

Organic growth is natural growth through gradual product adoption. Inorganic growth is sudden growth through feature launches or marketing campaigns

New cards

Why can increasing service reliability be bad for users?

Increasing reliability can cost more to develop, run and maintain. It also means fewer features can be developed and these features take longer to develop

New cards

Why might a user not notice the difference between high and extreme reliability?

If the user experience is dominated by less reliable components. e.g. a user on a 99% reliable smartphone will not notice the difference between 99.99% and 99.999% service reliability

New cards

Why do you not want to drastically exceed your reliability target?

This can come at the cost of agility or operational cost and can set unrealistic user expectations

New cards

What two ways are there to calculate availability?

Time based availability (uptime/(uptime+downtime)) and aggregate availability (successful requests/total requests)
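
The two formulas above can be sketched in Python as follows. The function names and the sample figures (one hour of downtime in a day, 500 failed requests out of 2.5 million) are made up for illustration.

```python
def time_based_availability(uptime_s: float, downtime_s: float) -> float:
    """uptime / (uptime + downtime)"""
    return uptime_s / (uptime_s + downtime_s)


def aggregate_availability(successful_requests: int, total_requests: int) -> float:
    """successful requests / total requests"""
    return successful_requests / total_requests


# One hour of downtime in a 24-hour day (hypothetical numbers).
print(f"{time_based_availability(23 * 3600, 1 * 3600):.4f}")    # 0.9583

# 500 failed requests out of 2,500,000 (hypothetical numbers).
print(f"{aggregate_availability(2_499_500, 2_500_000):.4f}")    # 0.9998
```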

New cards

Why might aggregate availability be used instead of time based?

Aggregate availability more accurately captures partial failures, which are more common in globally distributed systems that may never have a total outage. It also gives more weight to periods of high traffic.

New cards

What four factors need considering when assessing the risk tolerance of a service?

What level of availability is required
Do different failure types have different effects on the service
Service cost and revenue
What other service metrics are important

New cards

Give five questions that should be asked when setting service availability targets

What level of service do users expect?
Does the service tie directly to revenue (ours or customers)?
Is the service paid for or free?
What availability do competitors provide?
Is the service targeted at consumers or enterprise?

New cards

Why are different types of failures important to consider when identifying risk tolerance of a service? Give some examples

Some failures may have larger impacts on the service or service users. For example some services may prefer frequent partial failures to infrequent total outages, certain data corruption may be more tolerable or certain time windows may be more critical (9-5)

New cards

How may lost revenue from an outage impact availability targets?

If lost revenue exceeds the cost to make the service more reliable the availability targets should be made higher

New cards

A $1M revenue service is moving from 99.9% to 99.99% availability. Calculate the increase in revenue

Increase in availability = 99.99% - 99.9% = 0.09% = 0.0009

Increase in revenue = $1M × 0.09% = 1,000,000 × 0.0009 = $900
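
The same arithmetic as a small Python sketch, combined with the earlier rule that a higher target is justified when the lost revenue exceeds the cost of the improvement. The function names and the $500 improvement cost are hypothetical.

```python
def revenue_gain(revenue: float, current: float, proposed: float) -> float:
    """Extra revenue captured by raising availability from current to proposed."""
    return revenue * (proposed - current)


def worth_improving(revenue: float, current: float, proposed: float,
                    improvement_cost: float) -> bool:
    """True when the revenue gained exceeds the (assumed) cost of the improvement."""
    return revenue_gain(revenue, current, proposed) > improvement_cost


gain = revenue_gain(1_000_000, 0.999, 0.9999)
print(f"${gain:,.0f}")                                                  # $900
print(worth_improving(1_000_000, 0.999, 0.9999, improvement_cost=500))  # True
```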

New cards

Why is it important to identify other important metrics when defining service risk tolerance?

This helps guide decisions with tradeoffs or to take more considered risks. e.g. identifying latency targets which have high impact to the service when crossed

New cards

What is an error budget?

A metric that determines how unreliable a service is allowed to be over a period of time, usually downtime or failed queries allowed per quarter

New cards

What action should be taken when an error budget is breached?

Releases should be stopped (or at least slowed)
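
A rough sketch of how such a release gate could be expressed in Python, assuming the budget is tracked as failed requests against an aggregate availability target. The function names and request counts are hypothetical, not taken from the book.

```python
def budget_remaining(target: float, total_requests: int, failed_requests: int) -> float:
    """Failed requests still allowed in this period before the budget is breached."""
    allowed_failures = (1.0 - target) * total_requests
    return allowed_failures - failed_requests


def releases_allowed(target: float, total_requests: int, failed_requests: int) -> bool:
    """Releases continue only while some error budget is left."""
    return budget_remaining(target, total_requests, failed_requests) >= 0


# 99.9% target over 1,000,000 requests allows about 1,000 failures.
print(releases_allowed(0.999, 1_000_000, failed_requests=400))    # True
print(releases_allowed(0.999, 1_000_000, failed_requests=1_200))  # False
```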

New cards

What is the advantage of stopping releases when an error budget is breached?

It reduces the chance of further failure, encourages developers to write more stable code and allows them to spend time improving service reliability instead

New cards

What is an SLI?

A Service Level Indicator is a measure of a service that indicates the level of service provided. e.g. measuring latency or uptime

New cards

What is an SLO? Give some examples of SLOs for latency and availability

A Service Level Objective is a target value or range of values for an SLI. e.g. A service may have an SLO of latency < 100ms or availability target of > 99.9%

New cards

How should an SLI relate to an SLO in an ideal system?

A service SLI should be within the bounds of the SLO
lower SLO bound < SLI < upper SLO bound

New cards

What caused the GCP outage in October 2019?

Dependencies in Bigtable and GFS relied on Chubby always being available as it had gone so long without downtime. A temporary Chubby outage then caused errors in these components which caused a cascading failure

New cards

What is Google Chubby?

A lock service for distributed systems used by Bigtable and GFS

New cards

How did Google prevent Chubby causing another outage similar to the one in October 2019?

If Chubby drastically exceeds its availability SLO, it is deliberately taken offline to flush out any unreasonable dependencies

New cards

What is an SLA?

A Service Level Agreement is a contract with your users that includes the consequences of breaching the SLOs they contain (usually in the form of financial penalties)

New cards

What four SLIs are generally most important for user facing systems?

Availability
Latency
Throughput
Correctness

New cards

When measuring latency what is more useful, client or server latency?

Client latency as it more closely reflects the user experience (although it is more difficult to measure)

New cards

What issues may be hidden by using an average latency as an SLI?

Tail latency: a small fraction of requests taking significantly longer than the rest can be hidden by the average

New cards

Why shouldn’t an SLO be set based on current performance?

Because current performance reflects what the system happens to do rather than what users need, and adopting it as the target may require heroic effort to keep meeting it

New cards

What five principles should be used when choosing SLO targets?

Don’t pick a target based on current performance
Keep it simple
Avoid absolutes
Have as few SLOs as possible
Perfection can wait, start loose and tighten

New cards

What is an SLO safety margin?

Having a tighter internal SLO than the advertised SLO so you have room to respond to issues before they become visible externally

New cards

Why should you aim to have as few SLOs as possible?

So these SLOs accurately capture what is important to the system and can guide decision making

New cards

Name three strategies for avoiding over dependence on a system

Planned outages (Chubby)
Throttling requests
Designing the system so it isn’t faster under lighter loads

New cards

Define Mean Time to Failure

The average time between system outages

New cards

Define Mean Time to Repair

The average time to bring a system back online after an outage
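
A small sketch of how MTTR and MTTF might be computed from an outage log. It assumes each outage is recorded as a (start, end) pair in hours and measures time to failure as the uptime between the end of one outage and the start of the next; the data and function names are hypothetical.

```python
from statistics import mean

# Hypothetical outage log: (start_hour, end_hour) pairs.
outages = [(100.0, 102.0), (300.0, 301.0), (700.0, 703.0)]


def mttr(outages: list[tuple[float, float]]) -> float:
    """Mean time to repair: average outage duration."""
    return mean(end - start for start, end in outages)


def mttf(outages: list[tuple[float, float]]) -> float:
    """Mean time to failure: average uptime between consecutive outages."""
    ordered = sorted(outages)
    gaps = [next_start - prev_end
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]
    return mean(gaps)


print(mttr(outages))  # 2.0
print(mttf(outages))  # 298.5
```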

New cards

What can more than two events per 8-12 hour on call shift lead to?

More than two events can lead to pager fatigue and leave too little time to investigate problems properly or write postmortems

New cards

When should a monitoring system create a ticket?

When a human needs to take action but not immediately

New cards

What is the purpose of a monitoring system taking logs?

They may be useful in diagnosing issues in future alerts or tickets

New cards

What are some absolutes that should be avoided when setting SLOs?

“scale infinitely” or “always available”

New cards

What does keep it simple mean when setting SLOs?

Avoiding complex aggregations

New cards

What is toil? Give six attributes that help to categorize work as toil

Toil is work related to running a production system that tends to be manual, repetitive, automatable, tactical, devoid of enduring value and O(n) with service growth

New cards

Is automatable work more or less likely to be toil?

More likely

New cards

Is work that requires human judgement more or less likely to be toil?

Less likely

New cards

What is tactical work? Is it more or less likely to be toil?

Interrupt driven, reactive work such as pager events. It is more likely to be toil

New cards

What is strategic work? Is it more or less likely to be toil?

Strategic work is proactive work and is less likely to be toil

New cards

What does work being O(n) with service growth mean?

As the service grows (service size, user count, traffic volume) the amount of work scales linearly

New cards

Is work being O(n) with service growth more or less likely to be toil?

More likely

New cards

What is overhead work? Give some examples

Administrative work not tied to running a service. e.g. hiring, HR paperwork, company meetings, training

New cards

When checking if a team is spending more than 50% of their time doing toil what time frame should be used? Why?

A few quarters or a year as toil work is often ‘spiky’

New cards

What are seven disadvantages of too much toil?

Career stagnation
Low morale
Creates confusion
Slows progress
Sets precedent
Promotes attrition
Causes breach of faith

New cards

What precedent can be set if an SRE team takes on too much toil

That toil work can be offloaded to the SRE team, either directly or indirectly through code that requires toil

New cards

What is white box monitoring? Give three examples of metrics

Monitoring based on the internals of a system. e.g. number of HTTP requests, resource usage, server side latency

New cards

What is black box monitoring? Give two examples of metrics

Monitoring based on externally visible behaviour as a user would see it. e.g. page availability, client side latency

New cards

What types of problems do black and white box monitoring detect?

Black box detects active problems and white box imminent problems

New cards

What should the balance between white box and black box monitoring be?

Heavy use of white box monitoring and modest use of black box

New cards

What are the four golden monitoring signals? (one word each)

Latency
Traffic
Errors
Saturation

New cards

Why is it important to track latency of successful and failed requests separately?

Failed requests may fail very quickly or very slowly, so factoring them into your overall latency may give misleading results

New cards

In the four golden monitoring signals what is traffic measuring? Give an example metric for a web service

How much demand is being placed on your systems. e.g. HTTP requests per second

New cards

In the four golden monitoring signals what are the three categories of errors? Give an example of each

Failed requests can be explicit (HTTP 500 responses), implicit (HTTP 200 responses containing incorrect content) or by policy (response time over the SLO)
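
One way to picture the three categories is a small classifier, sketched here in Python. The function name, its parameters, and the idea of a single `content_ok` flag are assumptions for illustration; a real system would decide content correctness in its own way.

```python
from typing import Optional


def classify_error(status: int, content_ok: bool,
                   latency_ms: float, latency_slo_ms: float) -> Optional[str]:
    """Return the error category for a response, or None if it counts as a success."""
    if status >= 500:
        return "explicit"   # the server reported a failure
    if not content_ok:
        return "implicit"   # success status, but the content is wrong
    if latency_ms > latency_slo_ms:
        return "policy"     # correct, but slower than the SLO allows
    return None


print(classify_error(500, True, 20, 100))    # explicit
print(classify_error(200, False, 20, 100))   # implicit
print(classify_error(200, True, 250, 100))   # policy
print(classify_error(200, True, 20, 100))    # None
```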

New cards

In the four golden monitoring signals what does saturation measure? What metrics should be emphasized?

Saturation measures how ‘full’ your service is, emphasizing the resources the service is most constrained by

New cards

What does ‘worrying about your tail’ mean in terms of monitoring?

It’s often better to look at the distribution of a metric rather than the mean as infrequent anomalies in the metric dominate the user experience (e.g. 1% of requests taking 50x the average latency)
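
A short Python illustration of why the mean can mislead. The latency sample is made up (98 fast requests and 2 slow ones, so 2% rather than the 1% mentioned above), and `percentile` is a simple nearest-rank helper written for this sketch.

```python
from statistics import mean


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling of p*n/100
    return ordered[rank - 1]


# Hypothetical sample: 98 requests at 100 ms, 2 requests at 5000 ms.
latencies_ms = [100] * 98 + [5000] * 2

print(mean(latencies_ms))            # 198
print(percentile(latencies_ms, 50))  # 100
print(percentile(latencies_ms, 99))  # 5000
```

The mean of 198 ms describes neither the typical request (100 ms) nor the slow tail (5000 ms), which is why the distribution is more informative.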

New cards

Why is it important to use appropriate resolution when recording metrics?

If the resolution is too high then the cost to collect, store and analyze will be too high. If it is too low then it becomes impossible to detect issues.

New cards

What are metric buckets and why are they used?

Instead of recording absolute metric values, different buckets that store approximate values are used to produce a histogram. For example, measuring CPU utilization to the nearest 5% every second, incrementing the value in the corresponding bucket and at the end of the minute returning the histogram. This reduces the cost of collecting, storing and analyzing metrics.
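
The CPU example above could be sketched in Python like this. The sample values, the function names, and the use of `collections.Counter` as the histogram are assumptions for illustration.

```python
from collections import Counter


def bucket_for(value: float, width: int = 5) -> int:
    """Round a non-negative percentage to the nearest multiple of width."""
    return int(value / width + 0.5) * width


def histogram(samples: list[float], width: int = 5) -> Counter:
    """Count how many samples fall into each bucket."""
    return Counter(bucket_for(s, width) for s in samples)


# Hypothetical per-second CPU utilization readings (percent).
samples = [12.3, 14.9, 47.0, 48.2, 51.0, 97.6]

print(sorted(histogram(samples).items()))
# [(10, 1), (15, 1), (45, 1), (50, 2), (100, 1)]
```

Only the bucket counts need to be stored and shipped, rather than every raw reading.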

New cards

What five questions should be asked when creating a new monitoring rule or alert?

Is the alert urgent, actionable and active or imminently user-visible?
Can I ever ignore the alert?
Are there cases where the alert is triggered and users aren’t impacted, such as drained traffic or test deployments?
How urgent is action needed and is it a short term workaround or long term fix?
Are other people also getting paged for the same issue?

New cards

Why might you reduce the availability SLO of a service to improve availability?

It reduces the amount of effort spent on short term availability improvements (fire fighting) so more effort can be spent on long term improvements.