SRE: How Google Runs Production Systems

Flashcards for revising knowledge from Site Reliability Engineering: How Google Runs Production Systems

75 Terms

1
New cards

What are the direct and indirect costs of a traditional ‘Ops’ team?

The direct cost is that the team scales linearly with the size of the product, because the work is manual. The indirect cost is that separate ‘dev’ and ‘ops’ teams have different goals, which leads to conflict between them.

2
New cards

How much of an SRE’s time should be spent doing manual or ‘Ops’ work? Why?

At most 50% of an SRE’s time should be spent on manual work. This ensures that at least 50% of their time is spent on engineering that improves the system and reduces future manual work.

3
New cards

What is the difference between an automatic and automated system?

An automated system can resolve an issue but requires manual intervention to start the process, e.g. a human running an automated script when a certain failure occurs. An automatic system triggers that same process itself when the failure conditions are met and doesn’t require any manual intervention.
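
A minimal sketch of the distinction, with hypothetical names (restart_failed_replica and check_health are illustrative, not from the book): the remediation is automated in both cases; what differs is whether a human has to notice the failure and start it.

```python
import time

def restart_failed_replica(replica_id):
    """Automated remediation: the recovery steps are scripted end to end."""
    print(f"restarting replica {replica_id}")

# Automated but not automatic: a human sees an alert, then runs the script
# themselves, e.g. restart_failed_replica(7).
#
# Automatic: a watcher detects the failure condition and triggers the same
# remediation with no human in the loop.
def watch_and_remediate(check_health, poll_interval_s=10):
    while True:
        for replica_id in check_health():   # yields the currently unhealthy replicas
            restart_failed_replica(replica_id)
        time.sleep(poll_interval_s)
```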

4
New cards

What actions should be taken if SREs are spending more than 50% of their time on manual work?

Either additional SREs should be added to the team without assigning additional operational responsibilities or manual work should be offloaded to the development team

5
New cards

What are the eight main responsibilities of SRE teams?

  • Availability

  • Latency

  • Performance

  • Efficiency

  • Change Management

  • Monitoring

  • Emergency Response

  • Capacity Planning

6
New cards

What is the benefit of offloading manual work to development teams if SREs go above the 50% cap?

It provides a feedback mechanism: developers are incentivised to build systems that don’t require manual intervention, because otherwise they may end up supporting them. It also slows the rate at which new code is deployed, which could otherwise further increase the amount of manual work.

7
New cards

How many events should an SRE receive during an 8-12 hour on-call shift?

A maximum of two

8
New cards

How is an error budget calculated?

One minus the service’s availability target, e.g. a service aiming for 99.9% availability has an error budget of 0.1%.
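
A small sketch of the same arithmetic, also converting the budget into the full downtime it allows (assuming a 90-day quarter; the function name is illustrative):

```python
def error_budget(availability_target):
    """Error budget = 1 - availability target."""
    return 1.0 - availability_target

QUARTER_MINUTES = 90 * 24 * 60        # 129,600 minutes in a 90-day quarter

budget = error_budget(0.999)          # 0.001, i.e. 0.1%
allowed_downtime_min = budget * QUARTER_MINUTES
print(f"{allowed_downtime_min:.0f} minutes of full downtime per quarter")  # ~130
```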

9
New cards

When should a monitoring system alert a human?

When a human needs to take action

10
New cards

What are the three types of monitoring output?

Alerts, tickets and logging

11
New cards

What is MTTF and MTTR?

Mean Time to Failure and Mean Time to Repair

12
New cards

What is the main cause of outages in a live system?

Changes. These account for ~70% of system outages

13
New cards

Name three strategies for preventing outages when rolling out a change

  • Progressive rollouts

  • Quickly and accurately detecting problems

  • Tested rollback process for when problems arise

14
New cards

In capacity planning what is organic and inorganic growth?

Organic growth is natural growth through gradual product adoption. Inorganic growth is sudden growth through feature launches or marketing campaigns

15
New cards

Why can increasing service reliability be bad for users?

Increasing reliability can cost more to develop, run and maintain. It also means fewer features can be developed and these features take longer to develop

16
New cards

Why might a user not notice the difference between high and extreme reliability?

If the user experience is dominated by less reliable components, e.g. a user on a 99.9% reliable smartphone will not notice the difference between 99.99% and 99.999% service reliability.
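
A rough illustration of why (assuming independent failures and that the user needs both the phone and the service working): the end-to-end availability is the product of the two, so the extra service ‘nine’ is lost in the phone’s unreliability.

```latex
0.999 \times 0.9999 \approx 0.99890
\qquad\text{vs.}\qquad
0.999 \times 0.99999 \approx 0.99899
```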

17
New cards

Why do you not want to drastically exceed your reliability target?

This can come at the cost of agility or operational cost and can set unrealistic user expectations

18
New cards

What two ways are there to calculate availability?

Time based availability (uptime/(uptime+downtime)) and aggregate availability (successful requests/total requests)
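
The same two definitions written out as formulas (the subscripts are just labels for this card):

```latex
A_{\text{time-based}} = \frac{\text{uptime}}{\text{uptime} + \text{downtime}}
\qquad
A_{\text{aggregate}} = \frac{\text{successful requests}}{\text{total requests}}
```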

19
New cards

Why might aggregate availability be used instead of time based?

Aggregate availability more accurately captures partial failures, which are more common than total outages in globally distributed systems. It also gives more weight to periods of high traffic.
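
An illustrative calculation (made-up numbers, assuming uniform traffic): suppose 10% of requests fail for one hour in a 30-day (720-hour) month but the service never goes fully down. Time-based availability can still read 100%, while the fraction of failed requests is

```latex
0.10 \times \frac{1\ \text{h}}{720\ \text{h}} \approx 0.014\%
```

so aggregate availability for the month is roughly 99.986%.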

20
New cards

What four factors need considering when assessing the risk tolerance of a service?

  • What level of availability is required

  • Do different failure types have different effects on the service

  • Service cost and revenue

  • What other service metrics are important

21
New cards

Give five questions that should be asked when setting service availability targets

  • What level of service do users expect?

  • Does the service tie directly to revenue (ours or customers)?

  • Is the service paid for or free?

  • What availability do competitors provide?

  • Is the service targeted at consumers or enterprise?

22
New cards

Why are different types of failures important to consider when identifying risk tolerance of a service? Give some examples

Some failures have a larger impact on the service or its users than others. For example, a service may prefer frequent partial failures to infrequent total outages, certain kinds of data corruption may be more tolerable than others, or certain time windows (e.g. 9-5 business hours) may be more critical.

23
New cards

How may lost revenue from an outage impact availability targets?

If the revenue lost from outages exceeds the cost of making the service more reliable, the availability target should be raised.

24
New cards

A $1M revenue service is moving from 99.9% to 99.99% availability. Calculate the increase in revenue

Increase in availability = 99.99% - 99.9% = 0.09% = 0.0009

Increase in revenue = $1M × 0.09% = 1,000,000 × 0.0009 = $900

25
New cards

Why is it important to identify other important metrics when defining service risk tolerance?

It helps guide decisions involving tradeoffs and allows more considered risks to be taken, e.g. identifying latency targets that have a high impact on the service when crossed.

26
New cards

What is an error budget?

A metric that determines how unreliable a service is allowed to be over a period of time, usually downtime or failed queries allowed per quarter

27
New cards

What action should be taken when an error budget is breached?

Releases should be stopped (or at least slowed)

28
New cards

What is the advantage of stopping releases when an error budget is breached?

It reduces the chance of further failure, encourages developers to write more stable code and allows them to spend time improving service reliability instead

29
New cards

What is an SLI?

A Service Level Indicator is a measure of a service that indicates the level of service provided. e.g. measuring latency or uptime

30
New cards

What is an SLO? Give some examples of SLOs for latency and availability

A Service Level Objective is a target value or range of values for an SLI. e.g. A service may have an SLO of latency < 100ms or availability target of > 99.9%

31
New cards

How should an SLI relate to an SLO in an ideal system?

A service SLI should be within the bounds of the SLO
lower SLO bound < SLI < upper SLO bound
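
A minimal sketch of that check (the function name and bounds are illustrative): availability SLOs typically only have a lower bound, latency SLOs only an upper bound.

```python
def sli_meets_slo(sli_value, lower_bound=None, upper_bound=None):
    """True if the measured SLI sits within the SLO's bounds."""
    if lower_bound is not None and sli_value < lower_bound:
        return False
    if upper_bound is not None and sli_value > upper_bound:
        return False
    return True

sli_meets_slo(0.9995, lower_bound=0.999)   # availability > 99.9%  -> True
sli_meets_slo(120, upper_bound=100)        # latency < 100 ms      -> False
```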

32
New cards

What caused the GCP outage in October 2019?

Bigtable and GFS had come to rely on Chubby always being available, because it had gone so long without downtime. A temporary Chubby outage then caused errors in these components, which led to a cascading failure.

33
New cards

What is google Chubby?

A lock service for distributed systems used by Bigtable and GFS

34
New cards

How did Google prevent Chubby from causing another outage similar to the one in October 2019?

If Chubby drastically exceeds its availability SLO, it is deliberately taken offline to flush out any unreasonable dependencies.

35
New cards

What is an SLA?

A Service Level Agreement is a contract with your users that includes the consequences of breaching the SLOs it contains (usually in the form of financial penalties).

36
New cards

What four SLIs are generally most important for user facing systems?

  • Availability

  • Latency

  • Throughput

  • Correctness

37
New cards

When measuring latency what is more useful, client or server latency?

Client latency as it more closely reflects the user experience (although it is more difficult to measure)

38
New cards

What issues may be hidden by using an average latency as an SLI?

A long tail: a small fraction of requests may take significantly longer than the rest while the average still looks acceptable.

39
New cards

Why shouldn’t an SLO be set based on current performance?

Because current performance is not necessarily based on what users need, and adopting it as a target may commit you to heroic effort to keep meeting it.

40
New cards

What five principles should be used when choosing SLO targets?

  • Don’t pick a target based on current performance

  • Keep it simple

  • Avoid absolutes

  • Have as few SLOs as possible

  • Perfection can wait, start loose and tighten

41
New cards

What is an SLO safety margin?

Having an internal SLO that is tighter than the advertised SLO, so you have room to respond to issues before they become visible externally.

42
New cards

Why should you aim to have as few SLOs as possible?

So these SLOs accurately capture what is important to the system and can guide decision making

43
New cards

Name three strategies for avoiding over dependence on a system

  • Planned outages (Chubby)

  • Throttling requests

  • Designing the system so it isn’t faster under lighter loads

44
New cards

Define Mean Time to Failure

The average time between system outages

45
New cards

Define Mean Time to Repair

The average time to bring a system back online after an outage

46
New cards

What can more than two events per 8-12 hour on-call shift lead to?

More than two can lead to pager fatigue and mean that problems can’t be investigated properly or post-mortems written.

47
New cards

When should a monitoring system create a ticket?

When a human needs to take action but not immediately

48
New cards

What is the purpose of a monitoring system taking logs?

They may be useful in diagnosing issues in future alerts or tickets

49
New cards

What are some absolutes that should be avoided when setting SLOs?

“scale infinitely” or “always available”

50
New cards

What does keep it simple mean when setting SLOs?

Avoiding complex aggregations

51
New cards

What is toil? Give six attributes that help to categorize work as toil

Toil is work related to running a production system that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and O(n) with service growth.

52
New cards

Is automatable work more or less likely to be toil?

More likely

53
New cards

Is work that requires human judgement more or less likely to be toil?

Less likely

54
New cards

What is tactical work? Is it more or less likely to be toil?

Interrupt driven, reactive work such as pager events. It is more likely to be toil

55
New cards

What is strategic work? Is it more or less likely to be toil?

Strategic work is proactive work and is less likely to be toil

56
New cards

What does work being O(n) with service growth mean?

As the service grows (service size, user count, traffic volume) the amount of work scales linearly

57
New cards

Is work being O(n) with service growth more or less likely to be toil?

More likely

58
New cards

What is overhead work? Give some examples

Administrative work not tied to running a service. e.g. hiring, HR paperwork, company meetings, training

59
New cards

When checking if a team is spending more than 50% of their time doing toil what time frame should be used? Why?

A few quarters or a year as toil work is often ‘spiky’

60
New cards

What are seven disadvantages of too much toil?

  • Career stagnation

  • Low morale

  • Creates confusion

  • Slows progress

  • Sets precedent

  • Promotes attrition

  • Causes breach of faith

61
New cards

What precedent can be set if an SRE team takes on too much toil?

That toil work can be offloaded to the SRE team, either directly or indirectly through code that requires toil

62
New cards

What is white box monitoring? Give three examples of metrics

Monitoring based on the internals of a system. e.g. number of HTTP requests, resource usage, server side latency

63
New cards

What is black box monitoring? Give two examples of metrics

Monitoring based on externally visible behaviour as a user would see it. e.g. page availability, client side latency

64
New cards

What types of problems do black and white box monitoring detect?

Black-box monitoring detects active problems (symptoms users can already see); white-box monitoring can detect imminent problems.

65
New cards

What ratio of white to black box monitoring uses should there be?

Heavy use of white box monitoring and modest use of black box

66
New cards

What are the four golden monitoring signals? (one word each)

  • Latency

  • Traffic

  • Errors

  • Saturation

67
New cards

Why is it important to track latency of successful and failed requests separately?

Failed requests may fail very quickly or very slowly, so folding them into your overall latency can give misleading results.
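
A small sketch of keeping the two populations separate (the request records are hypothetical): a fast-failing error and a very slow error would both be hidden by a single combined average.

```python
from statistics import mean

# Hypothetical request records: (latency_ms, succeeded)
requests = [(120, True), (95, True), (3, False), (110, True), (4000, False)]

success_latencies = [lat for lat, ok in requests if ok]
error_latencies = [lat for lat, ok in requests if not ok]

print(f"success mean = {mean(success_latencies):.0f} ms")
print(f"error mean   = {mean(error_latencies):.0f} ms")
```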

68
New cards

In the four golden monitoring signals what is traffic measuring? Give an example metric for a web service

How much demand is being placed on your systems. e.g. HTTP requests per second

69
New cards

In the four golden monitoring signals what are the three categories of errors? Give an example of each

Failed requests can be explicit (e.g. an HTTP 500 response), implicit (e.g. an HTTP 200 response containing incorrect content) or by policy (e.g. a response time over the SLO).
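
A hedged sketch of classifying a response into those three categories (the field names and the 100 ms threshold are illustrative):

```python
SLO_LATENCY_MS = 100   # assumed latency SLO for this sketch

def classify_error(status_code, body_is_correct, latency_ms):
    """Return the error category for a response, or None if it is healthy."""
    if status_code >= 500:
        return "explicit"   # e.g. an HTTP 500
    if status_code == 200 and not body_is_correct:
        return "implicit"   # a 200 carrying the wrong content
    if latency_ms > SLO_LATENCY_MS:
        return "policy"     # succeeded, but slower than the SLO allows
    return None
```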

70
New cards

In the four golden monitoring signals what does saturation measure? What metrics should be emphasized?

Saturation measures how ‘full’ your service is, emphasizing the resource the service is most constrained by.

71
New cards

What does ‘worrying about your tail’ mean in terms of monitoring?

It’s often better to look at the distribution of a metric rather than the mean, as infrequent anomalies can dominate the user experience (e.g. 1% of requests taking 50x the average latency).
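
A tiny worked example of the 1%-at-50x case from the card: the mean barely moves, while the 99th percentile tells the real story.

```python
import statistics

latencies_ms = [10] * 99 + [500]   # 1% of requests at 50x the typical 10 ms

mean = statistics.mean(latencies_ms)                        # 14.9 ms
p99 = sorted(latencies_ms)[int(0.99 * len(latencies_ms))]   # 500 ms
print(f"mean = {mean:.1f} ms, p99 = {p99} ms")
```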

72
New cards

Why is it important to use appropriate resolution when recording metrics?

If the resolution is too high, the cost to collect, store and analyze the metrics becomes too high. If it is too low, short-lived issues become impossible to detect.

73
New cards

What are metric buckets and why are they used?

Instead of recording exact metric values, approximate values are accumulated into buckets to produce a histogram. For example, measure CPU utilization to the nearest 5% every second, increment the corresponding bucket, and at the end of each minute report the histogram. This reduces the cost of collecting, storing and analyzing metrics.
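
A minimal sketch of that example (get_cpu_utilization is a hypothetical callable returning a value in [0, 1]): one histogram per minute replaces sixty raw samples.

```python
import time
from collections import Counter

def collect_cpu_histogram(get_cpu_utilization, samples=60, interval_s=1.0):
    """Sample utilization once per interval and bucket it to the nearest 5%."""
    histogram = Counter()
    for _ in range(samples):
        bucket = round(get_cpu_utilization() * 100 / 5) * 5
        histogram[bucket] += 1
        time.sleep(interval_s)
    return histogram   # e.g. Counter({50: 40, 45: 12, 55: 8})
```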

74
New cards

What five questions should be asked when creating a new monitoring rule or alert?

  • Is the alert urgent, actionable and active or imminently user-visible?

  • Can I ever ignore the alert?

  • Are there cases where the alert is triggered and users aren’t impacted, such as drained traffic or test deployments?

  • How urgent is action needed and is it a short term workaround or long term fix?

  • Are other people also getting paged for the same issue?

75
New cards

Why might you reduce the availability SLO of a service to improve availability?

It reduces the amount of effort spent on short term availability improvements (fire fighting) so more effort can be spent on long term improvements.