Google's Site Reliability Engineering Textbook
Going to read Google's Site Reliability Engineering book to learn more about increasing the reliability of a system architecture.
Preface
- It's still worth putting lightweight reliability support in place early on, because it's less costly to expand a structure later than to retrofit one that was never there.
This book is a series of essays written by members and alumni of Google's Site Reliability Engineering organization. It's much more like conference proceedings than it is like a standard book by an author or a small number of authors. Each chapter is intended to be read as a part of a coherent whole, but a good deal can be gained by reading on whatever subject particularly interests you.
Introduction
Chapter 1 - Introduction
- In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their services.
- Postmortems should be written for all significant incidents, regardless of whether or not they paged; postmortems for incidents that did not trigger a page are even more valuable, as they likely point to clear monitoring gaps.
- 100% is the wrong reliability target for basically everything. Take the following into account when establishing a reliability target:
- What level of availability will the users be happy with, given how they use the product?
- What alternatives are available to users who are dissatisfied with the product's availability?
- What happens to users' usage of the product at different availability levels?
- Monitoring should never require a human to interpret any part of the alerting domain. Three kinds of valid monitoring output:
- Alerts
- Signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation.
- Tickets
- Signify that a human needs to take action, but not immediately. The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result.
- Logging
- No one needs to look at this information, but it is recorded for diagnostic or forensic purposes. The expectation is that no one reads logs unless something else prompts them to do so.
- Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR).
- SRE has found that roughly 70% of outages are due to changes in a live system. Best practices in this domain use automation to accomplish:
- Implementing progressive rollouts
- Quickly and accurately detecting problems
- Rolling back safely when problems arise
- Demand forecasting and capacity planning can be viewed as ensuring that there is sufficient capacity and redundancy to serve projected future demand with the required availability.
- Steps to capacity planning (a rough arithmetic sketch follows at the end of this chapter's notes):
- An accurate organic demand forecast, which extends beyond the lead time required for acquiring capacity
- An accurate incorporation of inorganic demand sources into the demand forecast
- Regular load testing of the system to correlate raw capacity (servers, disks, and so on) to service capacity
- SREs provision to meet a capacity target at a specific response speed, and thus are keenly interested in a service's performance.
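To make those capacity-planning steps concrete, here is a minimal sketch of the arithmetic in Python. The function name, redundancy factor, and example numbers are all my own illustrations, not anything prescribed by the book.

```python
import math

def required_tasks(organic_peak_qps: float,
                   inorganic_qps: float,
                   qps_per_task: float,
                   redundancy_factor: float = 1.25) -> int:
    """Forecasted demand divided by per-task capacity (as measured in load
    tests), padded with headroom for failures and maintenance."""
    total_qps = organic_peak_qps + inorganic_qps   # organic forecast + inorganic events (launches, campaigns)
    raw_tasks = total_qps / qps_per_task           # load tests correlate raw capacity to service capacity
    return math.ceil(raw_tasks * redundancy_factor)

# e.g. 12,000 QPS organic peak, 3,000 QPS expected from a launch, 500 QPS per task:
print(required_tasks(12_000, 3_000, 500))  # -> 38 tasks
```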
Chapter 2 - The Production Environment at Google, from the Viewpoint of an SRE
- Terminology used throughout the book:
- Machine - a piece of hardware (or perhaps a VM)
- Server - a piece of software that implements a service
- Borg is a distributed cluster operating system, similar to Apache Mesos. Borg manages its jobs at the cluster level.
- Borg is responsible for running users' jobs, which can either be indefinitely running servers or batch processes like MapReduce.
- Jobs can consist of many identical tasks, both for reasons of reliability and because a single process can't usually handle all cluster traffic.
Our software architecture is designed to make the most efficient use of our hardware infrastructure. Our code is heavily multithreaded, so one task can easily use many cores. To facilitate dashboards, monitoring, and debugging, every server has an HTTP server that provides diagnostics and statistics for a given task.
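The notes above don't include code, but a minimal sketch of the idea - a server that exposes basic diagnostics alongside its real work - might look like the following. The /statsz path and the exported fields are made up for illustration.

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()
REQUEST_COUNT = 0
COUNT_LOCK = threading.Lock()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUEST_COUNT
        if self.path == "/statsz":
            # Diagnostics endpoint consumed by dashboards and monitoring.
            body = json.dumps({
                "uptime_seconds": round(time.time() - START_TIME, 1),
                "requests_served": REQUEST_COUNT,
            }).encode()
            content_type = "application/json"
        else:
            # The task's "real" work.
            with COUNT_LOCK:
                REQUEST_COUNT += 1
            body = b"hello\n"
            content_type = "text/plain"
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), Handler).serve_forever()
```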
Principles
This section examines the principles underlying how SRE teams typically work - the patterns, behaviors, and areas of concern that influence the general domain of SRE operations.
Chapter 3 - Embracing Risk
- At a certain point, increasing reliability is worse for a service (and its users), not better
Rather than simply maximizing uptime, Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid improvement and efficient service operations, so that users' overall happiness - with features, service, and performance - is optimized.
- For most services, the most straightforward way of representing risk tolerance is in terms of the acceptable level of unplanned downtime. Unplanned downtime is captured by the desired level of service availability, usually expressed in terms of the number of "nines" we would like to provide: 99.9%, 99.99%, or 99.999% availability. Each additional nine corresponds to an order of magnitude improvement towards 100% availability. For serving systems, this metric is traditionally calculated as the proportion of system uptime: availability = uptime / (uptime + downtime).
- Availability can also be defined in terms of request success rate: aggregate availability = successful requests / total requests. This yield-based metric is typically calculated over a rolling window.
- In a typical application, not all requests are equal: failing a new user sign-up request is different from failing a request polling for new email in the background.
- The key point in this chapter is that an error budget (1 minus the availability target) gives product development and SRE a shared, objective metric for balancing the pace of innovation against reliability.
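As a concrete illustration of the availability math above (my own sketch, not code from the book), aggregate availability and the remaining error budget over a window can be computed like this:

```python
def aggregate_availability(successful: int, total: int) -> float:
    """Yield-based availability over a window: successful / total requests."""
    return successful / total if total else 1.0

def error_budget_remaining(successful: int, total: int, slo: float = 0.999) -> float:
    """Fraction of the error budget (1 - SLO) still unspent in this window."""
    allowed_failures = (1 - slo) * total
    actual_failures = total - successful
    return 1 - actual_failures / allowed_failures if allowed_failures else 0.0

# 2,500,000 requests with 2,000 failures against a 99.9% availability target:
print(aggregate_availability(2_498_000, 2_500_000))   # 0.9992
print(error_budget_remaining(2_498_000, 2_500_000))   # 0.2 -> 20% of the budget left
```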
Chapter 4 - Service Level Objectives
- These measurements describe basic properties of metrics that matter, what values we want those metrics to have, and how we'll react if we can't provide the expected service.
- Service Level Indicators (SLIs)
- Carefully defined quantitative measure of some aspect of the level of service provided.
- Most services consider request latency - how long it takes to return a response to a request - as a key SLI
- Other commonly used SLIs include the error rate (expressed as a fraction of all requests received) and system throughput (measured in requests per second)
- Availability - the fraction of time that the service is usable (also called yield)
- Durability - the likelihood that data will be retained over a long period of time
- Service Level Objectives (SLOs)
- A target value or range of values for a service level that is measured by an SLI.
- A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound.
- Service Level Agreements (SLAs)
- An explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain. The consequences are most easily recognized when they are financial - a rebate or penalty - but they can take other forms
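To make the SLI/SLO structure above concrete, here is a minimal sketch that checks a 99th-percentile latency SLI against a target. The 300 ms threshold and the nearest-rank percentile calculation are my own simplifications for illustration.

```python
def percentile(values, pct):
    """Nearest-rank percentile - fine for a sketch, not for production metrics."""
    ordered = sorted(values)
    index = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[index]

def latency_slo_met(latencies_ms, target_ms=300, pct=99):
    """SLO of the form: 99th-percentile latency <= 300 ms."""
    sli = percentile(latencies_ms, pct)
    return sli, sli <= target_ms

latencies = [120, 90, 210, 450, 180, 95, 310, 150, 130, 170]
sli, ok = latency_slo_met(latencies)
print(f"p99 latency = {sli} ms, SLO met: {ok}")
```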
Chapter 5 - Eliminating Toil
- Toil is the kind of work tied to a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows
- Manual
- Includes work such as manually running a script that automates some task. The hands-on time that a human takes running that task is still toil time.
- Think about this for picture uploads that need to be resized (see the sketch at the end of this chapter's notes).
- Repetitive
- Automatable
- If a machine could accomplish this task just as well as a human, that task is toil. If human judgement is essential for the task, there's a good chance it's not toil.
- Tactical
- Toil is interrupt driven and reactive, rather than strategy-driven and proactive. Handling pager alerts is toil.
- No Enduring Value
- If your service remains in the same state after you have finished a task, the task is probably toil.
- O(n) Service Growth
- If the work involved in a task scales up linearly with service size, traffic volume, or user count, that task is probably toil.
- Try to keep toil less than 50% of each engineer's time
- Typical SRE activities fall into the following approximate categories:
- Software Engineering
- Involves writing or modifying code, in addition to any associated design and documentation work.
- Systems Engineering
- Involves configuring production systems, modifying configurations, or documenting systems in a way that produces lasting improvements from a one-time effort.
- Toil
- Overhead
- Administrative work not tied to running a service.
- Toil becomes toxic in large quantities. Too much toil leads to:
- Career Stagnation
- Low Morale
- Creates Confusion
- Slows Progress
- Sets Precedent
- Promotes Attrition
- Causes Breach of Faith
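Coming back to the picture-resizing note above: a minimal sketch of turning that manual step into automation, assuming Pillow is available and using a made-up directory layout (incoming/ and resized/). Run under cron (Chapter 24) or triggered by an upload event, the hands-on time disappears and the task stops counting as toil.

```python
from pathlib import Path

from PIL import Image  # assumes Pillow is installed: pip install Pillow

INCOMING = Path("incoming")   # hypothetical upload directory
RESIZED = Path("resized")     # hypothetical output directory
MAX_SIZE = (1024, 1024)

def resize_new_uploads() -> int:
    """Resize every image in INCOMING that hasn't been processed yet."""
    RESIZED.mkdir(exist_ok=True)
    count = 0
    for path in INCOMING.glob("*.jpg"):
        target = RESIZED / path.name
        if target.exists():
            continue                    # already handled on a previous run
        with Image.open(path) as img:
            img.thumbnail(MAX_SIZE)     # downscale in place, preserving aspect ratio
            img.save(target)
        count += 1
    return count

if __name__ == "__main__":
    print(f"resized {resize_new_uploads()} images")
```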
Chapter 6 - Monitoring Distributed Systems
- Definitions
- Monitoring
- Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts, processing times, and server lifetimes
- White-box Monitoring
- Monitoring based on metrics exposed by the internals of the system, including logs, interfaces like the Java Virtual Machine Profiling Interface, or an HTTP handler that emits internal statistics
- Black-box Monitoring
- Testing externally visible behavior as a user would see it
- Dashboard
- An application that provides a summary view of a service's core metrics. A dashboard may have filters, selectors, and so on, but is prebuilt to expose the metrics most important to its users. The dashboard might also display team information such as ticket queue length, a list of high-priority bugs, the current on-call engineer for a given area of responsibility, or recent pushes.
- Alert
- A notification intended to be read by a human and that is pushed to a system such as a bug or ticket queue, an email alias, or a pager. Respectively, these alerts are classified as tickets, email alerts, and pages.
- Root Cause
- A defect in a software or human system that, if repaired, instills confidence that this event won't happen again in the same way. A given incident might have multiple root causes.
- Node and Machine
- Used interchangeably to indicate a single instance of running a kernel in either a physical server, virtual machine, or container. There might be multiple services worth monitoring on a single machine.
- Push
- Any change to a service's running software or its configuration.
- Why Monitor?
- Analyze long-term trends
- Comparing over time or experiment groups
- Alerting
- Building Dashboards
- Conducting ad hoc Retrospective Analysis (Debugging)
Monitoring a complex application is a significant engineering endeavor in and of itself. Even with substantial infrastructure for instrumentation, collection, display, and alerting in place, a Google SRE team with 10-12 members typically has one or sometimes two members whose primary assignment is to build and maintain monitoring systems for their service.
- The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these:
- Latency
- The time it takes to service a request.
- Traffic
- A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by type of request (static vs dynamic content)
- Errors
- The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy.
- Note to self: look into returning non-200 status codes while still having HTMX process the returned HTML in a future project.
- Saturation
- How "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O).
- Different aspects of a system should be measured with different levels of granularity.
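A minimal sketch (my own, not from the book) of computing the four golden signals from a window of request records; the record fields and the saturation input are assumptions about what your system exposes.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Request:
    latency_ms: float
    status: int          # HTTP status code
    wrong_content: bool  # "implicit" failure: a 200 with the wrong payload

def golden_signals(requests: List[Request], window_seconds: float, utilization: float):
    """Latency, traffic, errors, and saturation over one observation window."""
    latencies = sorted(r.latency_ms for r in requests)
    p99 = latencies[int(0.99 * (len(latencies) - 1))] if latencies else 0.0
    errors = sum(1 for r in requests if r.status >= 500 or r.wrong_content)
    return {
        "latency_p99_ms": p99,                                       # Latency
        "traffic_qps": len(requests) / window_seconds,               # Traffic
        "error_rate": errors / len(requests) if requests else 0.0,   # Errors
        "saturation": utilization,                                   # e.g. fraction of memory in use
    }

sample = [Request(120, 200, False), Request(340, 500, False), Request(95, 200, True)]
print(golden_signals(sample, window_seconds=60, utilization=0.72))
```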
Chapter 7 - The Evolution of Automation at Google
- Automation doesn't just provide consistency. Designed and done properly, automatic systems also provide a platform that can be extended, applied to more systems, or perhaps spun out for profit. A platform also centralizes mistakes.
- If automation is used to resolve common faults in the system (frequent for SRE-created automation), then it reduces the mean time to repair (MTTR) for common faults.
- Time saving is an oft-quoted rationale for automation.
- Automation is the term generally used for writing code to solve a wide variety of problems
- Reliability is the fundamental feature of automation - there have been cases where automation is actively harmful.
Chapter 8 - Release Engineering
Release engineering is a relatively new and fast-growing discipline of software engineering that can be concisely described as building and delivering software. Release engineers have a solid (if not expert) understanding of source code management, build configuration languages, automated build tools, package managers, and installers.
- Running reliable services requires reliable release processes
- Release engineering should be high velocity - user-facing software is rebuilt frequently, as we aim to roll out customer-facing features as quickly as possible. We have embraced the philosophy that frequent releases result in fewer changes between versions.
- When equipped with the right tools, proper automation, and well-defined policies, developers and SREs shouldn't have to worry about releasing software. Releases can be as painless as simply pressing a button.
Chapter 9 - Simplicity
- For the majority of production software systems, we want a balanced mix of stability and agility.
The term "software bloat" was coined to describe the tendency of software to become slower and bigger over time as a result of a constant stream of additional features. While bloated software seems intuitively undesirable, its negative aspects become even more clear when considered from the SRE perspective: every line of code changed or added to a project creates the potential for introducing new defects and bugs.
- Simple releases are generally better than complicated releases. It is much easier to measure and understand the impact of a single change than a batch of changes released simultaneously.
- I guess the takeaway from this chapter is to commit often and probably figure out a way to do a staged rollout of new code on AWS.
Practices
Put simply, SREs run services - a set of related systems, operated by users, who may be internal or external - and are ultimately responsible for the health of these services. Successfully operating a service entails a wide range of activities: developing monitoring systems, planning capacity, responding to incidents, ensuring the root causes of outages are addressed, and so on.
- Monitoring
- Without monitoring, you have no way to tell whether the service is even working.
- Incident Response
- How you respond to something going wrong
- Postmortem / Root Cause Analysis
- Building a blameless postmortem culture is the first step in understanding what went wrong.
- Testing + Release Procedures
- Once we understand what tends to go wrong, our next step is attempting to prevent it. Test suites offer some assurance that our software isn't making certain classes of errors before it is released to production.
- Capacity Planning
- Development
- Product
Chapter 10 - Practical Alerting
Monitoring, the bottom layer of the Hierarchy of Production Needs, is fundamental to running a stable service. Monitoring enables service owners to make rational decisions about the impact of changes to the service, apply the scientific method to incident response, and of course ensure their reason for existence: to measure the service's alignment with business goals.
- Monitoring a very large system is challenging for a couple of reasons:
- The sheer number of components being analyzed
- The need to maintain a reasonably low maintenance burden on the engineers responsible for the system
Chapter 11 - Being On-Call
Being on-call is a critical duty that many operations and engineering teams must undertake in order to keep their service reliable and available.
- On-call engineers manage outages as they happen.
Chapter 12 - Effective Troubleshooting
- Start with a problem report telling us something is wrong with the system. Then we take a look at the telemetry and logs to understand its current state. This information, combined with our knowledge of how the system is built, how it should operate, and its failure modes, enables us to identify some possible causes.
- We can test our hypotheses in one of two ways. We can compare the observed state of the system against our theories to find confirming or disconfirming evidence. Or, in some cases, we can actively "treat" the system - that is, change the system in a controlled way - and observe the results. This second approach refines our understanding of the system's state and the possible cause(s) of the reported problems. Using either of these strategies, we repeatedly test until a root cause is identified, at which point we can take corrective action to prevent a recurrence and write a postmortem.
- Your first response to a major outage should be to make the system work as well as it can under the circumstances.
- This probably involves sending a page from CloudFront if CloudFront cannot connect to the EC2 server.
- Logging is very important here.
Chapter 13 - Emergency Response
- One trait that's vital to the long-term health of an organization, and that consequently sets that organization apart from others, is how the people involved respond to an emergency.
- You should probably create some scripts that can run when things break. For example:
- Security Incident -> Have a script that changes the security of the database, servers, and load balancers to their most strict setting (a rough sketch follows this list)
- Server Down -> Look into AWS for having multiple servers OR serve a default page from CloudFront
- Cache Down -> Look into auto scaling cache or something
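For the security-incident script idea, here is a heavily hedged boto3 sketch that strips a security group down to a single trusted CIDR. The group ID and admin range are placeholders, and a real runbook script would be tested well before an emergency.

```python
import boto3

ADMIN_CIDR = "203.0.113.0/24"        # placeholder trusted range to keep
GROUP_ID = "sg-0123456789abcdef0"    # placeholder security group

def lock_down_security_group(group_id: str) -> None:
    """Revoke every IPv4 ingress rule on the group except the trusted admin CIDR."""
    ec2 = boto3.client("ec2")
    group = ec2.describe_security_groups(GroupIds=[group_id])["SecurityGroups"][0]
    for permission in group["IpPermissions"]:
        revoke = [r for r in permission.get("IpRanges", []) if r["CidrIp"] != ADMIN_CIDR]
        if not revoke:
            continue
        perm = {"IpProtocol": permission["IpProtocol"], "IpRanges": revoke}
        if "FromPort" in permission:                  # absent for "all traffic" rules
            perm["FromPort"] = permission["FromPort"]
            perm["ToPort"] = permission["ToPort"]
        ec2.revoke_security_group_ingress(GroupId=group_id, IpPermissions=[perm])

if __name__ == "__main__":
    lock_down_security_group(GROUP_ID)
```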
Chapter 14 - Managing Incidents
- Effective incident management is key to limiting the disruption caused by an incident and restoring normal business operations as quickly as possible. If you haven't gamed out your response to potential incidents in advance, principled incident management can go out the window in real-life situations.
Chapter 15 - Postmortem Culture: Learning from Failure
- When an incident occurs, we fix the underlying issue, and services return to their normal operating conditions. Unless we have some formalized process of learning from these incidents in place, they may recur ad infinitum.
- The primary goals of writing a postmortem are to ensure that the incident is documented, that all contributing root cause(s) are well understood, and, especially, that effective preventative actions are put in place to reduce the likelihood and/or impact of recurrence.
- Don't point fingers or blame people
Chapter 16 - Tracking Outages
- Improving reliability is only possible if you start from a known baseline and can track progress.
- At Google, all alert notifications for SRE share a central replicated system that tracks whether a human has acknowledged receipt of the notification. If no acknowledgement is received after a configured interval, the system escalates to the next configured destination(s) e.g. from primary on-call to secondary.
- This could look like starting with an email then escalating to a call (with an automated process) if the acknowledgement has not happened in time.
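A minimal sketch of that escalation loop; notify and is_acknowledged are hypothetical hooks into whatever email/telephony and incident-tracking services are actually in use, and the destinations are placeholders.

```python
import time

ESCALATION_CHAIN = [
    ("email", "primary-oncall@example.com"),   # placeholder destinations
    ("call", "+1-555-0100"),                   # primary on-call phone
    ("call", "+1-555-0101"),                   # secondary on-call phone
]
ACK_TIMEOUT_SECONDS = 5 * 60

def notify(channel: str, destination: str, incident_id: str) -> None:
    """Stub: wire this up to SES/SNS/Twilio or whatever you actually use."""
    print(f"[{incident_id}] notifying {destination} via {channel}")

def is_acknowledged(incident_id: str) -> bool:
    """Stub: ask the central tracking system whether a human has acked."""
    return False

def escalate(incident_id: str) -> bool:
    """Walk the chain, waiting for an ack after each notification."""
    for channel, destination in ESCALATION_CHAIN:
        notify(channel, destination, incident_id)
        deadline = time.time() + ACK_TIMEOUT_SECONDS
        while time.time() < deadline:
            if is_acknowledged(incident_id):
                return True        # a human owns the incident; stop escalating
            time.sleep(10)
    return False                   # chain exhausted without an acknowledgement
```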
Chapter 17 - Testing for Reliability
One key responsibility of Site Reliability Engineers is to quantify confidence in the systems they maintain. SREs perform this task by adapting classical software testing techniques to systems at scale. Confidence can be measured both by past reliability and future reliability.
Types of Software Testing
Traditional Tests
- Unit Tests
- A unit test is the smallest and simplest form of software testing. These tests are employed to assess a separate unit of software, such as a class or function, for correctness independent of the larger software system that contains the unit. Unit tests are also employed as a form of specification to ensure that a function or module exactly performs the behavior required by the system.
- Integration Tests
- Software components that pass individual unit tests are assembled into larger components. Engineers run integration tests on an assembled component to verify that it functions correctly.
- System Tests
- A system test is the largest-scale test that engineers run for an undeployed system. All modules belonging to a specific component, such as a server that passed integration tests, are assembled into the system. Then the engineer tests the end-to-end functionality of the system.
- Smoke Tests
- Smoke tests, in which engineers test very simple but critical behavior, are among the simplest type of system tests. Smoke tests are also known as sanity testing, and serve to short-circuit additional and more expensive testing.
- Performance Tests
- Variant of a system test used to ensure that the performance of the system stays acceptable over the duration of its lifecycle.
- Regression Tests
- Another type of system test that involves preventing bugs from sneaking back into the codebase.
Production Tests
- These tests interact with a live production system, as opposed to running in a hermetic testing environment. These tests are in many ways similar to black-box monitoring and so are sometimes called black-box testing.
Rollouts Entangle Tests
Configuration Tests
- For each configuration file, a separate configuration test examines production to see how a particular binary is actually configured and reports discrepancies against that file. Such tests are inherently not hermetic, as they operate outside the test infrastructure sandbox.
Stress Tests
- Engineers use stress tests to find the limit of a web service. Stress tests answer questions such as:
- How full can a database get before writes start to fail?
- How many queries a second can be sent to an application server before it becomes overloaded, causing requests to fail?
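A crude, single-threaded sketch of probing the second question (real load generators are far more capable): ramp the request rate against a test instance until the error rate climbs. The URL and thresholds are placeholders; never point this at production.

```python
import time
import urllib.request

TARGET = "http://localhost:8080/"   # point at a test instance, never production

def error_rate_at(qps: int, duration_s: int = 5) -> float:
    """Send roughly `qps` requests per second and report the failure fraction."""
    failures = attempts = 0
    deadline = time.time() + duration_s
    while time.time() < deadline:
        attempts += 1
        try:
            urllib.request.urlopen(TARGET, timeout=1).close()
        except OSError:              # connection errors and timeouts both count as failures
            failures += 1
        time.sleep(1.0 / qps)
    return failures / attempts if attempts else 0.0

if __name__ == "__main__":
    for qps in (10, 50, 100, 200, 400):   # keep ramping until the service degrades
        rate = error_rate_at(qps)
        print(f"{qps} qps -> {rate:.1%} errors")
        if rate > 0.05:
            print("overload point reached")
            break
```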
Canary Tests
- The canary test is when a subset of servers is upgraded to a new version or configuration and then left in an incubation period. Should no unexpected variances occur, the release continues and the rest of the servers are upgraded in a progressive fashion.
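A minimal sketch of the evaluation step at the end of the incubation period - compare the canary subset's error rate against the rest of the fleet before continuing the rollout. The tolerance values are illustrative, not from the book.

```python
def canary_healthy(canary_errors: int, canary_requests: int,
                   baseline_errors: int, baseline_requests: int,
                   tolerance: float = 1.5) -> bool:
    """Continue the rollout only if the canary's error rate is no more than
    `tolerance` times the baseline rate (with a small absolute floor so a
    near-zero baseline doesn't block every release)."""
    canary_rate = canary_errors / max(canary_requests, 1)
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    return canary_rate <= max(baseline_rate * tolerance, 0.001)

# After the incubation period, compare the upgraded subset to the old servers:
print(canary_healthy(canary_errors=12, canary_requests=10_000,
                     baseline_errors=80, baseline_requests=90_000))   # True
```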
Chapter 19 - Load Balancing at the Frontend
- When you're dealing with large-scale systems, putting all your eggs in one basket is a recipe for disaster.
- This chapter focuses on high-level balancing - how we balance user traffic between datacenters. The following chapter zooms in to explore how we implement load balancing inside a datacenter.
- Traffic load balancing is how we decide which of the many, many machines in our datacenters will serve a particular request. Ideally, traffic is distributed across multiple network links, datacenters, and machines in an "optimal" fashion. But what does "optimal" mean in this context? There's actually no single answer, because the optimal solution depends heavily on a variety of factors:
- The hierarchical level at which we evaluate the problem (global versus local)
- The technical level at which we evaluate the problem (hardware versus software)
- The nature of the traffic we're dealing with
- The first layer of load balancing is DNS load balancing
- Virtual IP addresses (VIPs) are not assigned to any particular network interface. Instead, they are usually shared across many devices. In practice, the most important part of VIP implementation is a device called the network load balancer. The balancer receives packets and forwards them to one of the machines behind the VIP. These backends can then further process the request.
Chapter 20 - Load Balancing in the Datacenter
This chapter focuses on load balancing within the datacenter. Specifically, it discusses algorithms for distributing work within a given datacenter for a stream of queries. We cover application-level policies for routing requests to individual servers that can process them.
- For each incoming query, a client (server) must decide which backend task should handle the query.
- In an ideal case, the load for a given service is spread perfectly over all its backend tasks and, at any given point in time, the least and most loaded backend tasks consume exactly the same amount of CPU. We can only send traffic to a datacenter until the point at which the most loaded task reaches its capacity limit. During that time, the cross-datacenter load balancing algorithm must avoid sending any additional traffic to the datacenter, because doing so risks overloading some tasks.
- From a client perspective, a given backend task can be in any of the following states:
- Healthy
- Refusing Connections
- Lame Duck
- In addition to health management, another consideration for load balancing is subsetting: limiting the pool of potential backend tasks with which the client task interacts.
- Load Balancing Policies - mechanisms used by client tasks to select which backend task in its subset receives a client request. Many of the complexities in load balancing policies stem from the distributed nature of the decision making process in which clients need to decide, in real time and with limited information, which backend should be used for each request.
- Simple Round Robin
- Small subsetting
- Least-Loaded Round Robin
- Weighted Round Robin
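A minimal sketch of the first two policies above (my own illustration; the weighted variant additionally uses backend-reported load such as CPU or query rate, which I've left out):

```python
import itertools
import random
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    active_requests: int = 0   # tracked client-side for least-loaded selection

def round_robin(backends):
    """Simple Round Robin: cycle through the subset of backends in order."""
    return itertools.cycle(backends)

def least_loaded(backends):
    """Least-Loaded Round Robin: pick the backend with the fewest active
    requests, breaking ties randomly."""
    lowest = min(b.active_requests for b in backends)
    return random.choice([b for b in backends if b.active_requests == lowest])

subset = [Backend("task-0"), Backend("task-1", active_requests=3), Backend("task-2")]
rr = round_robin(subset)
print(next(rr).name, next(rr).name)   # task-0 task-1
print(least_loaded(subset).name)      # task-0 or task-2
```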
Chapter 21 - Handling Overload
- Eventually some part of your system will be overloaded. Gracefully handling overload conditions is fundamental to running a reliable serving system. One option for handling overload is to serve degraded responses: responses that are not as accurate as or that contain less data than normal responses, but are easier to compute.
- Limiting client requests
- Have a fallback plan
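For the "limiting client requests" bullet, the book's Handling Overload chapter describes client-side adaptive throttling: each client tracks how many requests it attempted versus how many the backend accepted, and starts rejecting locally once the backend stops accepting its share. A minimal sketch (window handling omitted; in practice the counters cover a sliding window such as the last couple of minutes):

```python
import random

class AdaptiveThrottle:
    """Client-side throttling: reject locally with probability
    max(0, (requests - K * accepts) / (requests + 1))."""

    def __init__(self, k: float = 2.0):   # larger K = more lenient throttling
        self.k = k
        self.requests = 0
        self.accepts = 0

    def allow(self) -> bool:
        reject_probability = max(
            0.0, (self.requests - self.k * self.accepts) / (self.requests + 1)
        )
        return random.random() >= reject_probability

    def record(self, accepted: bool) -> None:
        self.requests += 1
        if accepted:
            self.accepts += 1
```

Callers check allow() before sending a request and call record() with the backend's verdict afterwards; a client whose requests keep getting rejected ends up sending only a small trickle of probes.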
Chapter 22 - Addressing Cascading Failures
A cascading failure is a failure that grows over time as a result of positive feedback. It can occur when a portion of an overall system fails, increasing the probability that other portions of the system fail.
- Causes of Cascading Failures:
- Server Overload
- Resource Exhaustion
- CPU
- Increased number of in-flight requests
- Excessively long queue lengths
- Thread starvation
- CPU or request starvation
- Missed RPC deadlines
- Reduced CPU caching benefits
- Memory
- Dying Tasks
- Increased rate of garbage collection (GC) in Java, resulting in increased CPU usage
- Reduction in cache hit rates
- Threads
- File Descriptors
- Dependencies among resources
- Service unavailability
- Preventing Server Overload
- Load test the server's capacity limits, and test the failure mode for overload
- Serve degraded results
- Instrument the server to reject requests when overloaded
- Instrument higher-level systems to reject requests, rather than overloading servers
- Performing Capacity Planning
- Load Shedding - drops some proportion of load by dropping traffic as the server approaches overload conditions.
- Graceful Degradation - takes the concept of load shedding one step further by reducing the amount of work that needs to be performed.
- Retries
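Retries deserve care because naive retries amplify load on an already struggling backend. A minimal sketch (my own) of retrying with a capped attempt budget, exponential backoff, and jitter:

```python
import random
import time

def call_with_retries(do_request, max_attempts=3, base_delay_s=0.1):
    """Retry a failed request with exponential backoff and jitter.
    Capping attempts keeps retries from multiplying load during an overload."""
    for attempt in range(max_attempts):
        try:
            return do_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                      # budget exhausted: fail upward
            delay = base_delay_s * (2 ** attempt)          # 0.1s, 0.2s, 0.4s, ...
            time.sleep(delay * random.uniform(0.5, 1.5))   # jitter avoids synchronized retry storms
```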
Chapter 23 - Managing Critical State: Distributed Consensus for Reliability
Processes crash or may need to be restarted. Hard drives fail. Natural disasters can take out several data centers in a region. Site Reliability engineers need to anticipate these sorts of failures and develop strategies to keep systems running in spite of them. These strategies usually entail running such systems across multiple sites. Geographically distributing a system is relatively straightforward, but also introduces the need to maintain a consistent view of system state, which is a more nuanced and difficult undertaking.
Chapter 24 - Distributed Periodic Scheduling with Cron
- Cron is a common Unix utility designed to periodically launch arbitrary jobs at user-defined times or intervals.
- Cron is designed so that the system administrators and common users of the system can specify commands to run, and when these commands run. Cron executes various types of jobs, including garbage collection and periodic data analysis. The most common time specification format is called "crontab".
- Cron is usually implemented using a single component, commonly referred to as "crond". crond is a daemon that loads the list of scheduled cron jobs. Jobs are launched according to their specified execution times.
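To make the crond description concrete, here is a toy scheduler loop in Python (my own sketch, not how crond is actually written). The lambdas stand in for parsed crontab time specifications, and a real implementation has to worry about clock drift, missed minutes, and double launches.

```python
import subprocess
import time
from datetime import datetime

# Each entry: (predicate over the current time, command to launch).
JOBS = [
    (lambda now: now.minute == 0, ["echo", "hourly garbage collection"]),
    (lambda now: now.hour == 3 and now.minute == 30, ["echo", "nightly data analysis"]),
]

def run_pending(now: datetime) -> None:
    for due, command in JOBS:
        if due(now):
            subprocess.Popen(command)   # launch and don't wait, like crond

if __name__ == "__main__":
    while True:
        run_pending(datetime.now())
        time.sleep(60)                  # wake roughly once per minute, as cron does
```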
Chapter 25 - Data Processing Pipelines
- The classic approach to data processing is to write a program that reads in data, transforms it in some desired way, and outputs new data.
- Typically, the program is scheduled to run under the control of a periodic scheduling program such as cron.
- This design pattern is called a data pipeline.
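A minimal sketch of that read-transform-write pattern; the CSV column names and file paths are made up for illustration, and scheduling it under cron turns it into the periodic data pipeline described above.

```python
import csv
import json

def run_pipeline(input_path: str, output_path: str) -> None:
    """Classic one-shot pipeline: read data in, transform it, write new data out."""
    with open(input_path, newline="") as src:
        rows = list(csv.DictReader(src))

    # Transform step: keep only successful requests and convert latency to ms.
    transformed = [
        {"url": row["url"], "latency_ms": float(row["latency_s"]) * 1000}
        for row in rows
        if row["status"] == "200"
    ]

    with open(output_path, "w") as dst:
        json.dump(transformed, dst, indent=2)

if __name__ == "__main__":
    run_pipeline("requests.csv", "requests_clean.json")   # placeholder file names
```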
Chapter 26 - Data Integrity: What You Read Is What You Wrote
- Data integrity is a measure of the accessibility and accuracy of the datastores needed to provide users with an adequate level of service.
- When considering data integrity, what matters is that services in the cloud remain accessible to users. User access to data is especially important.
Management
- This section covers working together in a team, which I don't need right now.
Conclusion
- This is a great source of information. I might want to come back and read it in-depth sometime. I would definitely recommend people read it.