DevOps vs SRE

The definition of SRE (Site Reliability Engineering) sounds like DevOps, but if we consider DevOps as a philosophy, then SRE is a concrete the implementation of this philosophy.

DevOps is broken into 5 key areas:

Reduce Organization Silos
Accept Failure as Normal
Implement Gradual Change
Leverage Tooling and Automation
Measure Everything

Let’s imagine that these pillars are an interface for a class. Then implementation of SRE are:

Reduce Organization Silos
- Share ownership with developers to create shared responsibility
- SREs use the same tools that developers use, and vice versa
Accept Failure as Normal
- Blameless postmortems
- Error budget. How much the system is allowed to go out of spec.
Implement Gradual Change
- To roll out features on small about of people by reducing the cost of failure
Leverage Tooling and Automation
- Automate manual work
Measure Everything
- To measure how much toil we have
- Measuring the reliability and health of the system

SLO & SLA

SLO (Service Level Objectives) - targets which allow for a company support a specific level of reliability to balance between delivering new features and support reliability. For the most development organization is most important to build new features as fast as possible. On the other hand with that approach, we reduce the service’s reliability consequently it decreases users’ experience. Another extreme case is to maintain reliability as much as possible. Then we don’t have features in this case. So SLOs tell when reliability is enough to keep features delivering in good pace and keep reliable your system for customers.

SLA (Service Level Agreements) - agreements with your customers about reliability which you will be fined after a violation.

SLO is a good metric of happy customers. When you breach your SLO is better to start reducing it to avoid reaching SLA when you will have consequences.

SLI (Service Level Indicators) - reliability measurements. It uses for specifying you SLO and SLA metrics (i.e. API response time, latency, error rate etc.).

Error Budget

When we agreed our SLO. Let’s say we have 99.9% uptime as our SLO. 0.1% is 40.32 minutes of downtime (28 days * 24 hours * 60 minutes * 0.1%) which is Error Budget. Error budget tells how unreliable your a service is going to be. You can spend this budget as you want (updating a software, releasing new features, for hardware failure etc.)

What to improve?

Failure Scheme Take a look at the formula: Error Formula

To improve reliability we can optimize any variable in the formula:

TTD (time-to-detect)

good monitoring system
automated alerting

TTR (time-to-resolve)

developing a playbook
easy to read an application log
redirecting traffic from the failed zone

Impact

roll out features gradually

TTF (time-to-failure)

automatically direct traffic away from failure zones

There are additional approaches to improve reliability apart of formula variables.

Reports on uneven error budget spend. Find cases where the error budget not evenly distributed and focus extra efforts in these regions.
Standardize infrastructure. Some services wouldn’t work in diffident infrastructure. So it’s good to keep it similar.
Consult on system design. It can catch reliability design issues early, which is often much cheaper than fixing them after the code has been written.
Build safe release and rollbacks. Developing an automated release/rollback system decrease human errors when it really needs.
Author postmortems. It can highlight what those bugs are and create action items to get them fixed.
Use phased rollouts. You can release to production gradually and decrease possible impact on target users.

Summary

We know how to define problem space (SLOs and SLIs), can make your system is reliable enough, but no more. Error budgets are our trade-off between reliability and development velocity. SLO can be and must be aligned during the service lifetime. The relationship has to be strong to make this work.

Part 2: link
Course: link

Site Reliability Engineering: Measuring and Managing Reliability (part 1)

Table Of Content

Table of Contents

DevOps vs SRE

SLO & SLA

Error Budget

What to improve?

Summary