SRE Practices - Incident Management, Error Budgets & Postmortem Culture
Implement Site Reliability Engineering practices including SLOs, error budgets, on-call, and blameless postmortems
Learning Objectives
- Define meaningful SLIs and SLOs grounded in user experience
- Implement error budgets as the mechanism that balances reliability and velocity
- Build and operate an on-call rotation with runbooks and escalation policies
- Conduct blameless postmortems that produce systemic improvements
- Understand the SRE toil concept and how to systematically reduce it
Requirements
You are required to implement a complete SRE practice for the platform built throughout the Advanced level:
- SLI and SLO Definition
- Define SLIs for each critical service:
- Availability SLI: percentage of successful HTTP requests (non-5xx) over total requests
- Latency SLI: percentage of requests served in under 300ms (a 95% target on this ratio corresponds to p95 latency below 300ms)
- Error rate SLI: percentage of requests that returned an application error
- Define SLOs with justified targets:
- Choose targets based on user expectations, not arbitrary numbers
- Define the measurement window: 30-day rolling window
- Document: why this target? what happens if it is breached?
- Implement SLO tracking in Prometheus using recording rules:
  # Example: 30-day availability SLO
  sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))
- Create an SLO dashboard in Grafana with current SLO status and trend
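The availability and latency SLIs above can be precomputed with Prometheus recording rules, so dashboard and alert queries stay cheap. A minimal sketch, assuming the metric names used in the example expression above (the recorded rule names are illustrative):

```yaml
groups:
  - name: slo-recording-rules
    interval: 1m
    rules:
      # Availability SLI: fraction of non-5xx requests over a 30d rolling window
      - record: sli:availability:ratio_30d
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[30d]))
          /
          sum(rate(http_requests_total[30d]))
      # Latency SLI: fraction of requests served in under 300ms
      # (assumes a histogram with a 0.3s bucket boundary)
      - record: sli:latency:ratio_30d
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[30d]))
          /
          sum(rate(http_request_duration_seconds_count[30d]))
```

The Grafana SLO panel can then graph `sli:availability:ratio_30d` directly against the SLO target line.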
- Error Budgets
- Calculate the error budget for each SLO:
- 99.9% availability SLO → ~43.8 minutes of allowed downtime per average calendar month (43.2 minutes over the 30-day rolling window)
- Document the relationship: SLO determines budget, budget determines velocity
- Implement error budget burn rate alerts in AlertManager:
- Fast burn alert (1h): burn rate > 14.4x - page immediately
- Slow burn alert (6h): burn rate > 6x - notify team
- Budget exhausted: freeze all non-critical deployments automatically
- Create an error budget policy document:
- When budget is healthy (> 50%): normal deployment velocity
- When budget is at risk (10-50%): increase testing requirements
- When budget is exhausted (< 10%): freeze new features, focus on reliability only
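The budget and burn-rate numbers above follow directly from the SLO target; a small calculation makes the relationship explicit (function names are illustrative):

```python
def error_budget_minutes(slo_target: float, window_days: float) -> float:
    """Allowed downtime in minutes for a given SLO target over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60

def burn_rate_threshold(budget_fraction: float, window_days: float,
                        alert_window_hours: float) -> float:
    """Burn rate at which `budget_fraction` of the whole error budget
    is consumed within `alert_window_hours`."""
    return budget_fraction * (window_days * 24) / alert_window_hours

# 99.9% over a 30-day rolling window -> 43.2 minutes of budget;
# over an average calendar month (~30.44 days) -> ~43.8 minutes.
print(round(error_budget_minutes(0.999, 30), 1))     # 43.2
print(round(error_budget_minutes(0.999, 30.44), 1))  # 43.8

# Fast-burn page: 2% of the monthly budget gone within 1 hour -> 14.4x
print(round(burn_rate_threshold(0.02, 30, 1), 1))    # 14.4
# Slow-burn notify: 5% of the monthly budget gone within 6 hours -> 6x
print(round(burn_rate_threshold(0.05, 30, 6), 1))    # 6.0
```

This is where the 14.4x and 6x thresholds in the alert requirements come from.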
- On-Call Infrastructure
- Set up PagerDuty or OpsGenie (free tier):
- Create an on-call schedule with escalation policies
- Configure alert routing from AlertManager to the on-call tool
- Define escalation: primary → secondary → manager (15-minute intervals)
- Write runbooks for the top 5 most likely alerts:
- Pod crash loop: symptoms, immediate mitigation, root cause investigation, escalation criteria
- High error rate: triage steps, quick rollback procedure, customer communication template
- Disk pressure: immediate actions, data to collect, resolution steps
- Certificate expiry: renewal process, emergency steps if expired
- Database connection exhaustion: pool tuning, scaling steps, rollback
- Each runbook must be: findable (linked from the alert), actionable (specific commands), and completable by an engineer unfamiliar with the service
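The AlertManager-to-PagerDuty wiring can be sketched as a routing tree, assuming alerts carry a `severity` label; the integration key and webhook URL are placeholders you would supply from your own accounts:

```yaml
route:
  receiver: team-slack            # default: non-paging notifications
  routes:
    - matchers: ['severity="page"']
      receiver: pagerduty-oncall
      group_wait: 30s
      repeat_interval: 5m         # re-page if unacknowledged

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <PAGERDUTY_INTEGRATION_KEY>   # placeholder
  - name: team-slack
    slack_configs:
      - channel: '#alerts'
        api_url: <SLACK_WEBHOOK_URL>               # placeholder
```

Escalation from primary to secondary to manager is then configured inside PagerDuty/OpsGenie itself, not in AlertManager.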
- Incident Management
- Define severity levels for your platform:
- SEV1: full platform outage, all users affected
- SEV2: significant degradation, majority of users affected
- SEV3: partial degradation, minority of users affected
- SEV4: minor issue, no user impact
- Document the incident response process:
- Detection → Triage → Incident declaration → Mitigation → Resolution → Postmortem
- Define roles: Incident Commander, Communications Lead, Technical Lead
- Define communication SLAs: SEV1 customer update every 15 minutes
- Simulate a SEV2 incident:
- Deploy a deliberately broken version of your application
- Follow the incident response process from detection to resolution
- Document every step taken and the timeline
- Blameless Postmortem
- Write a postmortem for the simulated SEV2 incident using Google's postmortem template:
- Summary: 2-3 sentences describing what happened and its impact
- Timeline: minute-by-minute account of the incident
- Root Cause: the actual underlying cause (not the symptom)
- Contributing Factors: systemic issues that allowed this to happen
- What Went Well: practices and tools that helped
- Action Items: specific, assigned, time-bound improvements (not vague "be more careful")
- The postmortem must identify at least 3 systemic improvements
- Document the toil involved in responding to the incident:
- Manual steps taken
- How each could be automated or eliminated
- Prioritized backlog of reliability improvements
Stretch Goals
- Implement multi-window multi-burn-rate alerts following the Google SRE Workbook methodology
- Build an automated incident timeline tool that reconstructs events from logs and metrics
- Create a Game Day exercise plan to proactively discover reliability weaknesses
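For the multi-window multi-burn-rate stretch goal, each alert pairs a long window (statistical significance) with a short window (fast reset once the problem stops). A sketch of the fast-burn page, reusing the availability expression from the requirements; rule and label names are illustrative:

```yaml
groups:
  - name: slo-burn-rate-alerts
    rules:
      - alert: ErrorBudgetFastBurn
        # Fires only while BOTH the 1h and 5m error ratios exceed
        # 14.4x the budget rate (0.1% for a 99.9% SLO).
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[1h]))
                 / sum(rate(http_requests_total[1h])))
          ) > (14.4 * 0.001)
          and
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[5m]))
                 / sum(rate(http_requests_total[5m])))
          ) > (14.4 * 0.001)
        labels:
          severity: page
        annotations:
          summary: "Fast error-budget burn (>14.4x) on availability SLO"
```

A matching slow-burn rule would use 6h/30m windows with the 6x threshold and route to a ticket or Slack notification instead of a page.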
Deliverables
- SLO definitions with Prometheus recording rules and Grafana SLO dashboard
- Error budget burn rate alert configuration with policy document
- On-call schedule and escalation policy in PagerDuty/OpsGenie
- Five runbooks covering the most critical alert scenarios
- Incident severity matrix and response process documentation
- Simulated incident timeline with all steps documented
- Blameless postmortem with at least 3 systemic action items
References
Books
- Site Reliability Engineering - Google (free online)
- The Site Reliability Workbook - Google (free online)
- Seeking SRE - David Blank-Edelman
- Incident Management for Operations - Rob Schnepp et al.
Tools and Documentation
- Prometheus Alerting Rules
- Google SRE Workbook: Alerting on SLOs
- PagerDuty Incident Response Guide
- Postmortem Template - Google
- SLO Generator - Google
Once you complete this task you will operate your platform the way Google, Netflix, and Amazon operate theirs - with explicit reliability contracts, data-driven decisions about deployment velocity, and a culture that improves systems instead of blaming people.
Submit Your Solution
Completed this project? Share your solution with the community!
- Push your code to a GitHub repository
- Open an issue on our GitHub repo with your solution link
- Share on X with the hashtag #DevOpsDiary
