10 Real-World DevOps Incident Scenarios — How Senior Engineers Actually Answer
Most DevOps interview content gives you 50 generic questions like "What is CI/CD?" with textbook answers. That is useless. Real interviews in 2026 give you a scenario — "the site is down at 3 AM, walk me through your first 10 minutes" — and watch how you think. This post gives you 10 real-world scenarios with the wrong answer most candidates give, the right answer senior engineers give, the exact commands to mention, and the mental model behind the thinking.
I have been on both sides of these interviews. These 10 scenarios cover roughly 90% of what DevOps and SRE interviews actually ask. Master them and you will pass.
This post covers 10 scenarios. The full 30-scenario version with system design and K8s debugging is in The DevOps Interview Playbook ($15).
The 60-Second Framework (Use This for Every Scenario)
Before diving into specific scenarios, memorize this framework. It works for literally every scenario question:
🧠 The Framework: Clarify (10s) → Hypothesize (10s) → Investigate (30s) → Mitigate (10s)
- Clarify: "Is this all users or a subset? When did it start? Any recent deployments?"
- Hypothesize: "Based on that, the most likely causes are A, B, or C"
- Investigate: "I'd check X first because it's the most common cause"
- Mitigate: "While investigating, I'd also do Y to reduce impact immediately"
Interviewers don't expect the perfect answer. They expect structured thinking. This framework shows you have it.
Scenario 1: "The site returns 502 for all users at 3 AM"
❌ What most candidates say: "I would restart the servers and see if it fixes it."
✅ What senior engineers say: "First, I check the scope — all users or a subset? Then I check if there was a recent deployment. If someone pushed in the last hour, that's the prime suspect and I'd roll back immediately while investigating. If no deployment, I check ALB target group health — unhealthy targets cause 502s. Then database connections — if RDS is maxed on connections, every request fails. While investigating, I communicate in the incident channel."
🧠 Mental Model: Mitigate first, root-cause later. A 2-minute rollback beats a 30-minute debugging session.
💬 The one-liner: "The fastest fix is a rollback. I can root-cause after the site is back up."
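Those first checks can be sketched as shell functions. This is a hedged sketch, not a runbook: the target group ARN, DB identifier, and time window are hypothetical placeholders for your own setup.

```shell
#!/usr/bin/env bash
# 502 triage sketch. All ARNs and identifiers are hypothetical placeholders.
set -euo pipefail

check_recent_deploys() {
  # Prime suspect: anything shipped in the last hour.
  git log --since="1 hour ago" --oneline
}

check_target_health() {
  # Unhealthy ALB targets are the classic cause of 502s.
  aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --query 'TargetHealthDescriptions[?TargetHealth.State!=`healthy`]'
}

check_db_connections() {
  # Maxed-out RDS connections make every request fail. (GNU date syntax.)
  aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS --metric-name DatabaseConnections \
    --dimensions Name=DBInstanceIdentifier,Value="$1" \
    --start-time "$(date -u -d '30 minutes ago' +%FT%TZ)" \
    --end-time "$(date -u +%FT%TZ)" \
    --period 300 --statistics Maximum
}
```

In the interview, naming `describe-target-health` and the DatabaseConnections metric is usually enough; nobody expects you to recite flags.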
Scenario 2: "Latency spiked 10x but nobody deployed anything"
❌ Wrong: "If nobody deployed, I'd just wait and see if it resolves itself."
✅ Right: "No deployment doesn't mean nothing changed. I'd check: (1) Traffic spike — did request count double? (2) Database — Performance Insights for slow queries, a table might have grown past an index threshold. (3) External dependency — if we call a third-party API and they're slow, we inherit their latency. (4) Resource exhaustion — t3 CPU credit exhaustion, disk full, memory pressure. I'd look at the monitoring dashboard for what metric changed at the exact same timestamp as the latency spike — that correlation usually points directly at the cause."
🧠 Mental Model: No deployment ≠ nothing changed. Look for what DID change — traffic, data size, external dependencies, resource limits.
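A hedged sketch of checks (2) through (4): the instance ID and dependency URL are placeholders, and the CloudWatch call assumes a t3-class instance where CPU credits apply.

```shell
#!/usr/bin/env bash
# Latency-spike checks when nobody deployed. IDs and URLs are placeholders.
set -euo pipefail

check_cpu_credits() {
  # t3 credit balance near zero means the instance is throttled to baseline.
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name CPUCreditBalance \
    --dimensions Name=InstanceId,Value="$1" \
    --start-time "$(date -u -d '1 hour ago' +%FT%TZ)" \
    --end-time "$(date -u +%FT%TZ)" \
    --period 300 --statistics Average
}

time_dependency() {
  # If the third-party API is slow, we inherit its latency.
  curl -s -o /dev/null \
    -w 'dns=%{time_namelookup}s connect=%{time_connect}s total=%{time_total}s\n' \
    "$1"
}

check_local_resources() {
  df -h /    # disk full?
  free -m    # memory pressure? (Linux)
}
```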
Scenario 3: "AWS access key committed to public GitHub"
❌ Wrong: "I'd delete the key and create a new one."
✅ Right: "This is a security incident. First 5 minutes: (1) DISABLE the key in IAM — not delete, disable. Preserve for audit. (2) Check CloudTrail immediately — any unauthorized API calls? Look for CreateUser, RunInstances, especially in regions we don't use. (3) If breach confirmed, escalate to SEV-1. Next 30 minutes: create new key for the legitimate service, verify, then delete compromised key. Search entire git history for other keys. Post-incident: add pre-commit hooks with git-secrets, migrate the service from IAM user to IAM roles using OIDC federation so long-lived keys don't exist."
🧠 Mental Model: Disable, don't delete. Audit before cleanup. Fix the process, not just the key.
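The first three steps, sketched as shell functions. User names, key IDs, and the key regex are placeholders and the history scan is the crude version; a real setup would use git-secrets or a dedicated scanner.

```shell
#!/usr/bin/env bash
# Leaked-key response sketch. Names and IDs are hypothetical placeholders.
set -euo pipefail

disable_key() {
  # Disable, don't delete: preserve the key for the audit trail.
  aws iam update-access-key --user-name "$1" \
    --access-key-id "$2" --status Inactive
}

audit_key_usage() {
  # What did the key do after it leaked?
  aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=AccessKeyId,AttributeValue="$1" \
    --max-results 50
}

scan_git_history() {
  # Crude scan of the whole history for other AWS access key IDs.
  git log --all -p | grep -E 'AKIA[0-9A-Z]{16}' || echo "no other keys found"
}
```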
For the full IAM security playbook: AWS IAM Best Practices — 12 Rules
Scenario 4: "Database CPU at 95% during peak"
❌ Wrong: "Upgrade to a bigger instance."
✅ Right: "Bigger instance is the lazy answer — bad queries eat any instance. Immediate: check Performance Insights for the top CPU-consuming queries. Kill any stuck long-running queries. Short-term: add read replicas to offload 50-70% of reads. Add RDS Proxy for connection pooling. Optimize the top 3 worst queries — usually missing indexes or unoptimized joins. Medium-term: add Redis cache to absorb 80-95% of repetitive reads."
🧠 Mental Model: Fix the queries, not the hardware. A bad query on a $10K instance is still a bad query.
💬 The one-liner: "I never upgrade the instance before I know which query is eating the CPU."
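A hedged Postgres-flavored sketch of "find the query, then kill it". Host and database names are placeholders, and `pg_stat_statements` must be enabled (the column is `total_exec_time` on PostgreSQL 13+, `total_time` earlier).

```shell
#!/usr/bin/env bash
# Find and kill the CPU-eating query (Postgres). Names are placeholders.
set -euo pipefail

top_queries() {
  # Top 3 queries by cumulative execution time (needs pg_stat_statements).
  psql -h "$DB_HOST" -d "$DB_NAME" -c \
    "SELECT query, calls, total_exec_time
       FROM pg_stat_statements
      ORDER BY total_exec_time DESC LIMIT 3;"
}

kill_query() {
  # Terminate a stuck backend by pid (from pg_stat_activity).
  psql -h "$DB_HOST" -d "$DB_NAME" -c "SELECT pg_terminate_backend($1);"
}
```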
Scenario 5: "Developer needs access to production S3 buckets"
❌ Wrong: "I'd give them read access to the bucket."
✅ Right: "First question: WHY? If it's for reading logs → set up CloudWatch Logs or Athena queries instead. If it's for debugging data → create a read-only IAM role scoped to the specific bucket and prefix only, with a time-limited session via SSO and MFA required. If it's for uploading a fix → that goes through the CI/CD pipeline, not manual access. Principle: temporary, scoped, audited, revoked after."
🧠 Mental Model: Always ask WHY before granting access. The best access control is making the access unnecessary.
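"Temporary, scoped, audited, revoked after" in one command: an STS session that expires on its own. The role ARN is a hypothetical placeholder; the role itself would carry a read-only policy scoped to the specific bucket and prefix.

```shell
#!/usr/bin/env bash
# Time-limited scoped access sketch. The role ARN is a placeholder.
set -euo pipefail

grant_temp_access() {
  # One-hour read-only session; credentials self-expire, CloudTrail logs it.
  aws sts assume-role \
    --role-arn "arn:aws:iam::123456789012:role/prod-logs-readonly" \
    --role-session-name "debug-$(whoami)" \
    --duration-seconds 3600
}
```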
Scenario 6: "GuardDuty flags EC2 instances talking to crypto miners"
❌ Wrong: "Terminate the instance immediately."
✅ Right: "Do NOT terminate — you need the instance for forensics. Instead: (1) Isolate by changing the security group to no inbound/outbound rules. Network is cut but evidence is preserved. (2) Take an EBS snapshot. (3) Check CloudTrail for how it was compromised. (4) Check for lateral movement to other instances. (5) Launch clean replacement from a known-good AMI. (6) Rotate ALL credentials the compromised instance had access to."
🧠 Mental Model: Isolate, don't destroy. Evidence first, cleanup second.
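Steps (1) and (2) as a sketch. The quarantine security group (no inbound, no outbound rules) is assumed to exist already, and IDs are placeholders.

```shell
#!/usr/bin/env bash
# Forensic isolation sketch. SG, instance, and volume IDs are placeholders.
set -euo pipefail

isolate_instance() {
  # Swap to a pre-created quarantine SG with no inbound or outbound rules:
  # network is cut, but the instance and its evidence are preserved.
  aws ec2 modify-instance-attribute \
    --instance-id "$1" --groups "$QUARANTINE_SG_ID"
}

snapshot_evidence() {
  # Freeze the disk state before anything else changes.
  aws ec2 create-snapshot --volume-id "$1" \
    --description "forensics: compromised instance $(date -u +%F)"
}
```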
Scenario 7: "Canary deployment shows 0.5% higher error rate"
❌ Wrong: "0.5% is small, let's proceed with the rollout."
✅ Right: "Context matters. Is the baseline 0.1% errors? Then 0.6% is a 6x increase — significant. Is it 5%? Then 5.5% is noise. If significant: check what type of errors — 500s vs timeouts vs 4xx. Check which endpoints are affected. Roll back the canary. At 1% traffic, rollback cost is near zero. Investigate offline."
💬 The one-liner: "If in doubt, roll back. Canary rollback cost = zero. Bad full rollout cost = outage."
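The "context matters" arithmetic is trivial but worth making explicit; the numbers below are the hypothetical ones from the scenario.

```shell
# How many times the baseline is the canary error rate?
baseline=0.1   # % errors on the stable fleet
canary=0.6     # % errors on the canary
ratio=$(awk -v b="$baseline" -v c="$canary" 'BEGIN { printf "%.1f", c / b }')
echo "canary error rate is ${ratio}x baseline"   # prints 6.0x: roll back
```

A 6x relative increase at the same absolute delta that would be noise on a 5% baseline is exactly why you compare ratios, not percentage points.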
Scenario 8: "Docker builds locally but fails in CI"
❌ Wrong: "Works on my machine."
✅ Right: "Almost always an environment difference. Check: (1) Docker version — newer syntax fails on older CI Docker. (2) Files matched by .gitignore — the local build uses untracked files that a fresh CI clone never sees. (3) Network — CI behind a firewall, npm/apt can't reach registries. (4) Architecture — M1 Mac (ARM) vs CI (x86). Add --platform linux/amd64. (5) Cached layers — local reuses old layers, CI starts clean."
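Two of those checks, sketched as commands. The image tag is a placeholder; the idea is to reproduce CI conditions (clean cache, explicit platform) on your own machine.

```shell
#!/usr/bin/env bash
# "Builds locally, fails in CI" repro sketch. Image tag is a placeholder.
set -euo pipefail

repro_ci_build() {
  # Clean cache + explicit platform mimics a fresh x86 CI runner on an M1 Mac.
  docker build --no-cache --platform linux/amd64 -t myapp:ci-repro .
}

list_invisible_files() {
  # Files the local build can see but a fresh CI clone cannot.
  git status --porcelain --ignored
}
```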
Scenario 9: "SSL certificate expired in production"
❌ Wrong: "Buy a new certificate and install it."
✅ Right: "Fix depends on the cert provider. ACM: re-validate DNS record, reissue. Let's Encrypt: run certbot renew, check why the cron failed. After fixing: set up monitoring to alert 30 days before expiry. Migrate to ACM — free, auto-renewing, integrated with ALB and CloudFront. This should never happen twice."
💬 The one-liner: "The real fix is the monitoring that prevents this from ever happening again."
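The "alert 30 days before expiry" check fits in one function. A hedged sketch: the hostname is a placeholder, and the GNU `date` syntax assumes Linux.

```shell
#!/usr/bin/env bash
# Days until a host's TLS cert expires. Hostname is a placeholder.
set -euo pipefail

days_until_expiry() {
  local end
  end=$(echo | openssl s_client -connect "$1:443" -servername "$1" 2>/dev/null \
        | openssl x509 -noout -enddate | cut -d= -f2)
  echo $(( ($(date -u -d "$end" +%s) - $(date -u +%s)) / 86400 ))
}

# Wire into monitoring and alert when the result drops below 30:
# days_until_expiry example.com
```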
Scenario 10: "Only users in India are affected"
❌ Wrong: "Maybe it's a server issue, let me check all servers."
✅ Right: "Region-specific narrows it significantly. Check: (1) CDN — CloudFront Indian edge PoPs degraded? (2) DNS — Route 53 latency routing sending India to a bad origin? (3) ISP — Jio/Airtel having routing issues? Check DownDetector. Outside our control but we should communicate. (4) Region health — ap-south-1 specific resources unhealthy? Start with CDN metrics filtered by geography."
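Checks (1) and (2) as a sketch. The domain and the authoritative nameserver are hypothetical placeholders; the `x-amz-cf-pop` header is how CloudFront reports which edge PoP served a request.

```shell
#!/usr/bin/env bash
# Region-scoped triage sketch. Domain and nameserver are placeholders.
set -euo pipefail

compare_dns() {
  # Do different resolvers get different answers (latency-based routing)?
  dig +short "$1" @8.8.8.8            # public resolver
  dig +short "$1" @ns1.example.com    # authoritative, for comparison
}

check_edge() {
  # CloudFront returns the serving PoP in the x-amz-cf-pop response header.
  curl -sI "https://$1/" | grep -i 'x-amz-cf-pop' || echo "no CloudFront PoP header"
}
```

For the true India-side view you still need a probe in-region: a synthetic check from Mumbai, or at minimum CDN metrics filtered by geography.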
📖 This is 10 of 30 scenarios.
The full DevOps Interview Playbook includes 10 more system design scenarios, 10 K8s/CI/CD debugging scenarios, salary data, resume templates, and the mental models that separate junior from senior answers.
Get All 30 Scenarios — $15
Frequently Asked Questions
What are the most common DevOps interview scenario questions?
Top 5: site down at 3 AM, AWS key leaked to GitHub, database at 95% CPU, latency spike with no deployment, developer needs production access. These test structured thinking and production awareness, not tool knowledge.
How should I prepare for DevOps scenario interviews?
Practice the 60-second framework: Clarify → Hypothesize → Investigate → Mitigate. Build real projects so you have debugging experience to reference. Study these 10 scenarios — they cover 90% of what interviewers ask.
What do interviewers look for in scenario answers?
Three things: structured approach (systematic not random), production awareness (blast radius, rollback, communication), and communication (explain thinking while working). The specific answer matters less than the process.
How are DevOps interviews different from software engineering?
DevOps interviews are scenario-based conversations, not whiteboard algorithms. They test production thinking — can you debug under pressure, do you consider rollback and blast radius, can you communicate while investigating.