AWS DevOps Agent — The Complete Guide (2026 GA Launch, Pricing, Architecture, Limits)
On March 31, 2026, AWS moved DevOps Agent from preview to general availability. Not another chatbot wrapper around CloudWatch. An actual autonomous agent that triages incidents, evaluates your systems for reliability risks before they page you, and executes on-demand SRE tasks — across AWS, Azure, and on-prem environments. It reduces MTTR from hours to minutes for the kind of incidents SREs spend most of their time on. This guide covers every corner: what it does, how the agent architecture works, the full pricing model, regional availability, every integration at launch, a walkthrough of a real investigation, the limits nobody talks about, and what this actually means for DevOps careers.
If you want the incident-response fundamentals that still matter — even with agents doing the triage — read 10 Real-World DevOps Incident Scenarios — How Senior Engineers Answer first. The Agent handles step one. You still need to know steps two through ten.
What AWS DevOps Agent Actually Is
AWS DevOps Agent is an autonomous "operations teammate" — AWS's framing, and for once the marketing isn't wrong. It is a single AI agent that can do three categorically different things:
- Investigations — when something is on fire right now. Pulls metrics, logs, traces, and deployment history, forms hypotheses, and narrows the search to a likely root cause while you read the page.
- Evaluations — when nothing is on fire yet. Runs proactive reliability checks against your systems, finds the kind of latent issues that become Sunday-night pages, and produces a prioritised list of fixes.
- On-demand SRE tasks — everything in between. "Show me all pods in CrashLoopBackOff across prod clusters", "Run the runbook for the payment-service degraded state", "Correlate the 4xx spike in ALB with recent deploys".
It operates across AWS natively, Azure with first-class support added at GA, and on-prem environments through the Model Context Protocol (MCP) — meaning it can talk to tools behind your firewall through the same pattern Anthropic popularised.
The Problem This Actually Solves
Production incidents follow a depressingly consistent shape. An alert fires. Someone on-call opens CloudWatch, pulls up the dashboard, checks the deploy log, opens a terminal, starts running ad-hoc queries, joins the war room Slack channel, and spends the first 30–60 minutes doing the same things they've done for every previous incident.
That 30–60 minutes is the triage phase. It is not where the expertise lives. It is where the tedium lives. And it is where MTTR is burned.
🧠 The observation that makes the Agent valuable: the first hour of most incidents is pattern-matching against things you've seen before. "ALB 5xx + recent deploy = roll back." "DB connections maxed + traffic spike = scale pool." The Agent does the pattern-match and presents the hypothesis. You decide if the hypothesis is right and what to do.
How the Agent Architecture Actually Works
Under the hood, DevOps Agent is a classic LLM-agent pattern — a reasoning loop that plans, calls tools, observes results, and iterates — wrapped around a curated set of AWS-native and partner integrations, with MCP as the extensibility layer.
```mermaid
flowchart TB
    T[Trigger — Alert, chat invocation,<br/>scheduled evaluation]:::trig --> A[DevOps Agent<br/>Reasoning Loop]:::agent
    A --> P[Plan — what to check<br/>in what order]:::plan
    P --> TOOLS{Tool Calls}:::tools
    TOOLS --> CW[CloudWatch<br/>metrics / logs / traces]:::aws
    TOOLS --> AZ[Azure Monitor<br/>App Insights]:::ext
    TOOLS --> OBS[Datadog / Splunk /<br/>New Relic / Grafana]:::ext
    TOOLS --> CHG[GitHub / GitLab<br/>deploy history]:::ext
    TOOLS --> TIX[ServiceNow /<br/>PagerDuty]:::ext
    TOOLS --> MCP[Custom Skills<br/>via MCP — on-prem tools]:::mcp
    CW --> O[Observations]:::obs
    AZ --> O
    OBS --> O
    CHG --> O
    TIX --> O
    MCP --> O
    O --> A
    A --> R[Root-cause hypothesis +<br/>remediation recommendation]:::result
    R --> H[Human decides:<br/>approve / reject / modify]:::human
    classDef trig fill:#4A1DB5,stroke:#9B7BF7,color:#fff
    classDef agent fill:#00B893,stroke:#00D4AA,color:#0F0F1A
    classDef plan fill:#1A1A2E,stroke:#6C3CE1,color:#c4b5fd
    classDef tools fill:#2A2A3E,stroke:#6C3CE1,color:#fff
    classDef aws fill:#6C3CE1,stroke:#9B7BF7,color:#fff
    classDef ext fill:#4A1DB5,stroke:#9B7BF7,color:#fff
    classDef mcp fill:#047857,stroke:#00D4AA,color:#fff
    classDef obs fill:#2A2A3E,stroke:#6C3CE1,color:#c4b5fd
    classDef result fill:#00B893,stroke:#00D4AA,color:#0F0F1A
    classDef human fill:#6C3CE1,stroke:#9B7BF7,color:#fff
```
The Reasoning Loop
The agent doesn't run a fixed script. For each task, it:
- Forms a plan — "this looks like an ALB 5xx spike; start with target-group health, then check recent deploys, then look at DB connection count."
- Calls tools — each check is a structured call to CloudWatch, a partner integration, or a custom MCP skill.
- Observes the results — feeds them back into context and reconsiders the plan.
- Iterates until a hypothesis converges or the confidence bar is hit — at which point it writes up what it found and recommends a next step.
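AWS hasn't published the agent's internals, but the loop described above is a standard agent pattern and is easy to sketch. A minimal illustration in Python; every tool name, finding, and threshold here is invented for the example, not taken from any DevOps Agent API:

```python
# Minimal sketch of the plan -> call tools -> observe -> iterate pattern.
# All names and findings are illustrative, not the DevOps Agent API.
from dataclasses import dataclass

@dataclass
class Observation:
    tool: str
    finding: str
    suspicious: bool

# Stand-ins for real tool calls (CloudWatch, GitHub integration, RDS checks).
def check_target_health(ctx):
    return Observation("alb", "2/6 targets unhealthy", True)

def check_recent_deploys(ctx):
    return Observation("github", "deploy 13 min before alert", True)

def check_db_connections(ctx):
    return Observation("rds", "connections at 40% of max", False)

PLAYBOOK = [check_target_health, check_recent_deploys, check_db_connections]

def investigate(ctx, confidence_bar=2):
    """Run checks in planned order; stop once enough suspicious signals converge."""
    observations = []
    for tool_call in PLAYBOOK:           # "plan": an ordered list of checks
        obs = tool_call(ctx)             # "call tools"
        observations.append(obs)         # "observe"
        hits = [o for o in observations if o.suspicious]
        if len(hits) >= confidence_bar:  # "iterate until the confidence bar is hit"
            return {"hypothesis": "recent deploy degraded target health",
                    "evidence": [o.finding for o in hits]}
    return {"hypothesis": "inconclusive",
            "evidence": [o.finding for o in observations]}

result = investigate({"service": "payment-service"})
```

The real agent re-plans between calls rather than walking a fixed list; the sketch keeps only the converge-then-write-up shape.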
The MCP Extensibility Layer
The Model Context Protocol support is the underrated piece. MCP lets you expose any internal tool — a legacy deploy system, a homegrown feature-flag service, an on-prem database admin console — as a "skill" the agent can call. You write the MCP server wrapper once; the agent discovers the capability at runtime.
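The wrapper pattern is worth seeing concretely. Below is a stdlib-only sketch of the shape — a registry of skills the agent can discover at runtime — and deliberately not the real MCP SDK or wire protocol (a production server speaks MCP's JSON-RPC transport). The skill name and payload are invented:

```python
# Stdlib-only sketch of the MCP server pattern: register an internal tool
# once, let the agent discover and call it at runtime. This shows the
# shape only; a real server would use the MCP SDK and its JSON-RPC transport.
import json

TOOLS = {}

def skill(name, description):
    """Register an internal tool as an agent-callable skill."""
    def wrap(fn):
        TOOLS[name] = {"description": description, "handler": fn}
        return fn
    return wrap

# Hypothetical on-prem tool, wrapped once, discoverable thereafter.
@skill("legacy_deploy_status", "Last deploy status from the internal deploy system")
def legacy_deploy_status(service: str) -> dict:
    return {"service": service, "last_deploy": "2026-03-31T01:47Z", "status": "ok"}

def list_tools() -> str:
    """What the agent sees when it asks the server for its capabilities."""
    return json.dumps({name: t["description"] for name, t in TOOLS.items()})

def call_tool(name: str, **kwargs) -> dict:
    return TOOLS[name]["handler"](**kwargs)
```

The point of the pattern: the agent never hardcodes your tools. It asks `list_tools()` at runtime, so adding a skill requires no change on the agent side.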
This is how DevOps Agent manages to be useful in environments that are not 100% AWS-native, which is most real environments.
The Three Modes in Detail
1. Investigations
The investigation flow is what gets shown in every demo, and for good reason — it's where the value is most visible. A CloudWatch alarm fires. The agent is triggered automatically (or you mention it in Slack with a question). It:
- Pulls the alarm context and the underlying metric
- Checks dependencies — database CPU, upstream/downstream service health, network reachability
- Correlates with recent deployments from GitHub / GitLab / CodePipeline
- Reads structured logs, applies log anomaly detection patterns
- Writes up a "here's what I think happened and why" summary with links to every piece of evidence
The output is not a fix applied. The output is a hypothesis. You approve the recommended action (usually a rollback or a scale-up), or you reject it and do your own thing.
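The deploy-correlation step is simple enough to sketch. This is an illustration of the check, not the agent's implementation; the timestamps and window are invented:

```python
# Sketch: correlate an alert with recent deploys. "Deploy shortly before
# alert" is the highest-prior hypothesis, so it's checked first.
from datetime import datetime, timedelta

def correlated_deploys(alert_time, deploys, window_minutes=30):
    """Return deploys landing within `window_minutes` before the alert,
    most recent first -- candidates for a rollback recommendation."""
    window = timedelta(minutes=window_minutes)
    hits = [d for d in deploys
            if timedelta(0) <= alert_time - d["at"] <= window]
    return sorted(hits, key=lambda d: d["at"], reverse=True)

alert = datetime(2026, 3, 31, 2, 0)
deploys = [
    {"service": "payment-service", "at": datetime(2026, 3, 31, 1, 47)},  # 13 min prior
    {"service": "payment-service", "at": datetime(2026, 3, 30, 22, 10)},  # hours prior
]
suspects = correlated_deploys(alert, deploys)
```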
2. Evaluations
Evaluations are the less-discussed but arguably more useful mode. Scheduled or on-demand runs where the agent proactively looks for latent reliability issues:
- Workloads with no rollback configured in CodeDeploy
- ALBs whose health-check intervals and thresholds aren't tuned to the application's startup time
- RDS instances approaching storage exhaustion in the next 90 days at current growth rate
- EKS clusters where critical workloads have no resource requests, making evictions unpredictable
- Lambda functions with over-provisioned memory (cost) or timeouts set too low (reliability)
The output is a triaged list: issue, impact estimate, recommended fix, link to the evidence.
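One of those checks is easy to sketch. The projection below is our illustration of the storage-exhaustion check, not AWS's implementation; a real version would read CloudWatch's FreeStorageSpace metric rather than hardcoded numbers:

```python
# Sketch of one evaluation check: project RDS free storage forward at the
# current growth rate and flag instances that exhaust within the horizon.
# Instance data is made up for illustration.

def days_until_full(free_gb: float, daily_growth_gb: float) -> float:
    if daily_growth_gb <= 0:
        return float("inf")        # not growing: no exhaustion risk
    return free_gb / daily_growth_gb

def evaluate_storage(instances, horizon_days=90):
    findings = []
    for inst in instances:
        d = days_until_full(inst["free_gb"], inst["daily_growth_gb"])
        if d <= horizon_days:
            findings.append({
                "instance": inst["id"],
                "days_left": round(d),
                "fix": "enable storage autoscaling or archive old data",
            })
    # Triage: soonest exhaustion first
    return sorted(findings, key=lambda f: f["days_left"])

report = evaluate_storage([
    {"id": "orders-db", "free_gb": 120, "daily_growth_gb": 2.0},  # ~60 days out
    {"id": "audit-db",  "free_gb": 900, "daily_growth_gb": 1.0},  # ~900 days out
])
```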
3. On-Demand SRE Tasks
The "chat with your infra" mode. This is where you DM the agent something like:
- "What was deployed to payment-service in the last 2 hours?"
- "Which pods are restarting in eks-prod-01?"
- "Show me all EC2 instances with high burst credit usage"
- "Run the payment-degraded runbook and walk me through the steps"
The value here is not replacing a skilled SRE. It is compressing the "where is the right dashboard / which CLI command / where is the runbook doc" overhead that eats 40% of a senior SRE's day.
Multicloud — Real, Not Marketing
A lot of AWS launches claim "multicloud" and mean "you can run our thing on EC2 and then point it at Azure." DevOps Agent is different. Azure support was in the GA launch — not a future roadmap item — and the integration depth is meaningful:
- Azure Monitor metrics and alerts as first-class data sources
- Application Insights correlation
- Azure DevOps for deploy history
- Cross-cloud incident correlation (application in AWS calling an API in Azure — the agent follows the trace across both)
On-prem support is via MCP, which is the right decision. Rather than force a weird AWS-hosted agent into your corporate network, AWS lets you expose your on-prem tools through an MCP server that the cloud-side agent calls. Security stays on your side.
Pricing — The Breakdown Most Posts Skip
The pricing model is per-second billing for active agent time. You are charged only for the seconds the agent is actually working on your task — not for the time it sits idle waiting for the next page.
✅ New customer free trial: 2 months from your first task. Each month includes up to 10 agent spaces, 20 hours of investigations, 15 hours of evaluations, and 20 hours of on-demand SRE tasks. That is enough to pilot it against a real service and see if it pays off before spending a dollar.
| Dimension | Billing basis | Trial allowance / month |
|---|---|---|
| Investigations | Per-second, only while actively running | 20 hours |
| Evaluations | Per-second, only during evaluation | 15 hours |
| On-demand SRE tasks | Per-second, only during task execution | 20 hours |
| Agent spaces | Organisational unit for access + config | 10 spaces |
| Idle time | Not billed | Unlimited |
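Some quick arithmetic on what the trial allowance actually buys. The 3-minute average active investigation time is our assumption (roughly what the walkthrough later in this guide shows), not an AWS figure:

```python
# Back-of-envelope: how many investigations does the monthly trial
# allowance cover? The average active time per run is an assumption.

TRIAL_INVESTIGATION_HOURS = 20   # per month, per the trial terms
AVG_INVESTIGATION_MINUTES = 3    # assumption: typical active agent time per run

def investigations_covered(trial_hours=TRIAL_INVESTIGATION_HOURS,
                           avg_minutes=AVG_INVESTIGATION_MINUTES):
    """Number of agent-run investigations the trial allowance covers."""
    return int(trial_hours * 60 / avg_minutes)

n = investigations_covered()     # 20 h of active time at ~3 min per run
```

At that rate the trial covers hundreds of investigations a month, which is far more than a single-service pilot will generate; the trial ceiling you're likelier to hit is evaluation hours.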
AWS Support customers receive monthly DevOps Agent credits scaled to their support tier — Developer, Business, and Enterprise tiers each get a larger baseline of free agent hours, which is AWS's pragmatic way of saying "if you're already paying us for support, the agent replaces some of that."
⚠️ Cost-control gotcha: evaluations can run indefinitely if you schedule them across many services. Set a budget alert on DevOps Agent spend and scope evaluations to your highest-revenue or highest-risk services first. You don't need the agent scanning your dev sandbox every hour.
Regions at GA
Six regions at launch, covering the three big geographies AWS cares about:
| Geography | Region | Region code |
|---|---|---|
| North America | US East (N. Virginia) | us-east-1 |
| North America | US West (Oregon) | us-west-2 |
| Europe | Europe (Frankfurt) | eu-central-1 |
| Europe | Europe (Ireland) | eu-west-1 |
| Asia Pacific | Asia Pacific (Sydney) | ap-southeast-2 |
| Asia Pacific | Asia Pacific (Tokyo) | ap-northeast-1 |
Notable absences: Mumbai (ap-south-1), São Paulo, London. Expect those in subsequent rollouts, but if your workload is subject to Indian data-residency requirements, you'll need to wait; routing through Tokyo won't satisfy a residency rule.
Integrations at Launch
The integration story is unusually strong for a day-one release:
Observability & Monitoring
- Amazon CloudWatch (native)
- Datadog
- Dynatrace
- New Relic
- Splunk
- Grafana
- Azure Monitor + Application Insights
Code & Deploy
- GitHub
- GitLab
- Azure DevOps
Incident & Ticketing
- ServiceNow
- PagerDuty
- Slack (interactive)
Custom (via MCP)
- Any tool you can wrap in an MCP server — internal APIs, databases, on-prem monitoring, homegrown dashboards
A Real Investigation Walkthrough
Imagine it's 2 AM. PagerDuty fires on payment-service high error rate. DevOps Agent is wired to that alert. Here is what happens without you touching anything:
- 0:00–0:10 — Agent pulls alert context. Identifies affected service = payment-service. Baseline error rate = 0.08%. Current = 4.2%. Delta is significant.
- 0:10–0:45 — Checks recent deploys via GitHub integration. Finds deploy at 01:47 (13 min before alert). High-confidence correlation.
- 0:45–1:20 — Pulls diff of deploy. Notes a change in the retry policy for the upstream fraud-check API — timeout was lowered from 3s to 800ms.
- 1:20–2:10 — Correlates against Datadog traces. Confirms fraud-check p95 is 1.4s, meaning the new timeout fires on ~60% of requests.
- 2:10–2:40 — Writes up: "High-confidence root cause — deploy at 01:47 lowered fraud-check timeout below upstream p95. Recommended action: rollback. Risk of waiting: revenue impact scales with traffic."
- 2:40 — Posts the writeup into Slack with a rollback button. You wake up, read for 30 seconds, approve the rollback.
What happened here: the agent did the 40-minute part in under 3 minutes. You made the call. MTTR dropped from 45+ minutes to under 5.
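The agent's key inference is checkable with a few lines: a client timeout set below the upstream's latency distribution fails a predictable fraction of requests. The latency samples below are synthetic, shaped to match the walkthrough's numbers (p95 near 1.4 s):

```python
# Sketch of the check behind the walkthrough's key finding: what fraction
# of requests would exceed a given client timeout? Samples are synthetic.

def timeout_failure_fraction(latencies_ms, timeout_ms):
    """Fraction of observed requests slower than the timeout."""
    slow = sum(1 for latency in latencies_ms if latency > timeout_ms)
    return slow / len(latencies_ms)

# Synthetic fraud-check latencies: p95 around 1.4 s, so an 800 ms timeout
# fires on well over half of them, while the old 3 s timeout never did.
latencies = [600] * 4 + [1000] * 3 + [1300] * 2 + [1400] * 1

old_frac = timeout_failure_fraction(latencies, 3000)  # old 3 s timeout
new_frac = timeout_failure_fraction(latencies, 800)   # new 800 ms timeout
```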
DevOps Agent vs Q Developer vs CloudWatch Investigations
There is real confusion about what fits where. Here is the honest breakdown:
| Tool | Designed for | When to reach for it |
|---|---|---|
| Amazon Q Developer | Code — generation, reviews, unit tests, docs, migrations | Writing or modifying application code in an IDE or repo |
| AWS DevOps Agent | Operations — incident triage, reliability evaluation, SRE tasks | Anything that happens after code is running in prod |
| CloudWatch Investigations (legacy) | Single-metric drill-down and anomaly detection | Manual investigation of a specific metric where you already know the scope |
| Traditional on-call + dashboards | Novel incidents, cross-team coordination, business decisions | When the problem is genuinely new or requires org alignment |
Q Developer and DevOps Agent are complementary — Q writes the retry logic, DevOps Agent tells you that your retry logic is eating your error budget in prod.
The Real Risks and Limitations
Every vendor page on DevOps Agent makes it sound perfect. It is not. Here are the honest limitations worth knowing before you build critical workflows around it.
⚠️ Hallucinated correlations. LLM agents are pattern-matchers. If your recent deploy is coincidental with an incident caused by an upstream DNS failure, the agent will still fixate on the deploy because that pattern is over-represented in its training. Treat hypotheses as leads, not conclusions.
⚠️ Blast radius of automated actions. If you wire the agent to auto-execute remediations (instead of only recommending them), you need tight guardrails. A confidently wrong rollback at 3 AM is still a rollback. Start with human-approved-only workflows for at least the first quarter.
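A sketch of what "human-approved-only" can look like in code: remediations are queued as proposals against an allowlist, and nothing executes without an explicit approval. The class, action names, and allowlist are illustrative, not part of any DevOps Agent API:

```python
# Guardrail sketch: agent proposals go into a queue; only an explicit
# human approval moves one to execution. All names are illustrative.

REVERSIBLE = {"rollback", "scale_up"}   # only easily-undone actions allowed

class RemediationGate:
    def __init__(self):
        self.pending = []
        self.executed = []

    def propose(self, action: str, target: str, reason: str):
        """Agent-side entry point: queue a proposal, reject off-allowlist actions."""
        if action not in REVERSIBLE:
            raise ValueError(f"{action} is not on the allowlist")
        self.pending.append({"action": action, "target": target, "reason": reason})

    def approve(self, index: int, approver: str):
        """Human-side entry point: approval is what triggers execution."""
        item = self.pending.pop(index)
        item["approved_by"] = approver
        self.executed.append(item)      # a real system would call the tool here
        return item

gate = RemediationGate()
gate.propose("rollback", "payment-service", "timeout below upstream p95")
```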
⚠️ Blind spots in non-AWS systems. Azure support is real, but not yet as deep as AWS. For pure Azure workloads you may get better results from Azure-native tooling. For hybrid, DevOps Agent is best-in-class.
⚠️ Cost creep from evaluations. See the pricing section — scheduled evaluations can quietly rack up agent-hours if scoped too broadly. Always set budgets and start narrow.
⚠️ Regulated environments. Some compliance frameworks prohibit AI-driven automated decisions on production systems. Check your SOC 2, HIPAA, PCI-DSS auditor's stance before wiring the agent to anything customer-facing.
What This Means for DevOps and SRE Careers
This is the question everyone asks: does this replace SRE jobs?
Short honest answer: no, but it changes which SRE skills are valuable.
The parts of the job that the agent compresses:
- First-line triage on common incidents
- Correlation across disparate data sources
- Runbook execution for well-documented procedures
- The "which dashboard, which CLI command" overhead
The parts of the job that get more valuable, not less:
- Designing systems that are observable enough for an agent to reason about them — if the agent can't find the metric, the agent can't help
- Writing quality runbooks the agent can execute
- Building the MCP skills that expose your internal tools
- Post-incident review, capacity planning, reliability strategy
- Novel, cross-team, or genuinely ambiguous incidents — the 10% of incidents that are 90% of the institutional value
The realistic outcome over the next two years: fewer "junior on-call" hires, more "senior reliability engineer" hires. If you are early in your DevOps career, the fastest hedge is to learn how to build and integrate with agents — not to run from them.
For the broader career framing, see Platform Engineering vs DevOps vs SRE — 2026 Career Guide.
Interview Angle — The 2026 Version of "Walk Me Through an Incident"
"The site is down at 3 AM, walk me through your first 10 minutes" is still the most-asked DevOps interview question. The 2026 answer acknowledges the agent without leaning on it:
💡 Good 2026 answer: "I check the scope — is this all users or a subset, any recent deploys. I invoke the DevOps Agent on the affected service in parallel so it starts pulling metrics, deploy history, and log anomalies while I'm clarifying. Within 2–3 minutes I have both my own read and the agent's hypothesis. If they agree, I act. If they disagree, the disagreement itself is information — something weird is happening that doesn't match the common pattern."
What this signals to the interviewer: you know the tool exists, you use it to compress triage, but you don't outsource judgement to it. You understand that the agent is a fast second opinion, not a replacement for thinking.
If you want the rest of the scenario-question playbook — 30 real interview scenarios with wrong answer vs right answer — that's in the DevOps Interview Playbook ($15).
How to Actually Start Using It
If you're running AWS workloads in one of the supported regions, here's the minimal-effort pilot:
- Pick one production service. Not your most critical one. Something that pages, but where a mis-triage won't cost the company.
- Enable DevOps Agent in the region and wire it to the CloudWatch alarms and Slack channel for that service.
- Run in "recommendation only" mode. The agent proposes, humans approve every action.
- After 2–4 weeks, measure. Compare MTTR on agent-assisted incidents vs not. Look at how often the agent's hypothesis matched the post-mortem root cause.
- Scope up gradually. Only expand to more services or more automated action once you have real numbers.
This is the same pattern you'd apply to any new ops tool — the only twist is that this one is reasoning, so you need to audit the reasoning, not just the actions.
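For the measurement step of the pilot, the arithmetic is trivial; the discipline is tagging incident records consistently. A sketch with invented data:

```python
# Sketch of the pilot's MTTR comparison: median resolution time for
# agent-assisted vs unassisted incidents. Incident data is made up.
from statistics import median

def mttr_minutes(incidents, assisted: bool):
    """Median minutes-to-resolve for incidents matching the assisted flag."""
    durations = [i["resolved_min"] for i in incidents if i["agent"] == assisted]
    return median(durations) if durations else None

incidents = [
    {"agent": True,  "resolved_min": 5},
    {"agent": True,  "resolved_min": 9},
    {"agent": True,  "resolved_min": 12},
    {"agent": False, "resolved_min": 45},
    {"agent": False, "resolved_min": 38},
    {"agent": False, "resolved_min": 61},
]

assisted = mttr_minutes(incidents, True)      # median of the assisted set
unassisted = mttr_minutes(incidents, False)   # median of the unassisted set
```

Median beats mean here because one marathon incident shouldn't dominate a small pilot sample; also track how often the agent's hypothesis matched the post-mortem root cause, since a fast wrong answer isn't a win.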
Frequently Asked Questions
What is AWS DevOps Agent?
An autonomous operations teammate from AWS, GA since March 31, 2026. Handles incident investigations, proactive reliability evaluations, and on-demand SRE tasks across AWS, Azure, and on-prem environments (the last via the Model Context Protocol).
How is AWS DevOps Agent priced?
Per-second billing for active agent time. No charge when idle. A 2-month free trial for new customers includes 20h investigations, 15h evaluations, and 20h on-demand SRE tasks per month. AWS Support tiers include monthly agent credits.
What is the difference between AWS DevOps Agent and Amazon Q Developer?
Q Developer is for writing and reviewing code. DevOps Agent is for running it in production. They are complementary — Q at the IDE, DevOps Agent at the pager.
Does AWS DevOps Agent replace SRE engineers?
No. It automates the repetitive 40% of the job — first-line triage, correlation, runbook execution. The other 60% (novel incidents, observability design, reliability strategy, post-mortems) becomes more valuable, not less.
Which regions support AWS DevOps Agent?
Six regions at GA: us-east-1, us-west-2, eu-central-1, eu-west-1, ap-southeast-2, ap-northeast-1. More regions expected through 2026.
Can AWS DevOps Agent investigate Azure or on-prem workloads?
Yes. Azure support is first-class at GA. On-prem is via MCP — you expose your internal tools as MCP skills that the agent can call from AWS.
What integrations does AWS DevOps Agent support at launch?
CloudWatch, Datadog, Dynatrace, New Relic, Splunk, Grafana, GitHub, GitLab, ServiceNow, Slack, Azure Monitor, Azure DevOps, PagerDuty. Anything else is addressable via custom MCP skills.
Is it safe to let AWS DevOps Agent take automated actions?
Only with careful guardrails. Start in recommendation-only mode for at least 4–8 weeks. Wire a human approval into any remediation. Never wire it to irreversible destructive actions without a second factor.
📖 The Agent handles triage. You handle the hard parts.
DevOps Agent compresses first-line triage. The incidents that actually test an engineer are the ones the agent can't pattern-match. The DevOps Interview Playbook has 30 of those — production incidents with the wrong answer most candidates give and the right answer senior engineers give.
Get the Playbook — $15
Related Reading
- 10 Real-World DevOps Incident Scenarios — Senior vs Junior Answers
- AWS IAM In-Service Workflows — 2026 Launch
- Platform Engineering vs DevOps vs SRE — 2026 Career Guide
- Scaling 1K to 1M Users on AWS — What Breaks at Each Stage
- Prometheus + Grafana on AWS — Production Monitoring Guide
- AWS IAM Best Practices — 12 Production-Tested Rules