Hands-On Lab

I Tested AWS DevOps Agent—Here’s What Happened When I Broke My EC2 Instance

A deliberate CPU stress test on a t2.medium, and a front-row seat as Amazon’s frontier AI agent investigates the incident in real time.

📅 March 22, 2026 ⏱ 8 min read ☁️ AWS · DevOps · AI

Everyone talks about AI transforming DevOps. But how well does it actually work when your infrastructure is on fire? I decided to find out by deliberately breaking an EC2 instance and letting AWS DevOps Agent figure out what went wrong—without any hints from me.

What is AWS DevOps Agent?

AWS DevOps Agent (public preview) is Amazon’s “frontier agent”—an autonomous AI system that investigates production incidents the way an experienced on-call engineer would. It connects to CloudWatch, logs, code repos, CI/CD pipelines, and third-party observability tools, then correlates data across all of them to find root causes. Free during preview: 20 investigation hours/month, us-east-1 only.

Key Concepts

Before diving in, here’s how DevOps Agent is organized:

Agent Space

A logical container that defines what the agent can access. You can organize spaces by application, team, or on-call rotation. Each space has its own AWS account configs, integrations, and permissions.

Telemetry

Data sources the agent queries during investigations. Built-in: CloudWatch. Add-ons: Dynatrace, Datadog, New Relic, Splunk. This is the raw observability data the agent reasons over.

Communications

Integrations for routing findings to your team—Slack channels, ServiceNow tickets, PagerDuty. The agent pushes observations, root causes, and mitigation steps through these automatically.

MCP Server

Model Context Protocol—connect your own custom tools, proprietary ticketing systems, or open-source observability (Grafana, Prometheus) so the agent can query them during investigations.

Investigation

The core workflow. Describe the incident, give a starting point and timestamp, and the agent autonomously gathers metrics, builds topology, correlates signals, and produces a root cause analysis with mitigation steps.

Prevention

Weekly evaluation that analyzes patterns across past investigations and generates proactive recommendations—covering observability gaps, infrastructure optimizations, pipeline improvements, and app resilience.

Skills

Custom runbooks or playbooks you upload to guide investigations. These act as pre-loaded hints for your specific applications, improving investigation quality for known failure modes.

Mitigation Plan

After identifying the root cause, the agent produces specific actions to resolve the incident, validate success, and revert if needed. Critically, it won’t recommend actions when it lacks sufficient evidence.

Setting Up

I created an Agent Space, connected my AWS account, and enabled the web app with an auto-created IAM role.

[Screenshot: Enabling the DevOps Agent web app with an auto-created IAM role]
[Screenshot: Capabilities page — primary AWS account connected with valid status]

The Capabilities page also lets you add Telemetry sources (Dynatrace, Datadog) and Pipeline sources (GitHub, GitLab) for deeper investigation context:

[Screenshot: Telemetry and Pipeline integration options — supports Dynatrace, Datadog, GitHub, GitLab]

Communications and MCP Server integrations let the agent route findings to Slack/ServiceNow and connect to custom tools:

[Screenshot: Communications (Slack, ServiceNow) and MCP Server for custom tool integrations]

I left all optional integrations empty to test the agent’s out-of-box capabilities. The Skills tab was also empty—no custom runbooks:

[Screenshot: Skills tab — add custom runbooks to guide investigations]

Prevention runs weekly and generates recommendations after you’ve completed investigations:

[Screenshot: Prevention tab — weekly evaluation cycle for proactive recommendations]

The Incident Response Dashboard

[Screenshot: Incident Response Dashboard — investigation input, daily frequency chart, and chat sidebar]

The interface is clean: a free-text field to describe your investigation, quick-start buttons (Latest alarm, High CPU usage, Error rate spike), and a chat sidebar for natural-language infrastructure queries.

Creating Chaos

I launched a t2.medium EC2 instance (“demo”) in us-east-1, connected via SSM Session Manager, and ran the stress utility:

```shell
stress --cpu 2 --timeout 300   # peg both vCPUs for 5 minutes
stress --cpu 1 --timeout 300   # sustained single-vCPU spike
```

[Screenshot: Running the stress commands via SSM — both vCPUs pegged first, then a single worker]
[Screenshot: stress --cpu 1 running on the t2.medium instance]
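If you want to reproduce the setup, it can be done entirely from the CLI. A sketch, assuming Amazon Linux 2 (where stress comes from the EPEL repository; Amazon Linux 2023 ships stress-ng via dnf instead), the Session Manager plugin installed locally, and an instance profile with AmazonSSMManagedInstanceCore attached:

```shell
# Open a shell on the instance via SSM — no SSH keys or open ports needed.
aws ssm start-session --target i-0e207cff4e380c809 --region us-east-1

# Inside the session (Amazon Linux 2): stress lives in EPEL.
sudo amazon-linux-extras install epel -y
sudo yum install -y stress
```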

A t2.medium has 2 vCPUs with a 20% per-vCPU baseline, earning 24 CPU credits per hour (0.4/min). Running stress pushes utilization well above baseline, burns credits faster than they accrue, and creates a clear CloudWatch signal.

The Investigation

I ran two investigations. The targeted one: “High CPU Usage on one of the EC2 Instances in us-east-1” with exact timestamp and starting points.

[Screenshot: Starting the investigation with specific EC2 context and incident timestamp]

The agent immediately found the instance, noted it launched ~15 minutes before the incident, and began gathering metrics in parallel:

[Screenshot: Agent identified i-0e207cff4e380c809 (demo, t2.medium) — launched 15 min before the incident]

What the Agent Found

CPU utilization charts showed rapid escalation: 2.8% → 28.3% → 58.5% average across three 5-minute windows. Maximum hit 100% at 14:05 UTC.

[Screenshot: Average CPU escalated from 2.8% to 58.5%; maximum hit 99.4% and 100%]

CPU credits were burning at 2.9x the earn rate—5.83 credits consumed per 5-minute window versus the 2 earned (0.4/min):

[Screenshot: CPUCreditUsage peaks correlate with CPU spikes — burning credits 2.9x faster than earned]
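Those figures check out with napkin math. One CPU credit is one vCPU at 100% for one minute, so at 0.4 credits/min the instance earns 2 credits per 5-minute window against the 5.83 it was consuming:

```shell
# Sanity-check the agent's burn/earn ratio from the reported figures.
earn_per_min=0.4      # t2.medium credit earn rate (24 credits/hour)
burn_per_5min=5.83    # observed CPUCreditUsage per 5-minute window

earn_per_5min=$(awk -v e="$earn_per_min" 'BEGIN { print e * 5 }')
ratio=$(awk -v b="$burn_per_5min" -v e="$earn_per_5min" 'BEGIN { printf "%.1f", b / e }')

echo "earned per window: $earn_per_5min credits"   # 2
echo "burn/earn ratio:   ${ratio}x"                # 2.9x
```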

But the agent didn’t stop at CPU. It correlated across multiple signals simultaneously:

Network I/O — 75MB ingress during bootstrap, dropped before CPU spike → not network-driven
EBS Disk I/O — boot activity settled before spike → disk was a result, not a cause
Status Checks — all passing, no hardware failures
CWAgent — none found, flagging a per-process visibility gap

[Screenshot: Six observations — CPU spike, credit decline, network correlation, EBS I/O, status checks, CWAgent gap]
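A rough manual equivalent of the metric pulls the agent performed: CPUUtilization at 5-minute resolution over the incident window. The instance ID comes from the investigation; the exact timestamps are my assumption based on the 14:05 UTC spike.

```shell
# Pull the same CloudWatch statistics the agent correlated, by hand.
aws cloudwatch get-metric-statistics \
  --region us-east-1 \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0e207cff4e380c809 \
  --start-time 2026-03-22T13:45:00Z \
  --end-time 2026-03-22T14:15:00Z \
  --period 300 \
  --statistics Average Maximum
```

Swapping --metric-name for CPUCreditUsage or CPUCreditBalance reproduces the credit charts; NetworkIn and the disk metrics cover the other signals the agent checked.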

The Mitigation Plan

This is where the agent showed real engineering judgment. The verdict: “No mitigation action can be identified.”

[Screenshot: Agent refused unsafe mitigation — identified three evidence gaps and provided concrete next steps]

That’s not a failure—it’s the smartest outcome. The agent found a strong correlation between SSM sessions and the CPU spike, but couldn’t safely recommend action because of three gaps:

1. SSM Session Manager logging wasn’t configured—shell commands invisible
2. CloudWatch Agent wasn’t publishing per-process metrics yet
3. IAM permissions prevented running top or ps aux via SSM RunCommand

Instead of blindly suggesting “kill the process,” the agent gave specific next steps: run ps aux --sort=-%cpu via the console, configure SSM logging, and coordinate with the root user (it even identified the IP address).
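Once that IAM gap is closed, the process snapshot the agent asked for can be taken without an interactive session, via the standard AWS-RunShellScript document. A sketch, reusing the instance ID from the investigation:

```shell
# Run the process snapshot the agent wanted but couldn't execute itself.
aws ssm send-command \
  --region us-east-1 \
  --document-name "AWS-RunShellScript" \
  --targets "Key=InstanceIds,Values=i-0e207cff4e380c809" \
  --parameters 'commands=["ps aux --sort=-%cpu | head -n 10"]' \
  --query "Command.CommandId" --output text

# Fetch the output once the command completes, using the returned ID:
aws ssm get-command-invocation \
  --region us-east-1 \
  --command-id <command-id> \
  --instance-id i-0e207cff4e380c809
```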

Resolution

After the stress test completed, the agent detected the issue resolving:

[Screenshot: CPU dropped after 14:10, credits healthy, no mutative actions — workload subsided]

CPU dropped significantly after 14:10
Credits healthy—no exhaustion
No mutative actions detected
Workload confirmed subsided

The Verdict

Autonomous correlation is the killer feature. The agent didn’t just look at CPU—it simultaneously analyzed network, disk, credits, and status checks, building a coherent timeline explaining why the CPU spiked and when it started relative to instance launch.

It thinks like an engineer. The distinction between “disk I/O was a result of the CPU process, not a cause” showed real analytical reasoning, not just threshold alerting.

It knows what it doesn’t know. The agent flagged the CWAgent gap, refused unsafe mitigation when evidence was insufficient, and told you exactly what was missing and how to get it—including the specific commands to run and the IP of the user who launched the instance.

Bottom line: AWS DevOps Agent is a genuine shift from observability dashboards to operational reasoning. It won’t replace engineers, but it dramatically compresses the time between “something is wrong” and “here’s exactly what happened.” Worth trying during the free preview.

[Screenshot: Usage after 2 investigations — 13 min of 20 hours used, 998/1000 chat requests remaining]
Tags: AWS DevOps Agent · Incident Response · CloudWatch · EC2 · AIOps · Frontier Agent