Everyone talks about AI transforming DevOps. But how well does it actually work when your infrastructure is on fire? I decided to find out by deliberately breaking an EC2 instance and letting AWS DevOps Agent figure out what went wrong—without any hints from me.
What is AWS DevOps Agent?
AWS DevOps Agent (public preview) is Amazon’s “frontier agent”—an autonomous AI system that investigates production incidents the way an experienced on-call engineer would. It connects to CloudWatch, logs, code repos, CI/CD pipelines, and third-party observability tools, then correlates data across all of them to find root causes. Free during preview: 20 investigation hours/month, us-east-1 only.
Key Concepts
Before diving in, here’s how DevOps Agent is organized:
Agent Space
A logical container that defines what the agent can access. You organize them by application, team, or on-call rotation. Each space has its own AWS account configs, integrations, and permissions.
Telemetry
Data sources the agent queries during investigations. Built-in: CloudWatch. Add-ons: Dynatrace, Datadog, New Relic, Splunk. This is the raw observability data the agent reasons over.
Communications
Integrations for routing findings to your team—Slack channels, ServiceNow tickets, PagerDuty. The agent pushes observations, root causes, and mitigation steps through these automatically.
MCP Server
Model Context Protocol—connect your own custom tools, proprietary ticketing systems, or open-source observability (Grafana, Prometheus) so the agent can query them during investigations.
Investigation
The core workflow. Describe the incident, give a starting point and timestamp, and the agent autonomously gathers metrics, builds topology, correlates signals, and produces a root cause analysis with mitigation steps.
Prevention
Weekly evaluation that analyzes patterns across past investigations and generates proactive recommendations—covering observability gaps, infrastructure optimizations, pipeline improvements, and app resilience.
Skills
Custom runbooks or playbooks you upload to guide investigations. These act as pre-loaded hints for your specific applications, improving investigation quality for known failure modes.
Mitigation Plan
After identifying root cause, the agent produces specific actions to resolve the incident, validate success, and revert if needed. Critically, it won’t recommend actions if it doesn’t have enough evidence.
Setting Up
I created an Agent Space, connected my AWS account, and enabled the web app with an auto-created IAM role.
The Capabilities page also lets you add Telemetry sources (Dynatrace, Datadog) and Pipeline sources (GitHub, GitLab) for deeper investigation context:
Communications and MCP Server integrations let the agent route findings to Slack/ServiceNow and connect to custom tools:
I left all optional integrations empty to test the agent’s out-of-box capabilities. The Skills tab was also empty—no custom runbooks:
Prevention runs weekly and generates recommendations after you’ve completed investigations:
The Incident Response Dashboard
Clean interface: describe your investigation, use quick-start buttons (Latest alarm, High CPU usage, Error rate spike), and a chat sidebar for natural language infrastructure queries.
Creating Chaos
I launched a t2.medium EC2 instance (“demo”) in us-east-1, connected via SSM Session Manager, and ran the stress utility:
stress --cpu 2 --timeout 300   # Maximum chaos: peg both vCPUs for 5 minutes
stress --cpu 1 --timeout 300   # Sustained single-CPU spike
A t2.medium has 2 vCPUs with a 20% per-vCPU baseline. Running stress pushes it well above that baseline, burns CPU credits, and creates a clear CloudWatch signal.
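A quick sanity check on the credit math (using the standard t2.medium earn rate of 24 CPU credits/hour; one credit is one vCPU at 100% for one minute):

```shell
# t2.medium earns 24 CPU credits/hour; 1 credit = 1 vCPU-minute at 100%
earn_per_min=$(awk 'BEGIN { printf "%.1f", 24 / 60 }')              # credits earned per minute
baseline_pct=$(awk 'BEGIN { printf "%.0f", (24 / 60) / 2 * 100 }')  # sustainable % per vCPU across 2 vCPUs
echo "earn rate: ${earn_per_min} credits/min"
echo "baseline:  ${baseline_pct}% per vCPU"
# stress --cpu 2 drives both vCPUs to ~100%, consuming ~2 credits/min
# against 0.4/min earned -- a net drain of ~1.6 credits/min
```

That net drain is why the spike shows up so cleanly in the CPUCreditBalance metric as well as CPUUtilization.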
The Investigation
I ran two investigations. The targeted one used the prompt “High CPU Usage on one of the EC2 Instances in us-east-1,” along with the exact timestamp and starting points.
The agent immediately found the instance, noted it launched ~15 minutes before the incident, and began gathering metrics in parallel:
What the Agent Found
CPU utilization charts showed rapid escalation: 2.8% → 28.3% → 58.5% average across three 5-minute windows. Maximum hit 100% at 14:05 UTC.
CPU credits were burning at 2.9x the earn rate—5.83 credits consumed per 5-minute window vs. 0.4 credits/minute (2 credits per window) earned:
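The 2.9x figure checks out against the reported numbers:

```shell
# 5.83 credits burned per 5-minute window vs. 0.4 credits/min earned (2 per window)
ratio=$(awk 'BEGIN { printf "%.1f", 5.83 / (0.4 * 5) }')
echo "burn-to-earn ratio: ${ratio}x"   # -> 2.9x
```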
But the agent didn’t stop at CPU. It correlated across multiple signals simultaneously:
Network I/O — 75MB ingress during bootstrap, dropped before CPU spike → not network-driven
EBS Disk I/O — boot activity settled before spike → disk was a result, not a cause
Status Checks — all passing, no hardware failures
CWAgent — none found, flagging a per-process visibility gap
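The per-process gap the agent flagged is exactly what the CloudWatch Agent's procstat plugin covers. A minimal sketch of a config that would have exposed the stress process (the file path and the "stress" pattern are illustrative; see the CloudWatch Agent docs for the full schema):

```shell
# Minimal CloudWatch Agent config: per-process CPU/memory via procstat
# (process pattern "stress" and file path are illustrative)
cat > /tmp/cwagent-procstat.json <<'EOF'
{
  "metrics": {
    "metrics_collected": {
      "procstat": [
        {
          "pattern": "stress",
          "measurement": ["cpu_usage", "memory_rss"]
        }
      ]
    }
  }
}
EOF
# Load it on the instance:
#   sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
#     -a fetch-config -m ec2 -c file:/tmp/cwagent-procstat.json -s
```

With this in place, the agent would have seen exactly which process was eating CPU instead of having to infer it.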
The Mitigation Plan
This is where the agent showed real engineering judgment. The verdict: “No mitigation action can be identified.”
That’s not a failure—it’s the smartest outcome. The agent found a strong correlation between SSM sessions and the CPU spike, but couldn’t safely recommend action because of three gaps:
1. SSM Session Manager logging wasn’t configured—shell commands invisible
2. CloudWatch Agent wasn’t publishing per-process metrics yet
3. IAM permissions prevented running top or ps aux via SSM RunCommand
Instead of blindly suggesting “kill the process,” the agent gave specific next steps: run ps aux --sort=-%cpu via the console, configure SSM logging, and coordinate with the root user (it even identified the IP address).
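Those next steps map to standard AWS CLI calls; a sketch, assuming the operator has ssm:SendCommand permission (the instance ID is a placeholder):

```shell
# Run the top-CPU-process check the agent suggested
# (instance ID is a placeholder)
aws ssm send-command \
  --instance-ids i-0123456789abcdef0 \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["ps aux --sort=-%cpu | head -n 10"]' \
  --query 'Command.CommandId' --output text
# Fetch the output once the command finishes:
#   aws ssm get-command-invocation --command-id <command-id> \
#     --instance-id i-0123456789abcdef0
```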
Resolution
After the stress test completed, the agent detected the issue resolving:
✓ CPU dropped significantly after 14:10
✓ Credits healthy—no exhaustion
✓ No mutative actions detected
✓ Workload confirmed subsided
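The same resolution check can be reproduced by hand with the CloudWatch CLI (instance ID and time window are placeholders):

```shell
# Confirm CPU fell back toward baseline after the stress timeout
# (instance ID and timestamps are placeholders)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2025-01-01T14:10:00Z \
  --end-time 2025-01-01T14:30:00Z \
  --period 300 \
  --statistics Average Maximum
```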
The Verdict
Autonomous correlation is the killer feature. The agent didn’t just look at CPU—it simultaneously analyzed network, disk, credits, and status checks, building a coherent timeline explaining why the CPU spiked and when it started relative to instance launch.
It thinks like an engineer. The distinction between “disk I/O was a result of the CPU process, not a cause” showed real analytical reasoning, not just threshold alerting.
It knows what it doesn’t know. The agent flagged the CWAgent gap, refused unsafe mitigation when evidence was insufficient, and told you exactly what was missing and how to get it—including the specific commands to run and the IP of the user who launched the instance.
Bottom line: AWS DevOps Agent is a genuine shift from observability dashboards to operational reasoning. It won’t replace engineers, but it dramatically compresses the time between “something is wrong” and “here’s exactly what happened.” Worth trying during the free preview.