Hands-On Lab

I Tested AWS DevOps Agent—Here’s What Happened When I Broke My EC2 Instance

A deliberate CPU stress test on a t2.medium, and a front-row seat as Amazon’s frontier AI agent investigates the incident in real time.

📅 March 22, 2026 ⏱ 8 min read ☁️ AWS · DevOps · AI

Everyone talks about AI transforming DevOps. But how well does it actually work when your infrastructure is on fire? I decided to find out by deliberately breaking an EC2 instance and letting AWS DevOps Agent figure out what went wrong—without any hints from me.

What is AWS DevOps Agent?

AWS DevOps Agent (public preview) is Amazon’s “frontier agent”—an autonomous AI system that investigates production incidents the way an experienced on-call engineer would. It connects to CloudWatch, logs, code repos, CI/CD pipelines, and third-party observability tools, then correlates data across all of them to find root causes. Free during preview: 20 investigation hours/month, us-east-1 only.

Key Concepts

Before diving in, here’s how DevOps Agent is organized:

Agent Space

A logical container that defines what the agent can access. You can organize spaces by application, team, or on-call rotation. Each space has its own AWS account configs, integrations, and permissions.

Telemetry

Data sources the agent queries during investigations. Built-in: CloudWatch. Add-ons: Dynatrace, Datadog, New Relic, Splunk. This is the raw observability data the agent reasons over.

Communications

Integrations for routing findings to your team—Slack channels, ServiceNow tickets, PagerDuty. The agent pushes observations, root causes, and mitigation steps through these automatically.

MCP Server

Model Context Protocol—connect your own custom tools, proprietary ticketing systems, or open-source observability (Grafana, Prometheus) so the agent can query them during investigations.

Investigation

The core workflow. Describe the incident, give a starting point and timestamp, and the agent autonomously gathers metrics, builds topology, correlates signals, and produces a root cause analysis with mitigation steps.

Prevention

Weekly evaluation that analyzes patterns across past investigations and generates proactive recommendations—covering observability gaps, infrastructure optimizations, pipeline improvements, and app resilience.

Skills

Custom runbooks or playbooks you upload to guide investigations. These act as pre-loaded hints for your specific applications, improving investigation quality for known failure modes.

Mitigation Plan

After identifying the root cause, the agent produces specific actions to resolve the incident, validate success, and revert if needed. Critically, it won’t recommend actions when it lacks sufficient evidence.

Setting Up

I created an Agent Space, connected my AWS account, and enabled the web app with an auto-created IAM role.

[Screenshot: Enabling the DevOps Agent web app with an auto-created IAM role]
[Screenshot: Capabilities page — primary AWS account connected with valid status]

The Capabilities page also lets you add Telemetry sources (Dynatrace, Datadog) and Pipeline sources (GitHub, GitLab) for deeper investigation context:

[Screenshot: Telemetry and Pipeline integration options — supports Dynatrace, Datadog, GitHub, GitLab]

Communications and MCP Server integrations let the agent route findings to Slack/ServiceNow and connect to custom tools:

[Screenshot: Communications (Slack, ServiceNow) and MCP Server for custom tool integrations]

I left all optional integrations empty to test the agent’s out-of-box capabilities. The Skills tab was also empty—no custom runbooks:

[Screenshot: Skills tab — add custom runbooks to guide investigations]

Prevention runs weekly and generates recommendations after you’ve completed investigations:

[Screenshot: Prevention tab — weekly evaluation cycle for proactive recommendations]

The Incident Response Dashboard

[Screenshot: Incident Response Dashboard — investigation input, daily frequency chart, and chat sidebar]

The interface is clean: a free-text field to describe your investigation, quick-start buttons (Latest alarm, High CPU usage, Error rate spike), and a chat sidebar for natural-language infrastructure queries.

Creating Chaos

I launched a t2.medium EC2 instance (“demo”) in us-east-1, connected via SSM Session Manager, and ran the stress utility:

```shell
stress --cpu 2 --timeout 300   # peg both vCPUs for 5 minutes
stress --cpu 1 --timeout 300   # sustained single-vCPU spike
```

[Screenshot: Running the stress commands via SSM — both vCPUs pegged first, then a single worker]
[Screenshot: stress --cpu 1 running on the t2.medium instance]
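If you want to reproduce the setup, it can be done entirely from the CLI. A sketch, assuming Amazon Linux 2 (where stress comes from the EPEL repository; Amazon Linux 2023 ships stress-ng via dnf instead), the Session Manager plugin installed locally, and an instance profile with AmazonSSMManagedInstanceCore attached:

```shell
# Open a shell on the instance via SSM — no SSH keys or open ports needed.
aws ssm start-session --target i-0e207cff4e380c809 --region us-east-1

# Inside the session (Amazon Linux 2): stress lives in EPEL.
sudo amazon-linux-extras install epel -y
sudo yum install -y stress
```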

A t2.medium has 2 vCPUs with a 20% per-vCPU baseline, earning 24 CPU credits per hour (0.4/min). Running stress pushes utilization well above baseline, burns credits faster than they accrue, and creates a clear CloudWatch signal.

The Investigation

I ran two investigations. The targeted one: “High CPU Usage on one of the EC2 Instances in us-east-1” with exact timestamp and starting points.

[Screenshot: Starting the investigation with specific EC2 context and incident timestamp]

The agent immediately found the instance, noted it launched ~15 minutes before the incident, and began gathering metrics in parallel:

[Screenshot: Agent identified i-0e207cff4e380c809 (demo, t2.medium) — launched 15 min before the incident]

What the Agent Found

CPU utilization charts showed rapid escalation: 2.8% → 28.3% → 58.5% average across three 5-minute windows. Maximum hit 100% at 14:05 UTC.

[Screenshot: Average CPU escalated from 2.8% to 58.5%; maximum hit 99.4% and 100%]

CPU credits were burning at 2.9x the earn rate—5.83 credits consumed per 5-minute window versus the 2 earned (0.4/min):

[Screenshot: CPUCreditUsage peaks correlate with CPU spikes — burning credits 2.9x faster than earned]
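Those figures check out with napkin math. One CPU credit is one vCPU at 100% for one minute, so at 0.4 credits/min the instance earns 2 credits per 5-minute window against the 5.83 it was consuming:

```shell
# Sanity-check the agent's burn/earn ratio from the reported figures.
earn_per_min=0.4      # t2.medium credit earn rate (24 credits/hour)
burn_per_5min=5.83    # observed CPUCreditUsage per 5-minute window

earn_per_5min=$(awk -v e="$earn_per_min" 'BEGIN { print e * 5 }')
ratio=$(awk -v b="$burn_per_5min" -v e="$earn_per_5min" 'BEGIN { printf "%.1f", b / e }')

echo "earned per window: $earn_per_5min credits"   # 2
echo "burn/earn ratio:   ${ratio}x"                # 2.9x
```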

But the agent didn’t stop at CPU. It correlated across multiple signals simultaneously:

Network I/O — 75MB ingress during bootstrap, dropped before CPU spike → not network-driven
EBS Disk I/O — boot activity settled before spike → disk was a result, not a cause
Status Checks — all passing, no hardware failures
CWAgent — none found, flagging a per-process visibility gap

[Screenshot: Six observations — CPU spike, credit decline, network correlation, EBS I/O, status checks, CWAgent gap]
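A rough manual equivalent of the metric pulls the agent performed: CPUUtilization at 5-minute resolution over the incident window. The instance ID comes from the investigation; the exact timestamps are my assumption based on the 14:05 UTC spike.

```shell
# Pull the same CloudWatch statistics the agent correlated, by hand.
aws cloudwatch get-metric-statistics \
  --region us-east-1 \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0e207cff4e380c809 \
  --start-time 2026-03-22T13:45:00Z \
  --end-time 2026-03-22T14:15:00Z \
  --period 300 \
  --statistics Average Maximum
```

Swapping --metric-name for CPUCreditUsage or CPUCreditBalance reproduces the credit charts; NetworkIn and the disk metrics cover the other signals the agent checked.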

The Mitigation Plan

This is where the agent showed real engineering judgment. The verdict: “No mitigation action can be identified.”

[Screenshot: Agent refused unsafe mitigation — identified three evidence gaps and provided concrete next steps]

That’s not a failure—it’s the smartest outcome. The agent found a strong correlation between SSM sessions and the CPU spike, but couldn’t safely recommend action because of three gaps:

1. SSM Session Manager logging wasn’t configured—shell commands invisible
2. CloudWatch Agent wasn’t publishing per-process metrics yet
3. IAM permissions prevented running top or ps aux via SSM RunCommand

Instead of blindly suggesting “kill the process,” the agent gave specific next steps: run ps aux --sort=-%cpu via the console, configure SSM logging, and coordinate with the root user (it even identified the IP address).
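Once that IAM gap is closed, the process snapshot the agent asked for can be taken without an interactive session, via the standard AWS-RunShellScript document. A sketch, reusing the instance ID from the investigation:

```shell
# Run the process snapshot the agent wanted but couldn't execute itself.
aws ssm send-command \
  --region us-east-1 \
  --document-name "AWS-RunShellScript" \
  --targets "Key=InstanceIds,Values=i-0e207cff4e380c809" \
  --parameters 'commands=["ps aux --sort=-%cpu | head -n 10"]' \
  --query "Command.CommandId" --output text

# Fetch the output once the command completes, using the returned ID:
aws ssm get-command-invocation \
  --region us-east-1 \
  --command-id <command-id> \
  --instance-id i-0e207cff4e380c809
```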

Resolution

After the stress test completed, the agent detected the issue resolving:

[Screenshot: CPU dropped after 14:10, credits healthy, no mutative actions — workload subsided]

CPU dropped significantly after 14:10
Credits healthy—no exhaustion
No mutative actions detected
Workload confirmed subsided

The Verdict

Autonomous correlation is the killer feature. The agent didn’t just look at CPU—it simultaneously analyzed network, disk, credits, and status checks, building a coherent timeline explaining why the CPU spiked and when it started relative to instance launch.

It thinks like an engineer. The distinction between “disk I/O was a result of the CPU process, not a cause” showed real analytical reasoning, not just threshold alerting.

It knows what it doesn’t know. The agent flagged the CWAgent gap, refused unsafe mitigation when evidence was insufficient, and told you exactly what was missing and how to get it—including the specific commands to run and the IP of the user who launched the instance.

Bottom line: AWS DevOps Agent is a genuine shift from observability dashboards to operational reasoning. It won’t replace engineers, but it dramatically compresses the time between “something is wrong” and “here’s exactly what happened.” Worth trying during the free preview.

[Screenshot: Usage after 2 investigations — 13 min of 20 hours used, 998/1000 chat requests remaining]
Tags: AWS DevOps Agent · Incident Response · CloudWatch · EC2 · AIOps · Frontier Agent