AWS EC2 Right-Sizing Guide — How to Find and Fix Oversized Instances
Most AWS accounts are running EC2 instances 2–4x larger than needed. Right-sizing is the single highest-impact cost optimization you can make — I’ve seen it save 40–60% on compute costs alone. This guide walks through the exact process: collecting metrics, identifying candidates, safely resizing, and monitoring after.
Why Most EC2 Instances Are Oversized
It happens the same way everywhere: someone provisions an m5.xlarge “just to be safe,” the app works fine, and nobody revisits the decision. Six months later, you’re paying for 4 vCPUs and 16 GB RAM while your app uses 8% CPU and 3 GB memory.
Common causes: fear-driven provisioning, copy-paste from Stack Overflow, “we might need it someday,” and the absence of memory metrics (EC2 doesn’t report memory to CloudWatch by default — so people guess high).
The Right-Sizing Process
Right-sizing is not a one-time task. It’s a cycle:
- Collect metrics (CPU, memory, network, disk)
- Analyze utilization over 2+ weeks
- Identify candidates using thresholds
- Test the new size in staging first
- Monitor after resizing
- Repeat quarterly
Step 1: Enable Detailed CloudWatch Monitoring
Default CloudWatch gives you 5-minute intervals. Enable detailed monitoring for 1-minute granularity (note that detailed monitoring adds a small per-instance CloudWatch charge):
aws ec2 monitor-instances --instance-ids i-0abc123def456
Critical: EC2 does NOT report memory metrics by default. Install the CloudWatch Agent:
sudo yum install -y amazon-cloudwatch-agent
# Create config
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
# Start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
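If you'd rather skip the wizard, a minimal config that collects memory (and root-volume disk) utilization looks roughly like this; save it as the config.json referenced above. This is a sketch using the agent's documented metric names, not a complete production config:

```json
{
  "metrics": {
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": ["used_percent"],
        "resources": ["/"]
      }
    }
  }
}
```

Metrics land in the CWAgent namespace by default, keyed by instance ID.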
Without memory metrics, you’re making decisions blind. This is the #1 mistake in right-sizing.
Step 2: Collect 2 Weeks of Data
Don’t make decisions on 1 day of data. Workloads vary by day of week, time of month, and business cycles. Collect at minimum:
- CPU Utilization — average and peak
- Memory Utilization — from CloudWatch Agent
- Network In/Out — are you hitting bandwidth limits?
- Disk Read/Write Ops — I/O-bound workloads need different instance families
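One way to pull that history is the CloudWatch CLI. A sketch for daily average and peak CPU over the last 14 days; the instance ID is a placeholder, and the GNU `date` flags are an assumption (on macOS, use `date -v-14d` instead):

```shell
# Daily average + peak CPU for one instance over 14 days
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --start-time "$(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 86400 \
  --statistics Average Maximum \
  --query 'Datapoints[].{Day:Timestamp,Avg:Average,Peak:Maximum}' \
  --output table
```

Run the same query for `mem_used_percent` against the CWAgent namespace once the agent is reporting.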
Step 3: Use AWS Compute Optimizer
Enable it for free in the AWS Console: AWS Compute Optimizer → Get started. It analyzes your last 14 days of CloudWatch data and recommends instance types.
Key findings to look for:
- Over-provisioned — you’re paying for resources you don’t use
- Under-provisioned — your instance is struggling (rare, but important)
- Optimized — current size is appropriate
Compute Optimizer is a starting point, not gospel. It doesn’t know about your deployment patterns, burst behavior, or upcoming load changes.
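You can also pull its findings from the CLI instead of clicking through the console. A sketch, assuming Compute Optimizer is already enabled and your fleet is in us-east-1:

```shell
# List each instance's finding and the top recommended type
aws compute-optimizer get-ec2-instance-recommendations \
  --region us-east-1 \
  --query 'instanceRecommendations[].{Instance:instanceArn,Finding:finding,Recommended:recommendationOptions[0].instanceType}' \
  --output table
```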
Step 4: Identify Right-Sizing Candidates
Use these thresholds as a starting point:
| Metric | Threshold | Action |
|---|---|---|
| Avg CPU < 20%, Peak < 50% | Over-provisioned | Downsize instance type |
| Memory < 40% consistently | Over-provisioned | Consider smaller family |
| Network < 10% of instance bandwidth | Over-provisioned | Downsize, but check the smaller size's bandwidth cap first |
| Avg CPU > 80% | Under-provisioned | Upsize or optimize application |
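The table above can be sketched as a small shell function. Inputs are integer percentages (average CPU, peak CPU, average memory utilization); the cutoffs mirror the table and are starting points, not AWS-defined values:

```shell
#!/bin/sh
# classify AVG_CPU PEAK_CPU AVG_MEM — apply the right-sizing thresholds
classify() {
  avg_cpu=$1; peak_cpu=$2; avg_mem=$3
  if [ "$avg_cpu" -gt 80 ]; then
    echo "under-provisioned: upsize or optimize the application"
  elif [ "$avg_cpu" -lt 20 ] && [ "$peak_cpu" -lt 50 ] && [ "$avg_mem" -lt 40 ]; then
    echo "over-provisioned: downsize candidate"
  else
    echo "appropriately sized: leave as-is"
  fi
}

classify 8 35 26    # prints "over-provisioned: downsize candidate"
classify 85 95 60   # prints "under-provisioned: upsize or optimize the application"
```

Feed it the 14-day averages from Step 2, one instance per call.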
Step 5: Execute the Change
Standalone instance (stop/start required):
aws ec2 stop-instances --instance-ids i-0abc123
aws ec2 wait instance-stopped --instance-ids i-0abc123
aws ec2 modify-instance-attribute --instance-id i-0abc123 --instance-type Value=t3.medium
aws ec2 start-instances --instance-ids i-0abc123
Auto Scaling Group (zero downtime):
Update your launch template in Terraform:
resource "aws_launch_template" "app" {
  image_id      = "ami-0abcdef1234567890"
  instance_type = "t3.medium" # was m5.xlarge
  # ... rest of config
}

resource "aws_autoscaling_group" "app" {
  # ... min_size, max_size, vpc_zone_identifier, etc.
  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 90
    }
  }
}
The instance refresh gradually replaces old instances with the new type. Zero downtime.
Step 6: Monitor After Resizing
Set up CloudWatch alarms for the first week:
- CPU > 80% for 5 minutes → alarm
- Memory > 85% for 5 minutes → alarm
- Application response time > 2x baseline → alarm
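The first alarm in that list might look like this with the CLI; the alarm name, instance ID, and SNS topic ARN are placeholders:

```shell
# Alarm when average CPU exceeds 80% over a 5-minute period
aws cloudwatch put-metric-alarm \
  --alarm-name "post-resize-cpu-high" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
```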
Have a rollback plan: for ASG, revert the launch template. For standalone, stop and change back.
Instance Family Cheat Sheet
| Family | Best For | vCPU:Memory Ratio | Example Use |
|---|---|---|---|
| t3/t3a | Burstable, low-moderate CPU | 1:2 GB | Web servers, small apps, dev/staging |
| m5/m6i | General purpose, balanced | 1:4 GB | Application servers, mid-size databases |
| c5/c6i | Compute optimized | 1:2 GB | Batch processing, ML inference, gaming |
| r5/r6i | Memory optimized | 1:8 GB | In-memory caches, large databases |
Pro tip: t3 instances are massively underrated. For bursty workloads (which most web apps are), t3 with unlimited credits often costs far less than m5 with no practical performance difference.
Real Example: How I Right-Sized a Production Fleet
At my company, we had 12 m5.xlarge instances (4 vCPU, 16 GB) running our application servers. CloudWatch showed:
- Average CPU: 8%
- Peak CPU: 35% (during deployments)
- Memory: 4.2 GB average (26% utilization)
We moved to t3.large (2 vCPU, 8 GB); t3.medium's 4 GB would not have covered the 4.2 GB average memory footprint. Results (assuming a 720-hour billing month):
- m5.xlarge: $0.192/hr × 12 = $2.304/hr = $1,659/month
- t3.large: $0.0832/hr × 12 = $0.998/hr = $719/month
- Savings: $940/month (57%)
No performance impact. The t3 burst credits easily handled deployment spikes. We monitored for 2 weeks before committing.
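The arithmetic is worth sanity-checking before any resize. A one-liner assuming a 720-hour billing month, using the m5.xlarge on-demand rate quoted above (rates vary by region; swap in your own):

```shell
# monthly_cost HOURLY_RATE INSTANCE_COUNT — fleet cost per 720-hour month
monthly_cost() {
  awk -v rate="$1" -v n="$2" 'BEGIN { printf "%.2f\n", rate * n * 720 }'
}

monthly_cost 0.192 12   # m5.xlarge fleet: prints 1658.88
```

Run it for the current and candidate instance types, then compare before committing.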
Common Mistakes to Avoid
- Sizing for peak only — if peak is 50% but happens 1% of the time, you’re wasting 99% of the time
- Forgetting memory metrics — CPU looks fine but memory is at 90%? Don’t downsize
- Skipping staging tests — always test the new size with realistic load first
- Ignoring burstable instances — t3 is perfect for 80%+ of web workloads
- One-time exercise — right-sizing is quarterly, not annual
Frequently Asked Questions
How often should I right-size EC2 instances?
Quarterly at minimum. Workloads change, and AWS launches new instance types regularly. Set a calendar reminder.
Can I right-size without downtime?
For Auto Scaling Groups, yes — use instance refresh with rolling updates. For standalone instances, you need a brief stop/start (typically under 2 minutes).
What if my app is memory-bound, not CPU-bound?
Install the CloudWatch Agent first. EC2 doesn’t report memory by default. Once you have memory data, you can make informed decisions about whether to change instance family (e.g., m5 → r5 for memory-heavy workloads, or m5 → t3 if memory is low).
Should I use Spot Instances instead of right-sizing?
Both. Right-size first to find the correct instance type, then evaluate Spot for fault-tolerant workloads. Right-sizing is risk-free; Spot requires handling interruptions.
Does right-sizing affect Reserved Instance savings?
Yes. Always right-size first, then purchase RIs for the correct size. Buying RIs for oversized instances locks in waste for 1–3 years.