How to Set Up Prometheus + Grafana on AWS EC2 with Terraform
Prometheus collects and stores metrics. Grafana visualizes them. Together, they give you complete infrastructure monitoring — CPU, memory, disk, network, and custom application metrics — with powerful alerting. This guide sets up both on AWS EC2 using Docker Compose, with Terraform for the infrastructure.
I’ve caught production issues hours before customers noticed them using this exact setup. Monitoring is not optional — it’s how you sleep at night while running production infrastructure.
What Prometheus and Grafana Actually Do
Prometheus is a pull-based monitoring system. It scrapes metrics from your applications and infrastructure at regular intervals, stores them as time-series data, and provides PromQL — a powerful query language for analyzing metrics.
Grafana is a visualization platform. It connects to Prometheus (and many other data sources), lets you build dashboards, and sends alerts when things go wrong.
Node Exporter runs on each server and exposes system metrics (CPU, memory, disk, network) that Prometheus scrapes.
Prerequisites
- AWS account with EC2 permissions
- Terraform installed locally
- Basic understanding of EC2 and security groups
- SSH key pair in your AWS region
Infrastructure with Terraform
First, create the EC2 instance with the right security group:
resource "aws_security_group" "monitoring" {
  name        = "monitoring-stack"
  description = "Prometheus + Grafana"
  vpc_id      = var.vpc_id

  ingress {
    description = "Grafana"
    from_port   = 3000
    to_port     = 3000
    protocol    = "tcp"
    cidr_blocks = [var.my_ip]
  }

  ingress {
    description = "Prometheus"
    from_port   = 9090
    to_port     = 9090
    protocol    = "tcp"
    cidr_blocks = [var.my_ip]
  }

  ingress {
    description = "SSH"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.my_ip]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
resource "aws_instance" "monitoring" {
  ami                    = "ami-0abcdef1234567890" # Ubuntu 22.04 (look up the current AMI ID for your region)
  instance_type          = "t3.small"
  key_name               = var.key_name
  vpc_security_group_ids = [aws_security_group.monitoring.id]
  subnet_id              = var.public_subnet_id

  root_block_device {
    volume_size = 30
    volume_type = "gp3"
  }

  user_data = <<-EOF
    #!/bin/bash
    apt-get update -y
    apt-get install -y docker.io docker-compose
    systemctl enable docker
    systemctl start docker
    usermod -aG docker ubuntu
  EOF

  tags = { Name = "monitoring-stack" }
}
Important: Restrict Prometheus (9090) and Grafana (3000) to your IP only. Never expose them to 0.0.0.0/0.
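The configuration above references four input variables. A minimal variables.tf to go with it might look like the following sketch (descriptions and the example CIDR are illustrative; substitute your own values), plus an output so Terraform prints the instance IP:

```hcl
variable "vpc_id" {
  description = "VPC to deploy the monitoring instance into"
  type        = string
}

variable "public_subnet_id" {
  description = "Public subnet for the instance"
  type        = string
}

variable "key_name" {
  description = "Name of an existing EC2 key pair in the region"
  type        = string
}

variable "my_ip" {
  description = "Your IP in CIDR form, e.g. 203.0.113.10/32"
  type        = string
}

output "monitoring_public_ip" {
  value = aws_instance.monitoring.public_ip
}
```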
Installing Prometheus
SSH into the instance and create the config files:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
Create docker-compose.yml:
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"
    restart: unless-stopped

volumes:
  prometheus_data:
Start it: docker-compose up -d. Verify at http://<your-ip>:9090. You should see the Prometheus UI with the “prometheus” target showing as UP.
Installing Grafana
Add Grafana to your docker-compose.yml:
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your-secure-password
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
Restart: docker-compose up -d. Access Grafana at http://<your-ip>:3000. Login with admin / your password.
Add Prometheus as a data source: Settings → Data Sources → Add → Prometheus → URL: http://prometheus:9090 → Save & Test.
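If you would rather not click through the UI, Grafana can also provision the data source from a file on startup. A sketch (mount it into the container at /etc/grafana/provisioning/datasources/; the file name is arbitrary):

```yaml
# datasource.yml — Grafana data source provisioning
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```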
Adding Node Exporter for System Metrics
Add node-exporter to docker-compose.yml:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    # Mount the host filesystem read-only so node-exporter reports
    # the host's metrics rather than the container's
    volumes:
      - /:/host:ro,rslave
    command:
      - "--path.rootfs=/host"
    restart: unless-stopped
This exposes CPU, memory, disk, filesystem, and network metrics. Prometheus scrapes it automatically (we already configured the scrape target).
Restart and verify: docker-compose up -d. Check Prometheus targets page — node-exporter should show as UP.
Building Your First Dashboard
The fastest way: import a community dashboard.
- In Grafana: Dashboards → Import
- Enter dashboard ID: 1860 (Node Exporter Full)
- Select your Prometheus data source
- Click Import
You instantly get CPU, memory, disk, network, and system load panels.
To create a custom CPU usage panel:
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
For memory usage:
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
For disk usage:
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
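The memory and disk expressions are both just "fraction used, as a percent". A quick sanity check of the arithmetic behind the memory query, in Python (the byte counts are made-up sample values, not real node_exporter output):

```python
# Mirrors the PromQL expression:
# (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

def memory_used_percent(mem_available_bytes: float, mem_total_bytes: float) -> float:
    """Percentage of memory in use, same arithmetic as the PromQL query."""
    return (1 - mem_available_bytes / mem_total_bytes) * 100

# Illustrative values: 4 GiB available out of 16 GiB total
GIB = 1024 ** 3
pct = memory_used_percent(4 * GIB, 16 * GIB)
print(f"{pct:.1f}% used")  # 75.0% used
```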
Setting Up Alerts
In Grafana: Alerting → Alert Rules → New Alert Rule.
CPU alert example:
- Query: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- Condition: Is above 80
- For: 5 minutes (avoids false positives from brief spikes)
- Summary: “High CPU usage on monitoring server”
Set up notification channels: Alerting → Contact Points → Add. Options: Email, Slack webhook, PagerDuty, etc.
What to alert on:
- CPU > 80% for 5 min
- Memory > 85% for 5 min
- Disk > 90%
- Instance down (up == 0)
What NOT to alert on: Brief spikes, cosmetic metrics, things you can’t act on at 3 AM. Alert fatigue kills monitoring.
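This guide manages alerts in Grafana, but the same thresholds can also live in a Prometheus rules file under version control (referenced via rule_files in prometheus.yml; routing the notifications additionally requires Alertmanager, which this guide does not cover). A sketch:

```yaml
# alert-rules.yml — load it from prometheus.yml with:
#   rule_files:
#     - "alert-rules.yml"
groups:
  - name: host-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        annotations:
          summary: "Target {{ $labels.instance }} is down"

      - alert: HighCpuUsage
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        annotations:
          summary: "High CPU usage on monitoring server"
```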
Monitoring Your Application
System metrics are just the start. For application monitoring, instrument your code with Prometheus client libraries:
- Python: prometheus_client
- Node.js: prom-client
- Go: prometheus/client_golang
- Java: micrometer
Track the 4 golden signals:
- Latency — how long requests take
- Traffic — requests per second
- Errors — error rate (5xx responses)
- Saturation — how full your system is
Production Hardening
- Reverse proxy with SSL: Put Nginx in front of Grafana with Let’s Encrypt. Never run Grafana on plain HTTP in production.
- Authentication: Disable anonymous access (GF_AUTH_ANONYMOUS_ENABLED=false). Enable OAuth if your team uses Google/GitHub SSO.
- Prometheus retention: Set --storage.tsdb.retention.time=30d (or whatever fits your disk). The default is 15 days.
- Backup Grafana dashboards: Export as JSON regularly, or use Grafana’s built-in backup. Losing dashboards after weeks of tuning is painful.
- Resource limits: Add mem_limit and cpus to Docker Compose to prevent runaway processes.
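For that last point, a sketch of service-level limits in the Compose file (the values are illustrative; tune them to your instance size, and note that support for these keys depends on your Compose version: the newer docker compose v2 CLI accepts them directly, while the legacy docker-compose v1 binary only honors them with compose file format 2.x):

```yaml
services:
  prometheus:
    # ...existing configuration...
    mem_limit: 1g
    cpus: 0.5
  grafana:
    # ...existing configuration...
    mem_limit: 512m
    cpus: 0.5
```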
Common Mistakes to Avoid
- Alert fatigue: Alerting on everything means you ignore everything. Be selective.
- No retention limits: Prometheus will fill your disk. Always set --storage.tsdb.retention.time.
- Exposing ports to the internet: Prometheus has no built-in auth. Restrict with security groups.
- No persistent volumes: Without Docker volumes, you lose all data on container restart.
- Skipping memory metrics: CPU is not enough. Install node-exporter for the full picture.
Frequently Asked Questions
How much does it cost to run Prometheus and Grafana on AWS?
A t3.small instance runs about $15/month. Add $2.40/month for 30 GB gp3 storage. Total: under $20/month for a complete monitoring stack.
Can I use Prometheus with ECS or EKS?
Yes. For ECS, use Prometheus ECS service discovery. For EKS, the kube-prometheus-stack Helm chart is the standard — it bundles Prometheus, Grafana, and Alertmanager with Kubernetes-native service discovery.
Prometheus vs CloudWatch — which should I use?
Use CloudWatch for native AWS service metrics (RDS, ALB, Lambda). Use Prometheus for custom application metrics, PromQL queries, and multi-cloud setups. Many teams use both.
How long does Prometheus store data?
Default: 15 days. Configure with --storage.tsdb.retention.time=30d (or any duration). For long-term storage beyond months, look at Thanos or Cortex.
Is Grafana free?
Yes. Grafana OSS is fully open-source. Self-host it for free. Grafana Cloud also has a free tier (10k metrics, 50 GB logs, 50 GB traces).