AWS DevOps Cheatsheet
The single-page reference you bookmark and keep open in a tab. Copy-paste-ready IAM policies, Terraform snippets, kubectl commands, CLI one-liners, cost tips, and the errors you'll actually hit in production.
Written by an AWS DevOps engineer for AWS DevOps engineers. Free, printable, no signup. Updated as things break in real infra.
🔐 IAM Policies
The five templates you'll copy 90% of the time. Adapt the ARNs, account IDs, and conditions to your context.
Least-privilege S3 read on one prefix
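{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "s3:ListBucket",
"Resource": "arn:aws:s3:::my-bucket",
"Condition": {"StringLike": {"s3:prefix": ["reports/*"]}}
},
{
"Effect": "Allow",
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::my-bucket/reports/*"
}
]
}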
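If you stamp this policy out for many buckets and prefixes, a small generator keeps the two ARNs in sync. A minimal sketch, assuming a hypothetical helper name `s3_prefix_read_policy`; it scopes `s3:ListBucket` to the prefix with the `s3:prefix` condition key, since listing is a bucket-level action while `s3:GetObject` is object-level.

```python
import json


def s3_prefix_read_policy(bucket: str, prefix: str) -> dict:
    """Build a read-only policy scoped to one prefix of one bucket (illustrative helper)."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action: target the bucket ARN and
                # narrow it to the prefix with the s3:prefix condition key.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Reads are object-level: target the object ARN pattern.
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(s3_prefix_read_policy("my-bucket", "reports"), indent=2))
```

Paste the printed JSON into the console or feed it to your IaC tool of choice.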
Cross-account AssumeRole with ExternalId (confused-deputy safe)
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {"AWS": "arn:aws:iam::123456789012:root"},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {"sts:ExternalId": "unique-per-vendor-value"}
}
}]
}
Enforce MFA on privileged actions
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": ["iam:*", "kms:Decrypt"],
"Resource": "*",
"Condition": {
"Bool": {"aws:MultiFactorAuthPresent": "true"}
}
}]
}
Deny dangerous actions via SCP (Org-wide guardrail)
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Deny",
"Action": ["iam:DeleteRole", "iam:DeleteUser", "cloudtrail:StopLogging"],
"Resource": "*"
}]
}
Permissions boundary for developer roles
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"NotAction": ["iam:*", "organizations:*", "account:*"],
"Resource": "*"
}]
}
🛡 Security Groups
Stateful, allow-only. Stack rules tightly. Never open 22/3389/database ports to 0.0.0.0/0.
Rules you should actually use
| Purpose | Attached to | Port | Source |
|---|---|---|---|
| HTTPS from internet | ALB SG | 443 | 0.0.0.0/0 |
| HTTP redirect | ALB SG | 80 | 0.0.0.0/0 → redirect to 443 |
| App from ALB | App SG | 8080 | ALB SG id |
| DB from app | DB SG | 5432 | App SG id |
| SSH admin | EC2 SG | 22 | SSM Session Manager (no SG rule needed) |
Rules to NEVER use
- 22 / 0.0.0.0/0: SSH to world = brute-force bot heaven
- 3389 / 0.0.0.0/0: RDP to world = ransomware vector
- 5432 / 0.0.0.0/0 or any DB port: credential theft
- All protocols (-1) / 0.0.0.0/0 egress: allows data exfiltration from a compromised host
- Any other SG reference to 0.0.0.0/0 (beyond the ALB's 80/443) without a time-bound break-glass rule
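The "never use" rules above are easy to check mechanically. A minimal sketch, assuming rules have already been exported into plain dicts; the key names (`protocol`, `from_port`, `to_port`, `cidr`) and the `SENSITIVE_PORTS` set are illustrative choices, not an AWS API shape.

```python
# Ports treated as admin/database ports here are an illustrative assumption.
SENSITIVE_PORTS = {22, 3389, 3306, 5432, 1433, 6379, 27017}
WORLD_CIDRS = {"0.0.0.0/0", "::/0"}


def risky_ingress(rules):
    """Return the ingress rules that expose a sensitive port to the whole internet.

    Each rule is a dict with hypothetical keys: protocol, from_port, to_port, cidr.
    Protocol "-1" means all protocols and all ports.
    """
    flagged = []
    for rule in rules:
        if rule["cidr"] not in WORLD_CIDRS:
            continue  # scoped source, e.g. a corporate range
        all_traffic = rule["protocol"] == "-1"
        low = rule.get("from_port", 0)
        high = rule.get("to_port", 65535)
        hits_sensitive = any(low <= port <= high for port in SENSITIVE_PORTS)
        if all_traffic or hits_sensitive:
            flagged.append(rule)
    return flagged
```

Note that a wide port range such as 3000-4000 is flagged too, because it silently includes 3306 and 3389.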
⚙ Terraform Snippets
The blocks you paste into every new project. Pre-configured for security + encryption.
Remote state with S3 + DynamoDB lock + encryption
terraform {
required_providers { aws = { source = "hashicorp/aws", version = "~> 5.0" } }
backend "s3" {
bucket = "mycompany-tfstate"
key = "prod/main.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "tfstate-locks"
kms_key_id = "alias/tfstate"
}
}
Production VPC (multi-AZ, private/public)
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "~> 5.0"
name = "prod"
cidr = "10.0.0.0/16"
azs = ["us-east-1a","us-east-1b","us-east-1c"]
private_subnets = ["10.0.1.0/24","10.0.2.0/24","10.0.3.0/24"]
public_subnets = ["10.0.101.0/24","10.0.102.0/24","10.0.103.0/24"]
enable_nat_gateway = true
single_nat_gateway = false # prod = 1 per AZ; dev = true to save $
enable_vpn_gateway = false
enable_dns_hostnames = true
}
RDS Postgres — encrypted, Multi-AZ, private, no final skip
resource "aws_db_instance" "prod" {
identifier = "prod-postgres"
engine = "postgres"
engine_version = "16.3"
instance_class = "db.t3.medium"
allocated_storage = 100
storage_type = "gp3"
storage_encrypted = true
kms_key_id = aws_kms_key.rds.arn
db_subnet_group_name = aws_db_subnet_group.private.name
vpc_security_group_ids = [aws_security_group.db.id]
publicly_accessible = false
multi_az = true
backup_retention_period = 7
deletion_protection = true
skip_final_snapshot = false
final_snapshot_identifier = "prod-postgres-final"
performance_insights_enabled = true
}
S3 bucket — private, encrypted, versioned, block-public
resource "aws_s3_bucket" "data" { bucket = "mycompany-data" }
resource "aws_s3_bucket_public_access_block" "data" {
bucket = aws_s3_bucket.data.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
resource "aws_s3_bucket_versioning" "data" {
bucket = aws_s3_bucket.data.id
versioning_configuration { status = "Enabled" }
}
resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
bucket = aws_s3_bucket.data.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
kms_master_key_id = aws_kms_key.s3.arn
}
}
}
ALB + Target Group + HTTP→HTTPS redirect
resource "aws_lb" "app" {
name = "app"
load_balancer_type = "application"
subnets = module.vpc.public_subnets
security_groups = [aws_security_group.alb.id]
}
resource "aws_lb_listener" "https" {
load_balancer_arn = aws_lb.app.arn
port = 443
protocol = "HTTPS"
certificate_arn = aws_acm_certificate.cert.arn
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.app.arn
}
}
resource "aws_lb_listener" "http_redirect" {
load_balancer_arn = aws_lb.app.arn
port = 80
protocol = "HTTP"
default_action {
type = "redirect"
redirect {
port = "443"
protocol = "HTTPS"
status_code = "HTTP_301"
}
}
}
Lambda + CloudWatch role (bare minimum)
data "aws_iam_policy_document" "assume" {
statement {
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["lambda.amazonaws.com"]
}
}
}
resource "aws_iam_role" "lambda" {
name = "app-lambda"
assume_role_policy = data.aws_iam_policy_document.assume.json
}
resource "aws_iam_role_policy_attachment" "basic_exec" {
role = aws_iam_role.lambda.name
policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}
🌐 VPC & Networking
Pick CIDRs that won't collide with your on-prem or partners.
CIDR sizing quick reference
| Prefix | IPs | Usable | Typical use |
|---|---|---|---|
| /16 | 65,536 | 65,531 | VPC |
| /20 | 4,096 | 4,091 | Large subnet |
| /22 | 1,024 | 1,019 | Medium subnet |
| /24 | 256 | 251 | Standard subnet |
| /26 | 64 | 59 | Small subnet |
| /28 | 16 | 11 | Minimum AWS subnet |
AWS reserves 5 IPs per subnet: network, VPC router, DNS, future use, broadcast.
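The "Usable" column is just the block size minus the five reserved addresses. A minimal sketch using Python's standard `ipaddress` module; the helper names are illustrative.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # network, VPC router, DNS, future use, broadcast


def usable_ips(cidr: str) -> int:
    """Addresses left for your own ENIs in a subnet of this size."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def reserved_addresses(cidr: str) -> list:
    """The first four addresses and the last one, which AWS keeps for itself."""
    network = ipaddress.ip_network(cidr)
    return [str(network[i]) for i in (0, 1, 2, 3, -1)]


if __name__ == "__main__":
    for cidr in ("10.0.0.0/16", "10.0.1.0/24", "10.0.1.0/28"):
        print(cidr, usable_ips(cidr))
```

Handy when sizing EKS subnets, where every pod can consume an IP.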
RFC 1918 private ranges (don't collide)
- 10.0.0.0/8: 16.7M IPs. Most common for AWS VPCs.
- 172.16.0.0/12: 1M IPs. Default Docker bridge is here (172.17.0.0/16).
- 192.168.0.0/16: 65K IPs. Home routers use this. Avoid for corp.
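Before committing to a VPC CIDR, it is worth checking it against every range you already route to. A minimal sketch with the standard `ipaddress` module; `find_collisions` is a hypothetical helper and the example ranges are made up.

```python
import ipaddress


def find_collisions(candidate: str, existing: list) -> list:
    """Return the existing CIDRs that overlap the candidate VPC CIDR."""
    candidate_net = ipaddress.ip_network(candidate)
    return [
        cidr
        for cidr in existing
        if candidate_net.overlaps(ipaddress.ip_network(cidr))
    ]


if __name__ == "__main__":
    # Hypothetical on-prem, Docker bridge, and partner ranges.
    in_use = ["10.0.128.0/20", "172.17.0.0/16", "192.168.1.0/24"]
    print(find_collisions("10.0.0.0/16", in_use))
```

An empty list means the candidate is safe to peer or connect over VPN with the ranges you listed.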
Security Groups vs NACLs — 30-sec version
| | Security Group | NACL |
|---|---|---|
| Stateful? | Yes (return traffic allowed automatically) | No (needs rules in both directions) |
| Scope | ENI / instance | Subnet |
| Rules | Allow only (implicit deny) | Allow + explicit Deny |
| Order | All rules evaluated | Lowest rule number that matches wins |
| Default | Deny all in / allow all out | Default NACL allows all; custom NACLs deny all until you add rules |
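The "Order" row is the one that bites people: a NACL stops at the first matching rule, so a low-numbered Deny beats a higher-numbered Allow. A simplified sketch of that evaluation; the tuple layout `(rule_number, cidr, from_port, to_port, action)` is an assumption for illustration and ignores protocol matching.

```python
import ipaddress


def nacl_decision(rules, src_ip: str, port: int) -> str:
    """Evaluate rules in ascending rule-number order; the first match decides."""
    address = ipaddress.ip_address(src_ip)
    for _number, cidr, low, high, action in sorted(rules, key=lambda r: r[0]):
        if address in ipaddress.ip_network(cidr) and low <= port <= high:
            return action
    return "deny"  # the implicit catch-all rule at the end of every NACL


if __name__ == "__main__":
    rules = [
        (200, "0.0.0.0/0", 443, 443, "allow"),
        (100, "203.0.113.0/24", 0, 65535, "deny"),  # block one noisy range first
    ]
    print(nacl_decision(rules, "203.0.113.9", 443))
    print(nacl_decision(rules, "198.51.100.7", 443))
```

A Security Group has no equivalent of rule 100: it can only allow, so blocking a specific range has to happen at the NACL (or WAF) layer.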
🖥 EC2 Instance Families
The one-line rule for picking the right instance.
| Family | Purpose | Use for |
|---|---|---|
| t3 / t4g | Burstable | Dev, low-traffic web, CI runners |
| m5 / m6i / m7g | General | Most production web apps, small DBs |
| c5 / c6i / c7g | Compute-optimized | Video encoding, batch, game servers |
| r5 / r6i / r7g | Memory-optimized | Redis, in-memory DBs, analytics |
| x2iedn | High memory | SAP HANA, large in-memory |
| i4i / im4gn | Storage-optimized | NoSQL, search, data warehouse nodes |
| g5 / p5 | GPU | ML inference / training |
| inf2 / trn1 | AWS Inferentia / Trainium | ML at optimized cost |
Graviton (ARM) = t4g / m7g / c7g / r7g: up to 40% better price/performance than x86 for most modern Linux workloads. Skip it if you ship x86-only binaries (legacy Windows apps, some proprietary software).
🪣 S3 Storage Classes
Match storage class to access pattern — can cut object-storage costs by 70%+.
| Class | Price /GB/mo | Min duration | Retrieval time | Use for |
|---|---|---|---|---|
| Standard | $0.023 | — | ms | Active data, <30d access pattern |
| Intelligent-Tiering | $0.023 → $0.004 | — | ms | Unknown/unpredictable access |
| Standard-IA | $0.0125 | 30 days | ms | Monthly-access backups |
| One Zone-IA | $0.01 | 30 days | ms | Re-creatable / secondary copies |
| Glacier Instant | $0.004 | 90 days | ms | Quarterly-access archives |
| Glacier Flexible | $0.0036 | 90 days | 1 min → 12 h | Rarely-accessed archives |
| Glacier Deep Archive | $0.00099 | 180 days | 12 → 48 h | Compliance retention 7+ yrs |
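To see where the "70%+" comes from, plug the per-GB prices from the table into a quick calculation. A rough sketch that covers storage charges only; it deliberately ignores request, retrieval, monitoring, and minimum-duration fees, which can change the answer for frequently read data.

```python
# Per-GB monthly prices copied from the table above (list prices can change).
PRICE_PER_GB_MONTH = {
    "Standard": 0.023,
    "Standard-IA": 0.0125,
    "One Zone-IA": 0.01,
    "Glacier Instant": 0.004,
    "Glacier Flexible": 0.0036,
    "Glacier Deep Archive": 0.00099,
}


def monthly_storage_cost(gb: float, storage_class: str) -> float:
    """Storage-only cost in USD for one month."""
    return gb * PRICE_PER_GB_MONTH[storage_class]


def savings_vs_standard(storage_class: str) -> float:
    """Fraction of the Standard storage price saved by using this class."""
    return 1 - PRICE_PER_GB_MONTH[storage_class] / PRICE_PER_GB_MONTH["Standard"]


if __name__ == "__main__":
    for name in PRICE_PER_GB_MONTH:
        cost = monthly_storage_cost(1000, name)
        print(f"{name}: ${cost:.2f}/mo per TB, {savings_vs_standard(name):.0%} saved")
```

On storage price alone, Standard-IA saves a bit under half, and the Glacier tiers are where the savings pass 80%.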
Lifecycle rule: hot → cold → delete
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
bucket = aws_s3_bucket.logs.id
rule {
id = "logs-tiering"
status = "Enabled"
transition {
days = 30
storage_class = "STANDARD_IA"
}
transition {
days = 90
storage_class = "GLACIER"
}
transition {
days = 365
storage_class = "DEEP_ARCHIVE"
}
expiration { days = 2555 } # 7 years
}
}
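To sanity-check a lifecycle rule before applying it, you can model which class an object should be in at a given age. A minimal sketch that mirrors the day thresholds in the rule above; real transitions run asynchronously, so treat the boundaries as approximate.

```python
# Thresholds mirror the lifecycle rule: (age in days, target storage class).
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER"), (365, "DEEP_ARCHIVE")]
EXPIRATION_DAYS = 2555  # 7 years


def storage_class_at(age_days: int):
    """Expected storage class for an object of this age, or None once expired."""
    if age_days >= EXPIRATION_DAYS:
        return None
    current = "STANDARD"
    for threshold, storage_class in TRANSITIONS:
        if age_days >= threshold:
            current = storage_class
    return current


if __name__ == "__main__":
    for age in (0, 45, 120, 400, 3000):
        print(age, storage_class_at(age))
```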
☸ kubectl — the 20 commands you use daily
Aliases, context switching, troubleshooting.
Setup
# Connect to EKS cluster
aws eks update-kubeconfig --name prod-cluster --region us-east-1
# Check current context + namespace
kubectl config current-context
kubectl config view --minify --output 'jsonpath={..namespace}'
# Switch namespace (install kubens first: brew install kubectx)
kubens production
Inspect
kubectl get pods -A # all namespaces
kubectl get pods -o wide -l app=api # filter by label
kubectl describe pod <pod> # events + status
kubectl get events --sort-by=.lastTimestamp | tail -20
kubectl top pods # needs metrics-server
kubectl top nodes
Logs & debug
kubectl logs -f <pod> # tail
kubectl logs <pod> -c sidecar # specific container
kubectl logs <pod> --previous # crashed pod
kubectl logs -l app=api --tail=50 # all pods of a label
kubectl exec -it <pod> -- sh
kubectl debug -it <pod> --image=busybox --target=<container>
Apply & rollout
kubectl apply -f deployment.yaml
kubectl rollout status deployment/api
kubectl rollout undo deployment/api # rollback
kubectl rollout restart deployment/api
kubectl scale deployment api --replicas=5
kubectl set image deployment/api api=myapp:v1.2.4
Port-forward & copy
kubectl port-forward svc/api 8080:80
kubectl cp <pod>:/var/log/app.log ./app.log
kubectl cp ./fix.sh <pod>:/tmp/fix.sh
RBAC quick check
kubectl auth can-i create deployments -n prod
kubectl auth can-i '*' '*' --as=system:serviceaccount:prod:app-sa
kubectl get rolebindings,clusterrolebindings -A -o wide | grep <user>
⌨ AWS CLI One-Liners
Paste, replace placeholders, done.
S3
# Sync local → S3 with delete + SSE-KMS
aws s3 sync ./site s3://mybucket --delete --sse aws:kms
# Presigned GET URL valid 1 hour
aws s3 presign s3://mybucket/key --expires-in 3600
# Size of a bucket
aws s3 ls s3://mybucket --recursive --summarize | tail -2
# Delete all versions in versioned bucket (max 1,000 keys per call; delete markers need a second pass over DeleteMarkers[])
aws s3api delete-objects --bucket mybucket --delete "$(aws s3api list-object-versions --bucket mybucket --output=json --query='{Objects: Versions[].{Key:Key,VersionId:VersionId}}')"
EC2
# Running instances w/ names
aws ec2 describe-instances --query 'Reservations[].Instances[?State.Name==`running`].[InstanceId,Tags[?Key==`Name`].Value|[0],InstanceType]' --output table
# Stop all running dev instances (filter by tag)
aws ec2 stop-instances --instance-ids $(aws ec2 describe-instances --filters Name=tag:Env,Values=dev Name=instance-state-name,Values=running --query 'Reservations[].Instances[].InstanceId' --output text)
# SSM start-session (SSH replacement)
aws ssm start-session --target i-0abc123
IAM
# Who am I?
aws sts get-caller-identity
# List users with last-used info
aws iam list-users --query 'Users[].[UserName,PasswordLastUsed]' --output table
# Rotate access key: create the new one, deactivate the old one, then delete it
aws iam create-access-key --user-name alice
aws iam update-access-key --user-name alice --access-key-id AKIA... --status Inactive
aws iam delete-access-key --user-name alice --access-key-id AKIA...
CloudFormation / Terraform helpers
# Active profile + region
aws configure list
aws configure get region
# Assume role quickly
eval "$(aws sts assume-role --role-arn arn:aws:iam::123:role/Admin --role-session-name me | jq -r '.Credentials|"export AWS_ACCESS_KEY_ID=\(.AccessKeyId) AWS_SECRET_ACCESS_KEY=\(.SecretAccessKey) AWS_SESSION_TOKEN=\(.SessionToken)"')"
📊 CloudWatch Logs Insights
Query your logs like SQL. Save these in the console for one-click access.
Top 10 slowest Lambda invocations
fields @timestamp, @message, @duration
| filter @type = "REPORT"
| sort @duration desc
| limit 10
Lambda errors grouped by function
fields @timestamp, @message
| filter @message like /ERROR|Exception|Traceback/
| stats count() as errors by @log
| sort errors desc
VPC Flow Logs — rejected connections by source IP
fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter action = "REJECT"
| stats count() as attempts by srcAddr
| sort attempts desc
| limit 25
💰 Cost Optimization — Top 10 Wins
In rough order of highest impact for least effort.
- Commit to Savings Plans — up to 72% off steady-state EC2, Lambda, Fargate. Start with a low 1-year Compute SP.
- Right-size EC2 and RDS — AWS Compute Optimizer runs ML on your CloudWatch data and picks the right instance.
- Turn off dev at night: Lambda + EventBridge stops dev EC2 / RDS from 7pm to 7am and over weekends, cutting running hours by roughly 64% (nights alone: 50%).
- S3 lifecycle rules — move logs and old data to IA / Glacier automatically.
- VPC endpoints instead of NAT Gateway: S3 + DynamoDB gateway endpoints are FREE and bypass NAT data-processing charges.
- EBS gp2 → gp3 — 20% cheaper, more throughput. Migrate with zero downtime via modify-volume.
- Delete unattached EIPs — $3.60/mo per unused EIP. Trusted Advisor flags them.
- Spot for stateless batch — up to 90% off for fault-tolerant workloads.
- CloudFront in front of S3 — cheaper egress + global performance + free SSL.
- Graviton (ARM): up to 40% better price-performance on t4g / m7g / c7g / r7g.
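The saving from the night-shutdown item above comes down to running hours per week. A quick arithmetic sketch, assuming instances run 7am to 7pm; it counts compute hours only, since EBS volumes and RDS storage keep billing while instances are stopped.

```python
HOURS_PER_WEEK = 7 * 24


def running_hours_per_week(start_hour=7, stop_hour=19, weekdays_only=True) -> int:
    """Hours per week the dev fleet is actually up."""
    days = 5 if weekdays_only else 7
    return (stop_hour - start_hour) * days


def savings_fraction(**schedule) -> float:
    """Share of always-on compute hours avoided by the schedule."""
    return 1 - running_hours_per_week(**schedule) / HOURS_PER_WEEK


if __name__ == "__main__":
    print(f"nights + weekends off: {savings_fraction():.0%}")
    print(f"nights off only: {savings_fraction(weekdays_only=False):.0%}")
```

Stopping only at night halves the bill; the extra saving comes from keeping dev off all weekend.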
🔒 Security Checklist — 12 rules
Go through this on every new AWS account before handing it to developers.
- Root user: hardware MFA, no access keys, locked in a safe.
- Enable MFA for every IAM user; require it for privileged actions.
- S3 Block Public Access at the account level — master switch.
- EBS encryption by default at the region level.
- CloudTrail multi-region, log-file validation, delivered to a separate logs account.
- GuardDuty enabled in every region; findings → Security Hub.
- IAM Access Analyzer to find unintended public/cross-account access.
- Use roles, not users — federate via IAM Identity Center; workloads via instance profiles / OIDC.
- Permissions boundaries for developer-created roles; SCPs at the OU level.
- Config enabled, with the AWS Foundational Security Best Practices standard turned on in Security Hub (its checks rely on Config).
- No SSH to the world — use SSM Session Manager instead.
- Secrets in Secrets Manager with rotation; never in env vars, Lambda config, or code.
🔧 Common Errors → Fixes
The errors you'll actually hit. Straight to the fix.
Terraform "Error acquiring the state lock"
# Check who holds the lock (LockID is <bucket>/<key>; the "-md5" item only stores the state checksum)
aws dynamodb get-item --table-name tfstate-locks --key '{"LockID":{"S":"mycompany-tfstate/prod/main.tfstate"}}'
# If you're SURE nothing is running (check CI + teammates first)
terraform force-unlock <lock-id>
"AccessDenied" on S3 despite IAM allow
- Check the bucket policy — resource policy can explicitly Deny.
- Check SCPs — parent OU may block the action.
- Check Permissions boundary on the role — may cap permissions.
- Check KMS key policy if the object is SSE-KMS.
- Test with IAM Policy Simulator before debugging further.
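The checklist above follows the order in which the policy layers can block a request. A deliberately simplified sketch of same-account evaluation; `first_blocker` and its flags are hypothetical, and it ignores session policies, cross-account access, and the edge cases where a resource policy names a user or session directly.

```python
from typing import Optional


def first_blocker(
    *,
    explicit_deny: bool,
    scp_allows: bool,
    boundary_allows: bool,
    identity_allows: bool,
    bucket_policy_allows: bool,
    sse_kms: bool = False,
    kms_key_allows: bool = True,
) -> Optional[str]:
    """Return the first layer that would block the request, or None if it passes."""
    if explicit_deny:
        return "explicit Deny"  # a Deny in any policy always wins
    if not scp_allows:
        return "SCP"
    if not boundary_allows:
        return "permissions boundary"
    if not (identity_allows or bucket_policy_allows):
        return "no Allow in identity or bucket policy"
    if sse_kms and not kms_key_allows:
        return "KMS key policy"
    return None


if __name__ == "__main__":
    # IAM allows the read, but the object is SSE-KMS and the key policy does not.
    print(first_blocker(
        explicit_deny=False, scp_allows=True, boundary_allows=True,
        identity_allows=True, bucket_policy_allows=False,
        sse_kms=True, kms_key_allows=False,
    ))
```

The point of the model: an IAM Allow is necessary but never sufficient, so work down the layers instead of re-reading the identity policy.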
Pod stuck in "Pending" or "ImagePullBackOff"
kubectl describe pod <pod> # read Events at the bottom
# Common fixes:
# - Pending → nodes full; kubectl top nodes + kubectl describe node
# - ImagePullBackOff → ECR auth; kubectl create secret docker-registry regcred ...
# - CreateContainerConfigError → secret/configmap referenced doesn't exist
"Too many open connections" on RDS
- Put RDS Proxy in front — pools connections across Lambda/app instances.
- Lower idle timeouts in application DB client config.
- If Lambda: init the client outside the handler so it's reused between invocations.
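The Lambda tip is about where the connection is created. A minimal sketch of the pattern, using the standard-library `sqlite3` module purely as a stand-in for your real database driver; `CONNECT_CALLS` exists only to make the reuse visible.

```python
import sqlite3

CONNECT_CALLS = 0  # instrumentation for this example only


def _connect():
    """Open a connection; sqlite3 stands in for a real Postgres/MySQL client."""
    global CONNECT_CALLS
    CONNECT_CALLS += 1
    return sqlite3.connect(":memory:")


# Module scope runs once per execution environment (cold start),
# so warm invocations reuse this connection instead of opening a new one.
_connection = _connect()


def handler(event, context):
    row = _connection.execute("SELECT 1").fetchone()
    return {"ok": row[0] == 1, "connect_calls": CONNECT_CALLS}


if __name__ == "__main__":
    handler({}, None)
    print(handler({}, None))
```

Two invocations, one connection. Creating the client inside `handler` instead would open a new connection on every request.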
High NAT Gateway bill
- Add S3 and DynamoDB Gateway VPC Endpoints — free and remove bulk of NAT data charges.
- Use Interface Endpoints for Secrets Manager, STS, ECR, CloudWatch Logs: ~$7/mo per endpoint per AZ, but it saves on NAT data processing.
- In dev: use a single-AZ NAT (1 gateway instead of 3) by setting single_nat_gateway = true in the TF VPC module.
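Whether an interface endpoint pays for itself depends on how much traffic it takes off the NAT Gateway. A back-of-the-envelope sketch; the prices below are assumed us-east-1 list prices and will differ by region and over time, so substitute your own.

```python
# Assumed list prices (USD); check current pricing for your region.
NAT_PER_GB = 0.045          # NAT Gateway data processing
ENDPOINT_PER_HOUR = 0.01    # interface endpoint, per AZ
ENDPOINT_PER_GB = 0.01      # interface endpoint data processing
HOURS_PER_MONTH = 730


def interface_endpoint_monthly(gb: float, azs: int = 1) -> float:
    """Monthly cost of one interface endpoint deployed in `azs` zones."""
    return ENDPOINT_PER_HOUR * HOURS_PER_MONTH * azs + ENDPOINT_PER_GB * gb


def nat_data_monthly(gb: float) -> float:
    """Monthly NAT data-processing cost for the same traffic."""
    return NAT_PER_GB * gb


def break_even_gb(azs: int = 1) -> float:
    """Monthly traffic above which the endpoint is cheaper than NAT processing."""
    fixed = ENDPOINT_PER_HOUR * HOURS_PER_MONTH * azs
    return fixed / (NAT_PER_GB - ENDPOINT_PER_GB)


if __name__ == "__main__":
    print(f"break-even, 1 AZ: {break_even_gb(1):.0f} GB/month")
    print(f"break-even, 3 AZs: {break_even_gb(3):.0f} GB/month")
```

Under these assumptions an endpoint in one AZ breaks even at roughly 200 GB per month, which is why low-traffic services are often left on the NAT path.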
EKS IAM auth failing ("You must be logged in to the server")
# Re-fetch kubeconfig
aws eks update-kubeconfig --name mycluster --region us-east-1
# Check aws-auth ConfigMap: does the role/user exist?
kubectl -n kube-system get cm aws-auth -o yaml
# Add a new IAM role to mapRoles (requires cluster admin access)
kubectl -n kube-system edit cm aws-auth