~80% AWS Cost Reduction Through PHP-FPM Tuning, Query Optimization & CDN Offloading
Our SaaS platform served 1000+ customer subdomains from a fleet of EC2 instances. AWS costs kept climbing — new instances were being added to handle load, but nobody had documented why the existing ones were struggling. There was no centralised runbook, no historical tuning record, and no AI tooling to consult. Everything had to be diagnosed from scratch using production signals.
This is a step-by-step account of how I found the root cause, fixed it at every layer, and ultimately eliminated the need for most of the additional capacity.
- Environment: Multi-tenant SaaS platform · AWS EC2 (Ubuntu) · PHP-FPM + Nginx · MySQL RDS · Redis ElastiCache · CloudFront
- Outcome: ~80% reduction in AWS infrastructure cost through systematic diagnosis, process tuning, query optimization, caching, and CDN offloading
Phase 1 — The Alert That Started Everything
What triggered the investigation
CloudWatch alarms fired for high memory utilisation (>85%) and elevated CPU on multiple EC2 instances. The instances were t3.medium (2 vCPU, 4GB RAM). We were scaling out to handle load rather than understanding what was consuming the resources.
The first thing I did was SSH into the instance and run top:
top - 14:32:01 up 3 days, 2:14, 1 user, load average: 6.42, 5.87, 4.91
Tasks: 214 total, 2 running, 212 sleeping
%Cpu(s): 78.4 us, 12.1 sy, 0.0 ni, 5.1 id, 4.4 wa, 0.0 hi, 0.0 si
MiB Mem : 3843.5 total, 128.4 free, 3601.2 used, 113.9 buff/cache
MiB Swap: 2048.0 total, 1204.8 free, 843.2 used. ...
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3421 www-data 20 0 412648 52312 8124 S 4.2 1.4 0:12.33 php-fpm: pool www
3422 www-data 20 0 409312 51008 8100 S 3.8 1.3 0:11.87 php-fpm: pool www
3423 www-data 20 0 418920 53120 8208 S 3.7 1.4 0:13.01 php-fpm: pool www
3424 www-data 20 0 411232 50944 8092 S 3.9 1.3 0:10.99 php-fpm: pool www
... (60+ identical lines)
What this output told me
- Load average of 6.42 on a 2-core machine — the system was handling 3× more work than it could process at once. Anything above the number of CPU cores means processes are waiting in queue.
- 3601 MB used out of 3843 MB — nearly all physical RAM was consumed.
- 843 MB of swap in use — the OS had started evicting memory to disk. Swap I/O is orders of magnitude slower than RAM. This alone explains degraded response times.
- 60+ php-fpm: pool www processes each consuming ~50MB RSS — that’s 3GB of RAM from PHP-FPM alone on a 4GB instance.
- 4.4% iowait (wa) — CPU was sitting idle waiting for disk I/O, which swap usage was amplifying.
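One caveat worth checking at this point: "swap used" can be stale pages left over from an earlier spike rather than active thrashing. A quick way to confirm the box is swapping right now is to sample the kernel's cumulative swap counters a few seconds apart:

```shell
# Cumulative pages swapped in/out since boot. Sample this twice a few
# seconds apart - if the numbers are climbing, the box is actively
# swapping, not just holding stale pages in swap.
grep -E '^(pswpin|pswpout) ' /proc/vmstat
```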
The immediate question was: why are there 60+ PHP-FPM workers running simultaneously?
Phase 2 — Understanding Why So Many PHP-FPM Workers Existed
How PHP-FPM worker pools work
PHP-FPM uses a process manager (pm) to spawn and manage worker processes. In dynamic mode, PHP-FPM spawns workers on demand up to a configured pm.max_children ceiling. Each worker handles one request at a time.
If a request takes a long time to complete (e.g., because it’s waiting on a slow database query), that worker is blocked for the entire duration. Nginx sees the worker is busy and routes the next incoming request to a different worker. PHP-FPM spawns more workers to keep up. This compounds until pm.max_children is hit — at which point requests begin queuing, timeouts occur, and users see 502/504 errors.
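The pile-up follows directly from Little's law: concurrent workers ≈ arrival rate × request duration. A quick sketch with illustrative numbers (these are not measurements from this incident):

```shell
# Little's law sketch: workers in flight ~= arrival rate x request duration.
# Hypothetical numbers for illustration only.
rate=3        # requests per second reaching PHP-FPM
duration=20   # seconds each request is held open by a slow query
echo "workers in flight ~= $((rate * duration))"
```

At 3 requests/second and 20 seconds per request, ~60 workers are tied up at any moment. The same 3 requests/second at a healthy 200ms per request needs fewer than one worker on average.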
The workers weren’t the root problem. The slow requests were.
Verifying that workers were stuck — not just idle
# Check how many PHP-FPM processes are actively running vs sleeping
ps aux | grep php-fpm | grep -v grep | awk '{print $8}' | sort | uniq -c
Output:
4 R # Running - actively on CPU
58 S # Sleeping - waiting on I/O (database, disk, network)
58 workers were asleep — blocked waiting for something to return. That something was the database.
# See what each worker's request looks like (requires php-fpm status page enabled)
curl http://127.0.0.1/fpm-status?full
pool: www
process manager: dynamic
start time: 06/Apr/2025:11:18:02 +0000
accepted conn: 48291
listen queue: 12 ← requests waiting for a free worker
max listen queue: 47
idle processes: 2
active processes: 60
total processes: 62
process 3421
request uri: /api/v1/orders/list
request duration: 18342194 ← microseconds, ≈18 seconds
last request cpu: 0.00
last request memory: 52428800
process 3422
request uri: /api/v1/reports/generate
request duration: 24107881 ← microseconds, ≈24 seconds
18 to 24-second requests. Normal API responses should complete in under 200ms. Something in the application was extremely slow, and it wasn’t the PHP code itself — it was the database.
Phase 3 — Database Investigation
Step 1: Check live database connections
mysql -u admin -p -h <rds-endpoint>
SHOW PROCESSLIST;
+-----+---------+------------------+---------+---------+------+--------------+----------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+-----+---------+------------------+---------+---------+------+--------------+----------------------------------------------+
| 101 | appuser | 10.0.1.45:49812 | saas_db | Query | 22 | Sending data | SELECT * FROM tenant_orders WHERE tenant_id..|
| 102 | appuser | 10.0.1.45:49813 | saas_db | Query | 19 | Sending data | SELECT * FROM tenant_orders WHERE tenant_id..|
| 103 | appuser | 10.0.1.45:49819 | saas_db | Query | 31 | Sending data | SELECT r.*, u.name, o.* FROM reports r JOIN ..|
| 104 | appuser | 10.0.1.45:49821 | saas_db | Query | 3 | statistics | SELECT COUNT(*) FROM audit_logs WHERE ... |
+-----+---------+------------------+---------+---------+------+--------------+----------------------------------------------+
What to look for in SHOW PROCESSLIST
- Time column — anything above 2-3 seconds on an OLTP (transactional) database is a warning. Anything above 10 seconds is a serious problem.
- State: “Sending data” — the query has executed and is now streaming a large result set back to the application. Typically indicates a missing LIMIT clause or a missing index causing too many rows to be returned.
- State: “statistics” — MySQL is calculating the query execution plan. This happens before the query runs. A long time here means a complex join with no clear index path.
- State: “Locked” — another query holds a lock that this query is waiting for. This causes cascading latency across all workers.
Multiple queries in the 20-30 second range, all from the same application host — confirmed the database was the bottleneck.
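Rather than eyeballing the full processlist during an incident, the long-runners can be filtered out of the tab-separated client output. A small sketch (it assumes the stock `mysql` client's `-N -B` batch output, where Time is field 6 and Info is field 8):

```shell
# Print queries running longer than N seconds (default 10), longest first.
# Pipe in the output of: mysql -h <rds-endpoint> -N -B -e "SHOW PROCESSLIST"
slowest_queries() {
  awk -F'\t' -v min="${1:-10}" '$6+0 > min { print $6 "s\t" $8 }' | sort -rn
}
```

Usage: `mysql -h <rds-endpoint> -N -B -e "SHOW PROCESSLIST" | slowest_queries 10`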
Step 2: Enable and read the slow query log
On RDS, this is done via a parameter group:
slow_query_log = 1
long_query_time = 1 # Flag queries taking longer than 1 second
log_queries_not_using_indexes = 1 # Flag full table scans regardless of duration
After enabling and waiting several minutes:
# On an EC2-based MySQL, you'd run:
tail -f /var/log/mysql/slow.log
# Time: 2025-04-06T14:22:11.832847Z
# User@Host: appuser[appuser] @ 10.0.1.45 [10.0.1.45]
# Query_time: 21.384710 Lock_time: 0.000182 Rows_sent: 14823 Rows_examined: 2847291
SET timestamp=1712416931;
SELECT * FROM tenant_orders WHERE tenant_id = 'abc-123' ORDER BY created_at DESC;
What this entry reveals
- Query_time: 21.38 — 21 seconds to complete.
- Rows_examined: 2,847,291 — MySQL scanned nearly 3 million rows.
- Rows_sent: 14,823 — but it returned only 14,823 rows.
MySQL read 2.8 million rows to return 14,823. It was doing a full table scan. No index on tenant_id.
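The examined-to-sent ratio is a useful single number for how wasteful a query is — a well-indexed lookup sits near 1×. A helper sketch that extracts it from a slow-log header line like the one above:

```shell
# Rows_examined / Rows_sent from a slow-log "# Query_time" header line.
scan_ratio() {
  awk '{ for (i = 1; i <= NF; i++) {
           if ($i == "Rows_sent:")     s = $(i+1)
           if ($i == "Rows_examined:") e = $(i+1)
         }
         if (s > 0) printf "%.0fx\n", e / s }'
}
```

Usage: `grep '^# Query_time' slow.log | scan_ratio` — the entry above comes out at roughly 192x.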
Step 3: Confirm with EXPLAIN
EXPLAIN SELECT * FROM tenant_orders WHERE tenant_id = 'abc-123' ORDER BY created_at DESC;
+----+-------------+----------------+------+---------------+------+---------+------+---------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------+------+---------------+------+---------+------+---------+----------------+
| 1 | SIMPLE | tenant_orders | ALL | NULL | NULL | NULL | NULL | 2756890 | Using filesort |
+----+-------------+----------------+------+---------------+------+---------+------+---------+----------------+
| Column | This value | What it means |
|---|---|---|
| type | ALL | Full table scan — worst possible. MySQL reads every row. Target: ref or range. |
| possible_keys | NULL | No index could even be considered for this query. |
| key | NULL | No index was used. |
| rows | 2,756,890 | MySQL estimates it must examine 2.7M rows. |
| Extra | Using filesort | Results are sorted in memory/disk after the scan. Expensive. |
The fix — add a composite index
-- Index on tenant_id for filtering + created_at for the sort.
-- Note: descending index keys are only honoured on MySQL 8.0+;
-- on older versions the DESC keyword is parsed but ignored.
ALTER TABLE tenant_orders ADD INDEX idx_tenant_created (tenant_id, created_at DESC);
After the index was added, running EXPLAIN again:
+----+-------------+----------------+------+----------------------+----------------------+---------+-------+-------+---------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------+------+----------------------+----------------------+---------+-------+-------+---------+
| 1 | SIMPLE | tenant_orders | ref | idx_tenant_created | idx_tenant_created | 767 | const | 15203 | NULL |
+----+-------------+----------------+------+----------------------+----------------------+---------+-------+-------+---------+
- type: ref — using the index to go directly to matching rows.
- rows: 15,203 — down from 2.7 million.
- Query time dropped from 21 seconds to under 80ms.
With query times back to normal, PHP-FPM workers were no longer stuck. The pile-up of 60+ workers dissolved to 8-10 under normal load.
Phase 4 — PHP-FPM Pool Tuning: The Formula
Even after fixing the queries, the pm.max_children value was still set to the platform default (which was arbitrarily high). Without a calculated ceiling, the system could still over-spawn workers during traffic spikes and exhaust RAM, triggering swap and the same degradation.
How to calculate pm.max_children correctly
The formula:
pm.max_children = (Available RAM for PHP-FPM) / (Average memory per PHP-FPM worker)
Step 1: Find average memory per worker
ps --no-headers -o "rss,cmd" -C php-fpm | awk '{ sum+=$1; count++ } END { printf "Avg: %d KB (%.0f MB)\n", sum/count, sum/count/1024 }'
Avg: 51840 KB (51 MB)
Step 2: Determine available RAM for PHP-FPM
free -m
total used free shared buff/cache available
Mem: 3843 412 312 84 3119 3116
Swap: 2048 0 2048
On this instance (4GB RAM):
- OS and system services (Nginx, SSH, monitoring agents): ~400MB reserved
- Available for PHP-FPM: 3843 - 400 = 3443 MB
pm.max_children = 3443 MB / 51 MB = 67
Apply a safety buffer of ~20% to leave room for traffic spikes and OS page cache:
pm.max_children = 67 × 0.80 ≈ 54 (round down to 50 for a clean config)
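The two measurement steps and the buffer can be wrapped into one reusable sketch. The numbers below are the ones measured in this phase; the 400MB reserve is an assumption you should re-measure on your own host (integer arithmetic floors the result, which is the conservative direction):

```shell
# pm.max_children = (total RAM - reserved for OS) * 0.80 / avg worker RSS
calc_max_children() {
  local total_mb=$1 reserved_mb=$2 avg_worker_mb=$3
  echo $(( (total_mb - reserved_mb) * 80 / 100 / avg_worker_mb ))
}

calc_max_children 3843 400 51   # prints 54
```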
Step 3: Derive the remaining pool settings
pm = dynamic
; Maximum number of workers - never exceed this
pm.max_children = 50
; Workers alive at startup
pm.start_servers = 10 ; floor((min_spare + max_spare) / 2)
; Minimum idle workers kept warm to prevent cold starts
pm.min_spare_servers = 5 ; ~10% of max_children
; Maximum idle workers - PHP-FPM kills workers above this to free RAM
pm.max_spare_servers = 15 ; ~30% of max_children
; Recycle workers after N requests to prevent memory leaks accumulating
pm.max_requests = 500
Why these values matter
- pm.min_spare_servers = 5 — if you set this too low (e.g., 1), every traffic burst requires spawning new workers from scratch, introducing latency (cold starts). Keep idle workers ready.
- pm.max_spare_servers = 15 — if this is too high, you waste RAM on idle workers during quiet periods. PHP-FPM kills excess idle workers automatically when traffic drops.
- pm.max_requests = 500 — PHP has memory leaks in long-running processes. Recycling workers at 500 requests prevents gradual memory growth that would otherwise require instance restarts at 3am.
Apply the config:
vim /etc/php/8.1/fpm/pool.d/www.conf
# Edit the above values
# Validate configuration before reloading
php-fpm8.1 -t
# Reload without dropping connections (the Ubuntu service name is php8.1-fpm)
systemctl reload php8.1-fpm
Phase 5 — Disk I/O: The Silent Contributor
Even with queries fixed and pool tuned, there was still elevated iowait at peak traffic. Disk I/O was a contributing factor — particularly on the original gp2 EBS volumes.
Reading I/O metrics
iostat -xz 1 5
Device r/s w/s rkB/s wkB/s await r_await w_await %util
nvme0n1 0.00 312.42 0.00 2748.16 48.12 0.00 48.12 96.18
| Metric | Value | Meaning |
|---|---|---|
| r/s / w/s | 0 / 312 | Reads and writes per second. Here: write-heavy (logs, sessions, opcache). |
| await | 48.12ms | Average time (ms) each I/O request waits to be served. >20ms is a problem for a database or session workload. Healthy is 1-5ms. |
| %util | 96.18% | The disk is busy 96% of the time — it’s saturated. >80% sustained = bottleneck. |
Root causes found
- PHP sessions stored on disk. Every request read and wrote a session file to EBS. At 50 concurrent workers × hundreds of requests per second = thousands of session file I/Os.
- gp2 EBS IOPS baseline was only 300 IOPS (3 IOPS/GB on a 100GB volume). The workload required far more.
Fix 1 — Migrate EBS from gp2 to gp3
# gp3 provides 3000 IOPS baseline with no size dependency, at lower cost
aws ec2 modify-volume \
--volume-id vol-0abc123def456 \
--volume-type gp3 \
--iops 3000 \
--throughput 125
This was a live migration — zero downtime, zero data loss. gp3 is also ~20% cheaper than gp2 at equivalent size.
Fix 2 — Move PHP sessions to Redis (see Phase 6).
Phase 6 — Redis: Offloading Sessions and Application Cache
Why Redis eliminates two problems at once
Storing sessions on disk creates I/O contention (shown above). Storing them in a database creates query overhead. Redis — an in-memory key-value store — solves both.
Configure PHP to use Redis for sessions
; /etc/php/8.1/fpm/php.ini
session.save_handler = redis
session.save_path = "tcp://your-elasticache-endpoint:6379"
Verify sessions are landing in Redis
redis-cli -h your-elasticache-endpoint -p 6379
SCAN 0 MATCH PHPREDIS_SESSION:* COUNT 100
# Returns a cursor plus a batch of active session keys.
# (Prefer SCAN over KEYS in production - KEYS blocks the server
#  while it walks the entire keyspace.)
TTL PHPREDIS_SESSION:abc123def456
# Returns remaining TTL in seconds - confirms expiry is working
Application-level caching — cache expensive query results
// Before: hits MySQL on every request
$orders = $db->query("SELECT * FROM tenant_orders WHERE tenant_id = ?", $tenantId);
// After: check Redis first, fall back to MySQL, cache for 5 minutes
$cacheKey = "tenant_orders:{$tenantId}";
$cached = $redis->get($cacheKey); // phpredis returns false on a cache miss
if ($cached !== false) {
$orders = json_decode($cached, true);
} else {
$orders = $db->query("SELECT * FROM tenant_orders WHERE tenant_id = ?", $tenantId);
$redis->setex($cacheKey, 300, json_encode($orders)); // TTL: 300 seconds
}
Monitor Redis hit rate to confirm caching is effective
redis-cli -h your-elasticache-endpoint INFO stats | grep -E "keyspace_hits|keyspace_misses"
keyspace_hits:4821033
keyspace_misses:143291
Hit rate = 4821033 / (4821033 + 143291) = 97.1% — 97% of data requests served from memory, never touching the database.
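To track the hit rate over time, the ratio can be computed straight from the INFO output. A sketch (the field names are the standard `INFO stats` keys):

```shell
# Keyspace hit rate from `redis-cli INFO stats` output.
hit_rate() {
  awk -F: '/^keyspace_hits:/   { h = $2 }
           /^keyspace_misses:/ { m = $2 }
           END { printf "%.1f%%\n", 100 * h / (h + m) }'
}
```

Usage: `redis-cli -h your-elasticache-endpoint INFO stats | hit_rate`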
Phase 7 — File Permissions: The Hidden Error Amplifier
During log analysis, I noticed thousands of lines like this in the Nginx error log:
tail -f /var/log/nginx/error.log
2025/04/06 14:38:02 [error] 3421#3421: *48291 connect() to unix:/run/php/php8.1-fpm.sock failed (13: Permission denied)
2025/04/06 14:38:02 [error] 3421#3421: *48292 connect() to unix:/run/php/php8.1-fpm.sock failed (13: Permission denied)
Every one of these errors caused Nginx to attempt retries. Retries held open connections. Held-open connections consumed memory and file descriptors. At volume, this was silently degrading the system and inflating error rates in CloudWatch.
Diagnose
# Check socket permissions
ls -la /run/php/php8.1-fpm.sock
srw-rw---- 1 www-data www-data 0 Apr 6 11:18 /run/php/php8.1-fpm.sock
The socket is owned by www-data:www-data with mode 660. Nginx runs as nginx user — it has no permission to connect.
Fix — set correct owner and mode in pool config
; /etc/php/8.1/fpm/pool.d/www.conf
listen.owner = nginx
listen.group = nginx
listen.mode = 0660
systemctl reload php8.1-fpm
ls -la /run/php/php8.1-fpm.sock
srw-rw---- 1 nginx nginx 0 Apr 6 14:41 /run/php/php8.1-fpm.sock
Errors stopped immediately. Retry overhead disappeared.
Phase 8 — CloudFront CDN: Eliminating Origin Load for Static Content
Before CDN, every request for images, CSS, JavaScript, and downloadable files hit the EC2 origin directly, consuming PHP-FPM workers, EC2 CPU, and RDS bandwidth even for completely static content.
Traffic breakdown before CDN
Using Nginx access logs to understand what the origin was serving:
awk '{print $7}' /var/log/nginx/access.log | \
sed 's/\?.*$//' | \
grep -oE '\.(jpg|jpeg|png|gif|css|js|pdf|woff2)$' | \
sort | uniq -c | sort -rn | head -10
48291 .jpg
31204 .css
28841 .js
12047 .png
4821 .pdf
2103 .woff2
Over 127,000 requests for static assets hitting the origin per log rotation period — none of them needed to touch EC2 or the database at all.
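Turning those raw counts into a share of total traffic makes the case for a CDN concrete. A sketch over the same log format as above (field 7 is the request path in the default Nginx combined log):

```shell
# Percentage of requests in an Nginx access log that are static assets.
static_share() {
  awk '{ print $7 }' "$1" | sed 's/\?.*$//' \
    | awk '/\.(jpg|jpeg|png|gif|css|js|pdf|woff2)$/ { s++ }
           END { printf "%.0f%% static\n", 100 * s / NR }'
}
```

Usage: `static_share /var/log/nginx/access.log`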
CloudFront configuration
# Create a CloudFront distribution via the AWS CLI. The shorthand below
# creates a basic distribution in front of the ALB; cache behaviours and
# cache policies (e.g. the managed CachingOptimized policy,
# id 658327ea-f89d-4fab-a63d-7e88639e58f6) are attached via the full
# --distribution-config file://distribution.json form.
aws cloudfront create-distribution \
  --origin-domain-name your-alb.us-east-1.elb.amazonaws.com
Cache behaviour rules configured
| Path pattern | Cache TTL | Rationale |
|---|---|---|
| *.jpg, *.png, *.gif | 86400s (24h) | Images don’t change without a filename change |
| *.css, *.js | 3600s (1h) | Versioned by deploy — safe to cache |
| *.pdf | 86400s (24h) | Downloadable files, static |
| /api/* | 0s (no cache) | Dynamic content, must reach origin |
| /admin/* | 0s (no cache) | Authenticated pages, must reach origin |
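CloudFront respects origin Cache-Control headers, so the TTLs above should be backed by matching headers at the Nginx origin. A sketch (the location patterns and lifetimes are illustrative, not the exact production config):

```
# Nginx origin: mark static assets as publicly cacheable
location ~* \.(jpg|jpeg|png|gif|pdf)$ {
    add_header Cache-Control "public, max-age=86400";   # 24h
}
location ~* \.(css|js)$ {
    add_header Cache-Control "public, max-age=3600";    # 1h, versioned by deploy
}
```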
Verify cache is working
curl -I https://yourplatform.com/assets/logo.png
HTTP/2 200
x-cache: Hit from cloudfront ← served from edge, never touched EC2
age: 42831 ← cached for 42831 seconds (nearly 12 hours)
After CloudFront was configured, over 85% of all requests were served from CloudFront edge nodes. Origin traffic dropped dramatically, reducing EC2 CPU, Nginx worker usage, and RDS connection pool pressure simultaneously.
Phase 9 — Swap Configuration (Prevent OOM Kills)
With RAM now properly sized and PHP-FPM capped, swap was no longer being used for overflow — but it still needed to be correctly configured as a safety net.
# Check current swap and swappiness
free -m
cat /proc/sys/vm/swappiness
# Default is often 60 - too aggressive for a server workload
60
vm.swappiness is not a hard threshold — it biases the kernel's memory reclaim between swapping out anonymous (application) pages and dropping page cache. At the default of 60, the kernel swaps application memory fairly eagerly even while plenty of RAM is still reclaimable, which on an application server means unnecessary disk I/O. Lower it:
# Apply immediately
sudo sysctl vm.swappiness=10
# Make permanent
echo "vm.swappiness=10" | sudo tee -a /etc/sysctl.conf
A value of 10 tells the kernel: prefer to use RAM, only use swap when really necessary. This kept the disk I/O low during normal operations and reserved swap as a genuine last resort.
Results Summary
| Metric | Before | After |
|---|---|---|
| PHP-FPM workers (concurrent) | 60-70 | 8-12 |
| Avg PHP-FPM query-induced wait | 18-24 seconds | <200ms |
| RAM used (4GB instance) | 3.6 GB (93%) | 1.4 GB (36%) |
| Swap used | 843 MB | 0 MB |
| EBS disk await | 48ms (saturated) | 3ms |
| DB queries on hot paths | Full table scan (2.8M rows) | Index scan (15K rows) |
| Redis cache hit rate | 0% (no cache) | 97.1% |
| CloudFront cache hit rate | 0% (no CDN) | 85%+ |
| EC2 instances required | 6 | 2 (with headroom) |
| AWS monthly cost | Baseline | ~80% lower |
Key Takeaways
1. Always trace from symptom to root cause before scaling
Adding EC2 instances to handle load was masking the problem. The real issue was slow queries holding PHP-FPM workers hostage. Scaling out would have made the database worse, not better.
2. EXPLAIN is the most important tool in database debugging
type: ALL and key: NULL in EXPLAIN output means full table scan — it will destroy performance at scale. Add the index, run EXPLAIN again, confirm type: ref.
3. PHP-FPM pool configuration must be calculated, not guessed
The formula max_children = available_RAM / avg_worker_RAM × 0.80 is simple but it must be grounded in real measurements from your own server with ps — not copied from a blog post.
4. Permissions errors are never minor
A single Permission denied on a socket generates retry loops, wastes connections, and compounds across thousands of requests per minute. Check dmesg and Nginx error logs regularly.
5. CDN is free EC2 capacity
Every static asset served by CloudFront is a request that never touches your server. For a SaaS platform with 1000+ tenants each requesting assets, this is enormous. Configure it on day one.
6. Redis does double duty
Moving sessions off disk simultaneously fixes I/O saturation and eliminates a class of race conditions in multi-worker environments (where two workers might write to the same session file). The cache layer on top reduces database load further.
Frequently Asked Questions
How do you calculate the correct pm.max_children value for PHP-FPM?
Use the formula: pm.max_children = (Available RAM for PHP-FPM) / (Average memory per PHP-FPM worker). Measure average worker memory with ps --no-headers -o "rss,cmd" -C php-fpm, subtract OS overhead from total RAM, divide, then apply a ~20% safety buffer for traffic spikes and OS page cache.
How do you find slow MySQL queries in production?
Enable the MySQL slow query log with slow_query_log=1 and long_query_time=1. Also enable log_queries_not_using_indexes to catch full table scans. Use SHOW PROCESSLIST to see live queries and EXPLAIN to analyze query execution plans. Look for type: ALL (full table scan) and key: NULL (no index used).
Why are PHP-FPM workers stuck in sleeping state?
Sleeping PHP-FPM workers are blocked waiting for I/O — usually a slow database query. When a query takes 20+ seconds, the worker is held hostage for the entire duration. PHP-FPM spawns more workers to handle incoming requests, which consumes all available RAM and triggers swap usage.
Should I use gp2 or gp3 EBS volumes on AWS?
Use gp3. It provides 3000 IOPS baseline regardless of volume size (gp2 only gives 3 IOPS/GB), costs ~20% less, and supports live migration with zero downtime using aws ec2 modify-volume. There is no reason to use gp2 for new workloads.
How does Redis reduce PHP-FPM worker count?
Redis serves two roles: storing PHP sessions (eliminating disk I/O from session file reads/writes) and caching expensive database queries. With a 97% cache hit rate, most requests never touch MySQL, so workers complete in milliseconds instead of seconds, drastically reducing concurrent worker count.
Related Guides
How to Reduce AWS Costs by Scheduling Dev and Staging Resources
Stop paying for dev and staging environments that run 24/7 when no one is using them.
How to Set Up a Production RDS Database on AWS with Terraform
Multi-AZ, automated backups, parameter groups, and security groups done right.