~80% AWS Cost Reduction Through PHP-FPM Tuning, Query Optimization & CDN Offloading
Our SaaS platform served 1000+ customer subdomains from a fleet of EC2 instances. AWS costs kept climbing — new instances were being added to handle load, but nobody had documented why the existing ones were struggling. There was no centralised runbook, no historical tuning record, and no AI tooling to consult. Everything had to be diagnosed from scratch using production signals.
This is a step-by-step account of how I found the root cause, fixed it at every layer, and ultimately eliminated the need for most of the additional capacity.
- Environment: Multi-tenant SaaS platform · AWS EC2 (Ubuntu) · PHP-FPM + Nginx · MySQL RDS · Redis ElastiCache · CloudFront
- Outcome: ~80% reduction in AWS infrastructure cost through systematic diagnosis, process tuning, query optimization, caching, and CDN offloading
Phase 1 — The Alert That Started Everything
What triggered the investigation
CloudWatch alarms fired for high memory utilisation (>85%) and elevated CPU on multiple EC2 instances. The instances were t3.medium (2 vCPU, 4GB RAM). We were scaling out to handle load rather than understanding what was consuming the resources.
The first thing I did was SSH into the instance and run top:
top - 14:32:01 up 3 days, 2:14, 1 user, load average: 6.42, 5.87, 4.91
Tasks: 214 total, 2 running, 212 sleeping
%Cpu(s): 78.4 us, 12.1 sy, 0.0 ni, 5.1 id, 4.4 wa, 0.0 hi, 0.0 si
MiB Mem : 3843.5 total, 128.4 free, 3601.2 used, 113.9 buff/cache
MiB Swap: 2048.0 total, 1204.8 free, 843.2 used. ...
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3421 www-data 20 0 412648 52312 8124 S 4.2 1.4 0:12.33 php-fpm: pool www
3422 www-data 20 0 409312 51008 8100 S 3.8 1.3 0:11.87 php-fpm: pool www
3423 www-data 20 0 418920 53120 8208 S 3.7 1.4 0:13.01 php-fpm: pool www
3424 www-data 20 0 411232 50944 8092 S 3.9 1.3 0:10.99 php-fpm: pool www
... (60+ identical lines)
What this output told me
- Load average of 6.42 on a 2-core machine — the system was handling 3× more work than it could process at once. Anything above the number of CPU cores means processes are waiting in queue.
- 3601 MB used out of 3843 MB — nearly all physical RAM was consumed.
- 843 MB of swap in use — the OS had started evicting memory to disk. Swap I/O is orders of magnitude slower than RAM. This alone explains degraded response times.
- 60+ php-fpm: pool www processes each consuming ~50MB RSS — that’s 3GB of RAM from PHP-FPM alone on a 4GB instance.
- 4.4% iowait (wa) — CPU was sitting idle waiting for disk I/O, which swap usage was amplifying.
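One caveat worth checking at this point: "swap used" can be stale pages left over from an earlier spike rather than active thrashing. A quick way to confirm the box is swapping right now is to sample the kernel's cumulative swap counters a few seconds apart:

```shell
# Cumulative pages swapped in/out since boot. Sample this twice a few
# seconds apart - if the numbers are climbing, the box is actively
# swapping, not just holding stale pages in swap.
grep -E '^(pswpin|pswpout) ' /proc/vmstat
```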
The immediate question was: why are there 60+ PHP-FPM workers running simultaneously?
Phase 2 — Understanding Why So Many PHP-FPM Workers Existed
How PHP-FPM worker pools work
PHP-FPM uses a process manager (pm) to spawn and manage worker processes. In dynamic mode, PHP-FPM spawns workers on demand up to a configured pm.max_children ceiling. Each worker handles one request at a time.
If a request takes a long time to complete (e.g., because it’s waiting on a slow database query), that worker is blocked for the entire duration. Nginx sees the worker is busy and routes the next incoming request to a different worker. PHP-FPM spawns more workers to keep up. This compounds until pm.max_children is hit — at which point requests begin queuing, timeouts occur, and users see 502/504 errors.
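The pile-up follows directly from Little's law: concurrent workers ≈ arrival rate × request duration. A quick sketch with illustrative numbers (these are not measurements from this incident):

```shell
# Little's law sketch: workers in flight ~= arrival rate x request duration.
# Hypothetical numbers for illustration only.
rate=3        # requests per second reaching PHP-FPM
duration=20   # seconds each request is held open by a slow query
echo "workers in flight ~= $((rate * duration))"
```

At 3 requests/second and 20 seconds per request, ~60 workers are tied up at any moment. The same 3 requests/second at a healthy 200ms per request needs fewer than one worker on average.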
The workers weren’t the root problem. The slow requests were.
Verifying that workers were stuck — not just idle
# Check how many PHP-FPM processes are actively running vs sleeping
ps aux | grep php-fpm | grep -v grep | awk '{print $8}' | sort | uniq -c
Output:
4 R # Running - actively on CPU
58 S # Sleeping - waiting on I/O (database, disk, network)
58 workers were asleep — blocked waiting for something to return. That something was the database.
# See what each worker's request looks like (requires php-fpm status page enabled)
curl http://127.0.0.1/fpm-status?full
pool: www
process manager: dynamic
start time: 06/Apr/2025:11:18:02 +0000
accepted conn: 48291
listen queue: 12 ← requests waiting for a free worker
max listen queue: 47
idle processes: 2
active processes: 60
total processes: 62
process 3421
request uri: /api/v1/orders/list
request duration: 18342194 ← microseconds, ≈18 seconds
last request cpu: 0.00
last request memory: 52428800
process 3422
request uri: /api/v1/reports/generate
request duration: 24107881 ← microseconds, ≈24 seconds
18 to 24-second requests. Normal API responses should complete in under 200ms. Something in the application was extremely slow, and it wasn’t the PHP code itself — it was the database.
Phase 3 — Database Investigation
Step 1: Check live database connections
mysql -u admin -p -h <rds-endpoint>
SHOW PROCESSLIST;
+-----+---------+------------------+---------+---------+------+--------------+----------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+-----+---------+------------------+---------+---------+------+--------------+----------------------------------------------+
| 101 | appuser | 10.0.1.45:49812 | saas_db | Query | 22 | Sending data | SELECT * FROM tenant_orders WHERE tenant_id..|
| 102 | appuser | 10.0.1.45:49813 | saas_db | Query | 19 | Sending data | SELECT * FROM tenant_orders WHERE tenant_id..|
| 103 | appuser | 10.0.1.45:49819 | saas_db | Query | 31 | Sending data | SELECT r.*, u.name, o.* FROM reports r JOIN ..|
| 104 | appuser | 10.0.1.45:49821 | saas_db | Query | 3 | statistics | SELECT COUNT(*) FROM audit_logs WHERE ... |
+-----+---------+------------------+---------+---------+------+--------------+----------------------------------------------+
What to look for in SHOW PROCESSLIST
- Time column — anything above 2-3 seconds on an OLTP (transactional) database is a warning. Anything above 10 seconds is a serious problem.
- State: “Sending data” — the query has executed and is now streaming a large result set back to the application. Typically indicates a missing LIMIT clause or a missing index causing too many rows to be returned.
- State: “statistics” — MySQL is calculating the query execution plan. This happens before the query runs. A long time here means a complex join with no clear index path.
- State: “Locked” — another query holds a lock that this query is waiting for. This causes cascading latency across all workers.
Multiple queries in the 20-30 second range, all from the same application host — confirmed the database was the bottleneck.
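Rather than eyeballing the full processlist during an incident, the long-runners can be filtered out of the tab-separated client output. A small sketch (it assumes the stock `mysql` client's `-N -B` batch output, where Time is field 6 and Info is field 8):

```shell
# Print queries running longer than N seconds (default 10), longest first.
# Pipe in the output of: mysql -h <rds-endpoint> -N -B -e "SHOW PROCESSLIST"
slowest_queries() {
  awk -F'\t' -v min="${1:-10}" '$6+0 > min { print $6 "s\t" $8 }' | sort -rn
}
```

Usage: `mysql -h <rds-endpoint> -N -B -e "SHOW PROCESSLIST" | slowest_queries 10`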
Step 2: Enable and read the slow query log
On RDS, this is done via a parameter group:
slow_query_log = 1
long_query_time = 1 # Flag queries taking longer than 1 second
log_queries_not_using_indexes = 1 # Flag full table scans regardless of duration
After enabling and waiting several minutes:
# On an EC2-based MySQL, you'd run:
tail -f /var/log/mysql/slow.log
# Time: 2025-04-06T14:22:11.832847Z
# User@Host: appuser[appuser] @ 10.0.1.45 [10.0.1.45]
# Query_time: 21.384710 Lock_time: 0.000182 Rows_sent: 14823 Rows_examined: 2847291
SET timestamp=1712416931;
SELECT * FROM tenant_orders WHERE tenant_id = 'abc-123' ORDER BY created_at DESC;
What this entry reveals
- Query_time: 21.38 — 21 seconds to complete.
- Rows_examined: 2,847,291 — MySQL scanned nearly 3 million rows.
- Rows_sent: 14,823 — but it returned only 14,823 rows.
MySQL read 2.8 million rows to return 14,823. It was doing a full table scan. No index on tenant_id.
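The examined-to-sent ratio is a useful single number for how wasteful a query is — a well-indexed lookup sits near 1×. A helper sketch that extracts it from a slow-log header line like the one above:

```shell
# Rows_examined / Rows_sent from a slow-log "# Query_time" header line.
scan_ratio() {
  awk '{ for (i = 1; i <= NF; i++) {
           if ($i == "Rows_sent:")     s = $(i+1)
           if ($i == "Rows_examined:") e = $(i+1)
         }
         if (s > 0) printf "%.0fx\n", e / s }'
}
```

Usage: `grep '^# Query_time' slow.log | scan_ratio` — the entry above comes out at roughly 192x.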
Step 3: Confirm with EXPLAIN
EXPLAIN SELECT * FROM tenant_orders WHERE tenant_id = 'abc-123' ORDER BY created_at DESC;
+----+-------------+----------------+------+---------------+------+---------+------+---------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------+------+---------------+------+---------+------+---------+----------------+
| 1 | SIMPLE | tenant_orders | ALL | NULL | NULL | NULL | NULL | 2756890 | Using filesort |
+----+-------------+----------------+------+---------------+------+---------+------+---------+----------------+
| Column | This value | What it means |
|---|---|---|
| type | ALL | Full table scan — worst possible. MySQL reads every row. Target: ref or range. |
| possible_keys | NULL | No index could even be considered for this query. |
| key | NULL | No index was used. |
| rows | 2,756,890 | MySQL estimates it must examine 2.7M rows. |
| Extra | Using filesort | Results are sorted in memory/disk after the scan. Expensive. |
The fix — add a composite index
-- Index on tenant_id for filtering + created_at for the sort.
-- Note: descending index keys are only honoured on MySQL 8.0+;
-- on older versions the DESC keyword is parsed but ignored.
ALTER TABLE tenant_orders ADD INDEX idx_tenant_created (tenant_id, created_at DESC);
After the index was added, running EXPLAIN again:
+----+-------------+----------------+------+----------------------+----------------------+---------+-------+-------+---------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------+------+----------------------+----------------------+---------+-------+-------+---------+
| 1 | SIMPLE | tenant_orders | ref | idx_tenant_created | idx_tenant_created | 767 | const | 15203 | NULL |
+----+-------------+----------------+------+----------------------+----------------------+---------+-------+-------+---------+
- type: ref — using the index to go directly to matching rows.
- rows: 15,203 — down from 2.7 million.
- Query time dropped from 21 seconds to under 80ms.
With query times back to normal, PHP-FPM workers were no longer stuck. The pile-up of 60+ workers dissolved to 8-10 under normal load.
Phase 4 — PHP-FPM Pool Tuning: The Formula
Even after fixing the queries, the pm.max_children value was still set to the platform default (which was arbitrarily high). Without a calculated ceiling, the system could still over-spawn workers during traffic spikes and exhaust RAM, triggering swap and the same degradation.
How to calculate pm.max_children correctly
The formula:
pm.max_children = (Available RAM for PHP-FPM) / (Average memory per PHP-FPM worker)
Step 1: Find average memory per worker
ps --no-headers -o "rss,cmd" -C php-fpm | awk '{ sum+=$1; count++ } END { printf "Avg: %d KB (%.0f MB)\n", sum/count, sum/count/1024 }'
Avg: 51840 KB (51 MB)
Step 2: Determine available RAM for PHP-FPM
free -m
total used free shared buff/cache available
Mem: 3843 412 312 84 3119 3116
Swap: 2048 0 2048
On this instance (4GB RAM):
- OS and system services (Nginx, SSH, monitoring agents): ~400MB reserved
- Available for PHP-FPM: 3843 - 400 = 3443 MB
pm.max_children = 3443 MB / 51 MB = 67
Apply a safety buffer of ~20% to leave room for traffic spikes and OS page cache:
pm.max_children = 67 × 0.80 ≈ 54 (round down to 50 for a clean config)
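The two measurement steps and the buffer can be wrapped into one reusable sketch. The numbers below are the ones measured in this phase; the 400MB reserve is an assumption you should re-measure on your own host (integer arithmetic floors the result, which is the conservative direction):

```shell
# pm.max_children = (total RAM - reserved for OS) * 0.80 / avg worker RSS
calc_max_children() {
  local total_mb=$1 reserved_mb=$2 avg_worker_mb=$3
  echo $(( (total_mb - reserved_mb) * 80 / 100 / avg_worker_mb ))
}

calc_max_children 3843 400 51   # prints 54
```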
Step 3: Derive the remaining pool settings
pm = dynamic
; Maximum number of workers - never exceed this
pm.max_children = 50
; Workers alive at startup
pm.start_servers = 10 ; floor((min_spare + max_spare) / 2)
; Minimum idle workers kept warm to prevent cold starts
pm.min_spare_servers = 5 ; ~10% of max_children
; Maximum idle workers - PHP-FPM kills workers above this to free RAM
pm.max_spare_servers = 15 ; ~30% of max_children
; Recycle workers after N requests to prevent memory leaks accumulating
pm.max_requests = 500
Why these values matter
- pm.min_spare_servers = 5 — if you set this too low (e.g., 1), every traffic burst requires spawning new workers from scratch, introducing latency (cold starts). Keep idle workers ready.
- pm.max_spare_servers = 15 — if this is too high, you waste RAM on idle workers during quiet periods. PHP-FPM kills excess idle workers automatically when traffic drops.
- pm.max_requests = 500 — PHP has memory leaks in long-running processes. Recycling workers at 500 requests prevents gradual memory growth that would otherwise require instance restarts at 3am.
Apply the config:
vim /etc/php/8.1/fpm/pool.d/www.conf
# Edit the above values
# Validate configuration before reloading
php-fpm8.1 -t
# Reload without dropping connections (the Ubuntu service name is php8.1-fpm)
systemctl reload php8.1-fpm
Phase 5 — Disk I/O: The Silent Contributor
Even with queries fixed and pool tuned, there was still elevated iowait at peak traffic. Disk I/O was a contributing factor — particularly on the original gp2 EBS volumes.
Reading I/O metrics
iostat -xz 1 5
Device r/s w/s rkB/s wkB/s await r_await w_await %util
nvme0n1 0.00 312.42 0.00 2748.16 48.12 0.00 48.12 96.18
| Metric | Value | Meaning |
|---|---|---|
| r/s / w/s | 0 / 312 | Reads and writes per second. Here: write-heavy (logs, sessions, opcache). |
| await | 48.12ms | Average time (ms) each I/O request waits to be served. >20ms is a problem for a database or session workload. Healthy is 1-5ms. |
| %util | 96.18% | The disk is busy 96% of the time — it’s saturated. >80% sustained = bottleneck. |
Root causes found
- PHP sessions stored on disk. Every request read and wrote a session file to EBS. At 50 concurrent workers × hundreds of requests per second = thousands of session file I/Os.
- gp2 EBS IOPS baseline was only 300 IOPS (3 IOPS/GB on a 100GB volume). The workload required far more.
Fix 1 — Migrate EBS from gp2 to gp3
# gp3 provides 3000 IOPS baseline with no size dependency, at lower cost
aws ec2 modify-volume \
--volume-id vol-0abc123def456 \
--volume-type gp3 \
--iops 3000 \
--throughput 125
This was a live migration — zero downtime, zero data loss. gp3 is also ~20% cheaper than gp2 at equivalent size.
Fix 2 — Move PHP sessions to Redis (see Phase 6).
Phase 6 — Redis: Offloading Sessions and Application Cache
Why Redis eliminates two problems at once
Storing sessions on disk creates I/O contention (shown above). Storing them in a database creates query overhead. Redis — an in-memory key-value store — solves both.
Configure PHP to use Redis for sessions
; /etc/php/8.1/fpm/php.ini
session.save_handler = redis
session.save_path = "tcp://your-elasticache-endpoint:6379"
Verify sessions are landing in Redis
redis-cli -h your-elasticache-endpoint -p 6379
SCAN 0 MATCH PHPREDIS_SESSION:* COUNT 100
# Returns a cursor plus a batch of active session keys.
# (Prefer SCAN over KEYS in production - KEYS blocks the server
#  while it walks the entire keyspace.)
TTL PHPREDIS_SESSION:abc123def456
# Returns remaining TTL in seconds - confirms expiry is working
Application-level caching — cache expensive query results
// Before: hits MySQL on every request
$orders = $db->query("SELECT * FROM tenant_orders WHERE tenant_id = ?", $tenantId);
// After: check Redis first, fall back to MySQL, cache for 5 minutes
$cacheKey = "tenant_orders:{$tenantId}";
$cached = $redis->get($cacheKey); // phpredis returns false on a cache miss
if ($cached !== false) {
$orders = json_decode($cached, true);
} else {
$orders = $db->query("SELECT * FROM tenant_orders WHERE tenant_id = ?", $tenantId);
$redis->setex($cacheKey, 300, json_encode($orders)); // TTL: 300 seconds
}
Monitor Redis hit rate to confirm caching is effective
redis-cli -h your-elasticache-endpoint INFO stats | grep -E "keyspace_hits|keyspace_misses"
keyspace_hits:4821033
keyspace_misses:143291
Hit rate = 4821033 / (4821033 + 143291) = 97.1% — 97% of data requests served from memory, never touching the database.
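To track the hit rate over time, the ratio can be computed straight from the INFO output. A sketch (the field names are the standard `INFO stats` keys):

```shell
# Keyspace hit rate from `redis-cli INFO stats` output.
hit_rate() {
  awk -F: '/^keyspace_hits:/   { h = $2 }
           /^keyspace_misses:/ { m = $2 }
           END { printf "%.1f%%\n", 100 * h / (h + m) }'
}
```

Usage: `redis-cli -h your-elasticache-endpoint INFO stats | hit_rate`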
Phase 7 — File Permissions: The Hidden Error Amplifier
During log analysis, I noticed thousands of lines like this in the Nginx error log:
tail -f /var/log/nginx/error.log
2025/04/06 14:38:02 [error] 3421#3421: *48291 connect() to unix:/run/php/php8.1-fpm.sock failed (13: Permission denied)
2025/04/06 14:38:02 [error] 3421#3421: *48292 connect() to unix:/run/php/php8.1-fpm.sock failed (13: Permission denied)
Every one of these errors caused Nginx to attempt retries. Retries held open connections. Held-open connections consumed memory and file descriptors. At volume, this was silently degrading the system and inflating error rates in CloudWatch.
Diagnose
# Check socket permissions
ls -la /run/php/php8.1-fpm.sock
srw-rw---- 1 www-data www-data 0 Apr 6 11:18 /run/php/php8.1-fpm.sock
The socket is owned by www-data:www-data with mode 660. Nginx runs as nginx user — it has no permission to connect.
Fix — set correct owner and mode in pool config
; /etc/php/8.1/fpm/pool.d/www.conf
listen.owner = nginx
listen.group = nginx
listen.mode = 0660
systemctl reload php8.1-fpm
ls -la /run/php/php8.1-fpm.sock
srw-rw---- 1 nginx nginx 0 Apr 6 14:41 /run/php/php8.1-fpm.sock
Errors stopped immediately. Retry overhead disappeared.
Phase 8 — CloudFront CDN: Eliminating Origin Load for Static Content
Before CDN, every request for images, CSS, JavaScript, and downloadable files hit the EC2 origin directly, consuming PHP-FPM workers, EC2 CPU, and RDS bandwidth even for completely static content.
Traffic breakdown before CDN
Using Nginx access logs to understand what the origin was serving:
awk '{print $7}' /var/log/nginx/access.log | \
sed 's/\?.*$//' | \
grep -oE '\.(jpg|jpeg|png|gif|css|js|pdf|woff2)$' | \
sort | uniq -c | sort -rn | head -10
48291 .jpg
31204 .css
28841 .js
12047 .png
4821 .pdf
2103 .woff2
Over 127,000 requests for static assets hitting the origin per log rotation period — none of them needed to touch EC2 or the database at all.
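Turning those raw counts into a share of total traffic makes the case for a CDN concrete. A sketch over the same log format as above (field 7 is the request path in the default Nginx combined log):

```shell
# Percentage of requests in an Nginx access log that are static assets.
static_share() {
  awk '{ print $7 }' "$1" | sed 's/\?.*$//' \
    | awk '/\.(jpg|jpeg|png|gif|css|js|pdf|woff2)$/ { s++ }
           END { printf "%.0f%% static\n", 100 * s / NR }'
}
```

Usage: `static_share /var/log/nginx/access.log`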
CloudFront configuration
# Create a CloudFront distribution via the AWS CLI. The shorthand below
# creates a basic distribution in front of the ALB; cache behaviours and
# cache policies (e.g. the managed CachingOptimized policy,
# id 658327ea-f89d-4fab-a63d-7e88639e58f6) are attached via the full
# --distribution-config file://distribution.json form.
aws cloudfront create-distribution \
  --origin-domain-name your-alb.us-east-1.elb.amazonaws.com
Cache behaviour rules configured
| Path pattern | Cache TTL | Rationale |
|---|---|---|
| *.jpg, *.png, *.gif | 86400s (24h) | Images don’t change without a filename change |
| *.css, *.js | 3600s (1h) | Versioned by deploy — safe to cache |
| *.pdf | 86400s (24h) | Downloadable files, static |
| /api/* | 0s (no cache) | Dynamic content, must reach origin |
| /admin/* | 0s (no cache) | Authenticated pages, must reach origin |
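CloudFront respects origin Cache-Control headers, so the TTLs above should be backed by matching headers at the Nginx origin. A sketch (the location patterns and lifetimes are illustrative, not the exact production config):

```
# Nginx origin: mark static assets as publicly cacheable
location ~* \.(jpg|jpeg|png|gif|pdf)$ {
    add_header Cache-Control "public, max-age=86400";   # 24h
}
location ~* \.(css|js)$ {
    add_header Cache-Control "public, max-age=3600";    # 1h, versioned by deploy
}
```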
Verify cache is working
curl -I https://yourplatform.com/assets/logo.png
HTTP/2 200
x-cache: Hit from cloudfront ← served from edge, never touched EC2
age: 42831 ← cached for 42831 seconds (nearly 12 hours)
After CloudFront was configured, over 85% of all requests were served from CloudFront edge nodes. Origin traffic dropped dramatically, reducing EC2 CPU, Nginx worker usage, and RDS connection pool pressure simultaneously.
Phase 9 — Swap Configuration (Prevent OOM Kills)
With RAM now properly sized and PHP-FPM capped, swap was no longer being used for overflow — but it still needed to be correctly configured as a safety net.
# Check current swap and swappiness
free -m
cat /proc/sys/vm/swappiness
# Default is often 60 - too aggressive for a server workload
60
vm.swappiness is not a hard threshold — it biases the kernel's memory reclaim between swapping out anonymous (application) pages and dropping page cache. At the default of 60, the kernel swaps application memory fairly eagerly even while plenty of RAM is still reclaimable, which on an application server means unnecessary disk I/O. Lower it:
# Apply immediately
sudo sysctl vm.swappiness=10
# Make permanent
echo "vm.swappiness=10" | sudo tee -a /etc/sysctl.conf
A value of 10 tells the kernel: prefer to use RAM, only use swap when really necessary. This kept the disk I/O low during normal operations and reserved swap as a genuine last resort.
Results Summary
| Metric | Before | After |
|---|---|---|
| PHP-FPM workers (concurrent) | 60-70 | 8-12 |
| Avg PHP-FPM query-induced wait | 18-24 seconds | <200ms |
| RAM used (4GB instance) | 3.6 GB (93%) | 1.4 GB (36%) |
| Swap used | 843 MB | 0 MB |
| EBS disk await | 48ms (saturated) | 3ms |
| DB queries on hot paths | Full table scan (2.8M rows) | Index scan (15K rows) |
| Redis cache hit rate | 0% (no cache) | 97.1% |
| CloudFront cache hit rate | 0% (no CDN) | 85%+ |
| EC2 instances required | 6 | 2 (with headroom) |
| AWS monthly cost | Baseline | ~80% lower |
Key Takeaways
1. Always trace from symptom to root cause before scaling
Adding EC2 instances to handle load was masking the problem. The real issue was slow queries holding PHP-FPM workers hostage. Scaling out would have made the database worse, not better.
2. EXPLAIN is the most important tool in database debugging
type: ALL and key: NULL in EXPLAIN output means full table scan — it will destroy performance at scale. Add the index, run EXPLAIN again, confirm type: ref.
3. PHP-FPM pool configuration must be calculated, not guessed
The formula max_children = available_RAM / avg_worker_RAM × 0.80 is simple but it must be grounded in real measurements from your own server with ps — not copied from a blog post.
4. Permissions errors are never minor
A single Permission denied on a socket generates retry loops, wastes connections, and compounds across thousands of requests per minute. Check dmesg and Nginx error logs regularly.
5. CDN is free EC2 capacity
Every static asset served by CloudFront is a request that never touches your server. For a SaaS platform with 1000+ tenants each requesting assets, this is enormous. Configure it on day one.
6. Redis does double duty
Moving sessions off disk simultaneously fixes I/O saturation and eliminates a class of race conditions in multi-worker environments (where two workers might write to the same session file). The cache layer on top reduces database load further.
Frequently Asked Questions
How do you calculate the correct pm.max_children value for PHP-FPM?
Use the formula: pm.max_children = (Available RAM for PHP-FPM) / (Average memory per PHP-FPM worker). Measure average worker memory with ps --no-headers -o "rss,cmd" -C php-fpm, subtract OS overhead from total RAM, divide, then apply a ~20% safety buffer for traffic spikes and OS page cache.
How do you find slow MySQL queries in production?
Enable the MySQL slow query log with slow_query_log=1 and long_query_time=1. Also enable log_queries_not_using_indexes to catch full table scans. Use SHOW PROCESSLIST to see live queries and EXPLAIN to analyze query execution plans. Look for type: ALL (full table scan) and key: NULL (no index used).
Why are PHP-FPM workers stuck in sleeping state?
Sleeping PHP-FPM workers are blocked waiting for I/O — usually a slow database query. When a query takes 20+ seconds, the worker is held hostage for the entire duration. PHP-FPM spawns more workers to handle incoming requests, which consumes all available RAM and triggers swap usage.
Should I use gp2 or gp3 EBS volumes on AWS?
Use gp3. It provides 3000 IOPS baseline regardless of volume size (gp2 only gives 3 IOPS/GB), costs ~20% less, and supports live migration with zero downtime using aws ec2 modify-volume. There is no reason to use gp2 for new workloads.
How does Redis reduce PHP-FPM worker count?
Redis serves two roles: storing PHP sessions (eliminating disk I/O from session file reads/writes) and caching expensive database queries. With a 97% cache hit rate, most requests never touch MySQL, so workers complete in milliseconds instead of seconds, drastically reducing concurrent worker count.
Related Guides
How to Reduce AWS Costs by Scheduling Dev and Staging Resources
Stop paying for dev and staging environments that run 24/7 when no one is using them.
How to Set Up a Production RDS Database on AWS with Terraform
Multi-AZ, automated backups, parameter groups, and security groups done right.