Lighttpd, 500 errors, and FCGI load balancing… (part 1)
January 9th, 2008Today’s challenge is to figure out why some of our site’s visitors are occasionally seeing status code 500 (Internal Server Error). First stop is the Lighttpd error logs from one of the application servers where I found entries like this:
2008-01-09 11:21:43: (mod_fastcgi.c.2396) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7000 2008-01-09 11:21:42: (mod_fastcgi.c.2396) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7000 2008-01-09 11:21:43: (mod_fastcgi.c.3181) response not received, request sent: 1386 on socket: tcp:127.0.0.1:7000 for /dispatch.fcgi , closing connection 2008-01-09 11:21:42: (mod_fastcgi.c.3181) response not received, request sent: 1497 on socket: tcp:127.0.0.1:7000 for /dispatch.fcgi , closing connection 2008-01-09 11:47:11: (mod_fastcgi.c.2396) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7000 2008-01-09 11:47:11: (mod_fastcgi.c.3181) response not received, request sent: 1336 on socket: tcp:127.0.0.1:7000 for /dispatch.fcgi , closing connection 2008-01-09 11:47:11: (mod_fastcgi.c.2396) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7000 2008-01-09 11:47:11: (mod_fastcgi.c.3181) response not received, request sent: 1295 on socket: tcp:127.0.0.1:7000 for /dispatch.fcgi , closing connection 2008-01-09 11:47:12: (mod_fastcgi.c.2689) establishing connection failed: Connection refused socket: tcp:127.0.0.1:7000 2008-01-09 11:47:12: (mod_fastcgi.c.2689) establishing connection failed: Connection refused socket: tcp:127.0.0.1:7000 2008-01-09 11:47:18: (mod_fastcgi.c.2635) fcgi-server re-enabled: tcp:127.0.0.1:7000 2008-01-09 11:47:18: (mod_fastcgi.c.2635) fcgi-server re-enabled: tcp:127.0.0.1:7000
Looks like Lighttpd is occasionally handing requests off to fcgi processes that have died. Simple enough…. but why are they dying?
Next stop is a look at some output from this command:
ps -eFH | grep fcgi | grep -v grep
as shown here:
UID PID PPID C SZ RSS PSR STIME TTY TIME CMD lighttpd 30189 1 0 20492 23320 3 Jan08 ? 00:00:02 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/script/process/spawner fcgi -p 7000 -i 20 -r 5 lighttpd 30200 1 0 101967 219912 2 Jan08 ? 00:01:01 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 30202 1 0 101364 217368 2 Jan08 ? 00:00:20 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 30204 1 0 58803 57416 0 Jan08 ? 00:00:02 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 30206 1 0 58805 57416 3 Jan08 ? 00:00:03 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 30208 1 0 58805 57420 0 Jan08 ? 00:00:03 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 30210 1 0 58803 57416 2 Jan08 ? 00:00:02 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 30212 1 0 58802 57412 2 Jan08 ? 00:00:03 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 30214 1 0 58803 57416 3 Jan08 ? 00:00:03 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 30216 1 0 58805 57420 1 Jan08 ? 00:00:03 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 30218 1 0 58803 57416 0 Jan08 ? 00:00:03 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 30222 1 0 58805 57420 3 Jan08 ? 00:00:03 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 30225 1 0 58805 57420 2 Jan08 ? 00:00:02 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 30227 1 0 58804 57416 2 Jan08 ? 00:00:03 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 30229 1 0 58805 57420 1 Jan08 ? 00:00:02 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 30231 1 0 58803 57416 3 Jan08 ? 00:00:03 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 30233 1 0 58805 57416 1 Jan08 ? 00:00:03 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 32106 1 1 103389 225564 3 11:08 ? 00:01:15 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 32164 1 0 100314 213196 3 11:08 ? 00:00:16 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 5059 1 44 106262 237152 0 11:47 ? 00:17:53 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 8606 1 14 98805 207184 3 12:26 ? 00:00:08 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
Our current production deployment stack is a simple FCGI configuration. Each application server has 20 instances of dispatch.fcgi waiting for Lighttpd to hand off requests to them. They service the request, return the result to Lighttpd, then Lighttpd returns the request to the user’s browser.
In our case, more than half of the dispatch.fcgi processes appear to be untouched according to the memory usage indicated in the RSS column (meaning they have not yet “bloated” their memory consumption beyond the initial amount). Since first deploying the application many months ago I’ve always known this was the case and I always assumed it was due to Lighttpd not getting enough concurrent requests to need to call upon those idle processes.
Timestamp values in the STIME column indicate we have 4 fcgi processes which were recently started by spawner. Spawner (the first process shown in the above list) simply checks to make sure the specified number of processes exist and spawns new processes if needed. In an ideal world, we’d start 20 processes and they would faithfully handle requests forever…. but such is not the case.
The astute observer will have noticed the processes which are being used are all consuming 200+ MB of RAM each! We can probably attribute much of the heavy RAM usage to the typically inefficient manner in which in-memory objects are created when ActiveRecord, for the sake of convenience, loads many columns of data into memory we aren’t going to need at that moment. This is another problem to be discussed and debated (at length) at some other time.
I was running an older version of Lighttpd (1.4.13) on these servers and hadn’t bothered to upgrade because 1) it has been working pretty well, and 2) I’ve had plenty of other things to do. On the off-chance that Lighttpd might be at fault here I decided to have a look at the change logs for the newer versions of Lighttpd. A few things in the change logs for the newer versions caught my eye:
from the 1.4.14 change log: * fix http 500 errors (colin.stephen/at/o2.com) #1041 [1663] * Added round-robin support to mod_fastcgi [1500] * fix cpu hog in certain requests [1473] CVE-2007-1869from the 1.4.16 change log: * fixed check on stale errno values, which broke handling of broken fastcgi applications. (#1245) from the 1.4.18 change log: * fixed FastCGI header overrun in mod_fastcgi (reported by mattias@secweb.se)
The first two I listed from 1.4.14 especially interested me…
So I downloaded the latest Lighttpd source (currently version 1.4.18) and compiled/installed it on one of the application servers to see if anything changes for the better. I immediately noticed something very interesting from the ps output as shown here:
UID PID PPID C SZ RSS PSR STIME TTY TIME CMD lighttpd 11946 1 0 20444 23248 3 Jan08 ? 00:00:02 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/script/process/spawner fcgi -p 7000 -i 20 -r 5 lighttpd 11959 1 0 126376 317356 3 Jan08 ? 00:06:43 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 11961 1 0 106143 236516 1 Jan08 ? 00:06:19 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 11965 1 0 103744 227076 2 Jan08 ? 00:05:36 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 11967 1 0 103759 227156 1 Jan08 ? 00:05:14 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 11969 1 0 102879 223588 1 Jan08 ? 00:04:23 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 11971 1 0 103456 225780 2 Jan08 ? 00:05:15 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 11979 1 0 105246 233092 3 Jan08 ? 00:05:40 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 11981 1 0 103516 226188 3 Jan08 ? 00:04:55 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 11986 1 0 103541 226244 1 Jan08 ? 00:04:21 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 11988 1 0 103077 224392 2 Jan08 ? 00:04:27 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 11990 1 0 102421 221764 2 Jan08 ? 00:04:45 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 18924 1 0 122312 301352 1 Jan08 ? 00:11:05 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 12774 1 9 104090 228480 3 10:03 ? 00:18:28 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 12776 1 4 103632 226648 3 10:03 ? 00:07:28 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 11144 1 4 102462 221940 1 11:25 ? 00:04:45 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 14011 1 4 102638 222600 1 11:44 ? 00:03:34 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 14515 1 4 104432 229508 3 11:49 ? 00:03:30 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 17360 1 4 102608 222544 3 12:16 ? 00:02:36 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 21006 1 7 103863 227544 1 12:46 ? 00:01:37 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi lighttpd 23795 1 8 101678 218672 1 12:59 ? 00:00:49 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
Based on values in the RSS column it seems that Lighttpd is now getting all of the fcgi processes involved. This is what the change log meant by “Added round-robin support to mod_fastcgi”. Now I’ve got 20 bloated processes in memory eating up almost 5 GB of RAM (around 230MB each). It’s a good thing I have 8GB in these servers!
Let’s have another look at the Lighttpd error log:
2008-01-09 11:05:13: (mod_fastcgi.c.2635) fcgi-server re-enabled: tcp:127.0.0.1:7000 2008-01-09 11:05:14: (mod_fastcgi.c.2635) fcgi-server re-enabled: tcp:127.0.0.1:7000 2008-01-09 11:20:15: (log.c.135) server stopped 2008-01-09 11:20:15: (log.c.135) server stopped 2008-01-09 11:20:15: (log.c.135) server stopped 2008-01-09 11:20:15: (log.c.135) server stopped 2008-01-09 11:20:15: (log.c.135) server stopped 2008-01-09 11:20:15: (log.c.135) server stopped 2008-01-09 11:20:15: (log.c.135) server stopped 2008-01-09 11:20:15: (log.c.135) server stopped 2008-01-09 11:20:17: (log.c.75) server started 2008-01-09 11:33:15: (network_openssl.c.133) SSL: 5 -1 104 Connection reset by peer 2008-01-09 11:33:15: (connections.c.603) connection closed: write failed on fd 8 2008-01-09 11:44:38: (mod_fastcgi.c.2462) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7012 2008-01-09 11:44:38: (mod_fastcgi.c.3254) response not received, request sent: 855 on socket: tcp:127.0.0.1:7012 for /dispatch.fcgi , closing connection 2008-01-09 13:11:15: (server.c.1253) NOTE: a request for /dispatch.fcgi timed out after writing 34776 bytes. We waited 360 seconds. If this a problem increase server.max-write-idle 2008-01-09 13:11:27: (server.c.1253) NOTE: a request for /dispatch.fcgi timed out after writing 36064 bytes. We waited 360 seconds. If this a problem increase server.max-write-idle 2008-01-09 13:27:00: (mod_fastcgi.c.2462) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7018 2008-01-09 13:27:00: (mod_fastcgi.c.3254) response not received, request sent: 1226 on socket: tcp:127.0.0.1:7018 for /dispatch.fcgi , closing connection 2008-01-09 13:28:56: (connections.c.279) SSL: 1 error:140940E5:SSL routines:SSL3_READ_BYTES:ssl handshake failure
We can see where Lighttpd was restarted at 11:20:17, and unfortunately with some continuing instances of “unexpected end-of-file (perhaps the fastcgi process died)”.
Early indications are that we’re probably better off than before due to the now functional round-robin distribution of requests. As a by-product of the improved round-robin request handling I suspect I may have reduced the frequency of 500 errors seen by the users because Lighttpd will now send the request to the next fcgi process in the round-robin pool whereas before it was not using all 20 of the available processes. I will go ahead and install this latest version of Lighttpd on all the application servers.
However, the fcgi processes are still dying….. which means I get to figure out why.
The fun never ends.