Lighttpd, 500 errors, and FCGI load balancing… (part 2)
January 11th, 2008
So I very recently upgraded Lighttpd on 3 production application servers to see what effect, if any, it would have on users seeing error 500 and/or the FCGI processes dying. I was watching for lines like this in the lighttpd error logs:
2008-01-10 22:26:56: (mod_fastcgi.c.2462) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7012
Time for some command-line magic to count how many times the FCGI processes had died over the past several days. I ran this command on each of the servers:
for d in 01 02 03 04 05 06 07 08 09 10 11; do echo -n "2008-01-$d "; tail -20000 lighttpd.error.log | grep died | grep -c "2008-01-$d"; done
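The same per-day tally can also be produced in a single pass with awk. This is a minimal sketch, assuming every entry begins with a YYYY-MM-DD timestamp as in the sample log line above; the function name deaths_per_day is hypothetical. Unlike the loop, it reads the whole file rather than only the last 20,000 lines, and it prints every day that appears instead of a hard-coded list of dates.

```shell
#!/bin/sh
# Hypothetical helper: count "died" entries per day in one pass.
# Assumes each log line starts with a timestamp like "2008-01-10 22:26:56:",
# so the first whitespace-separated field is the date.
deaths_per_day() {
  grep 'died' "$1" \
    | awk '{ count[$1]++ } END { for (d in count) print d, count[d] }' \
    | sort   # awk's for-in order is unspecified, so sort by date
}

# Usage (log path is an assumption; adjust to your setup):
# deaths_per_day lighttpd.error.log
```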
For the 3 servers, the daily counts of "died" entries were:
Date        app01  app02  app03
2008-01-01     39     27     19
2008-01-02    115     21     26
2008-01-03    571     38    236
2008-01-04    119    136     51
2008-01-05     98    114     28
2008-01-06    101     48     38
2008-01-07    882     62     49
2008-01-08    100    138    300
2008-01-09     52     29     59
2008-01-10      4     55      6
2008-01-11      7      7      3
When I saw the result from app01 I got excited, until I noticed the 55 failures for app02 on 2008-01-10. Still, early indications are that failures dropped substantially as a by-product of fully utilizing all the available FCGI processes (even though I haven't done a thing _yet_ toward figuring out why they're dying in the first place).
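One possible first step toward the "why" is to check whether the deaths cluster on particular backends, since each entry records the socket it was talking to (tcp:127.0.0.1:7012 in the sample log line at the top of this post). This is a minimal sketch, assuming the "socket: ..." field appears as in that line; the function name deaths_per_socket is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: tally "died" entries by backend socket, most deaths first.
# Assumes entries contain "socket: tcp:HOST:PORT" as in the sample log line.
deaths_per_socket() {
  grep 'died' "$1" \
    | sed -n 's/.*socket: \([^ ]*\).*/\1/p' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $2, $1 }'   # reorder to "socket count"
}

# Usage (log path is an assumption):
# deaths_per_socket lighttpd.error.log
```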
I will keep monitoring this as I continue the search for a permanent solution. I am considering an architecture change that would eliminate FCGI in favor of the new mod_proxy_core in Lighttpd 1.5 in front of multiple Mongrels, or maybe even trying out Thin.