Archive for the 'Linux' Category

Java - JBoss - Seam - Eclipse Development Stack

Thursday, September 25th, 2008

Here’s what I have running well at the moment (likely to change over time as needs change):

  • JDK 1.6 update 7
  • mysql-connector-java (I’m using version 5.1.6)
  • JBoss Application Server 4.2.3.GA
    • To make deployment to this application server possible/easier from my non-privileged user account:
      • chown -R jerry:jerry $JBOSS_HOME
    • I am controlling this only from the command line after observing that launching it from within Eclipse causes the load on my workstation to get very high for no apparent reason.
      • cd $JBOSS_HOME/bin
      • ./run.sh
    • Since I’m also using MySQL, I have copied the mysql-connector-java-5.1.6-bin.jar file into my $JBOSS_HOME/server/default/lib
  • Seam 2.1.0.GA
    • this Seam Getting Started guide will be helpful - and there’s plenty more good info in the Reference Guide.
    • If you’re serious about Seam development, get “Seam in Action” - it’s available as a PDF file too!
    • I do all the seam-gen stuff from the command line - it’s really easy once you get into the “flow” of doing it this way (and eliminates any issues of doing it from within your IDE). This mostly amounts to only a few commands once you start working in your code:
      • seam explode (to get things started)
      • seam restart (to compile and deploy your latest code changes)
      • As the project grows and evolves, it is likely you’ll end up with customized build and deployment command-line scripts anyway.
  • Eclipse 3.4.1 (in my case, eclipse-jee-ganymede-SR1-linux-gtk-x86_64.tar.gz)
    • Nightly build of the JBoss IDE Tools. I am using JBossTools-200810010752-nightly-ALL-linux-gtk-x86_64.zip - however a more recent build should work just as well.
      • Using a recent nightly build solved my problem with the JBoss Tools Palette not working with Eclipse 3.4 as described here
    • Debugging seems to work well (as long as the server is started in “debug” mode from inside Eclipse)
    • JBoss Tools Palette works well.
    • Looks like everything is working for now.

Here’s an important chunk of my .bashrc file:

export JAVA_HOME=/usr/local/java/latest
export PATH=$JAVA_HOME/bin:$PATH

export ANT_HOME=/usr/local/apache-ant-1.7.1
export PATH=$ANT_HOME/bin:$PATH

export JBOSS_HOME=/usr/local/jboss/jboss_as/latest
export PATH=$JBOSS_HOME/bin:$PATH

export SEAM_HOME=/usr/local/jboss/seam/latest
export PATH=$SEAM_HOME:$PATH

The “latest” directory in the paths above is the result of my use of symlinks like this:

jerry@localhost
/usr/local/jboss/seam >ll
total 8
drwxr-xr-x 11 root root 4096 2008-10-23 23:03 jboss-seam-2.1.0.CR1
drwxr-xr-x 12 root root 4096 2008-10-24 15:03 jboss-seam-2.1.0.GA
lrwxrwxrwx  1 root root   20 2008-10-24 15:12 latest -> jboss-seam-2.1.0.GA/

We are sorry for the inconvenience.

Wednesday, September 3rd, 2008

Thankfully, I won’t need to see this for much longer:

visual studio crash

Whenever pair-programming with one of my co-workers, we see this one very often. I am sure their apology is sincere. My favorite part is the “No information has been lost” - when in fact we have seen plenty of instances where there WAS information lost when he couldn’t hit the CTRL-S keys quickly enough. There have been many other strange observations during my recent adventures with the .NET platform.

Soon I will be starting at a new gig where I’ll be using old familiar friends (and some new ones) like:

Of course, this means I’ll probably start writing about the new set of challenges and problems instead of these old ones :-)

Windows vs Linux

Tuesday, August 5th, 2008

I could literally start writing (whining) now about all the things I absolutely hate about Windows and never ever stop. I’ll spare you by listing only one of my most favorite complaints - I just adore this little dialog box that pops up almost EVERY f#^*!ng time you install “Windows Updates”:

Automatic Updates Dialog

Woohoo - yet another reboot to interrupt whatever else it is I’m trying to do. And HERE is where the one of the differences is - I almost never need to reboot my Linux workstations unless I’m upgrading the Linux Kernel itself.

Several friends have asked me: “What’s the difference between Linux and Windows and why would I want to try it?”

My typical answer goes something like this: If you’re a desktop user needing the “basics” such as web browsing, word processing, spreadsheets, and email - without the need for specialized Windows applications like AutoCAD or QuickBooks - then you’re probably a perfect candidate for trying out one of the modern Linux distributions such as Fedora or Ubuntu. Beyond that, you should enjoy more stability and nearly 100% freedom from worry about viruses. It is also worth noting that many current Linux distributions will probably work flawlessly with all your computer’s hardware (network, audio, video, printer, etc). Wireless network adapters have historically been difficult to configure in Linux - but that is becoming less of a problem. With Windows, you will very likely need to obtain and install additional drivers for those components.

Having been a long-time Red Hat Linux user (since around 1996), I personally use Fedora Linux because it is basically Red Hat and I’m very familiar with every part of it. Since then I have used many of the other Linux distributions and even worked professionally with them, but I always come back to Fedora.

However, for me, as a software developer and systems administrator, the differences go much deeper - right down to the kernel of each operating system.

To learn more, there’s PLENTY of additional information available on this subject.

Gigabit Network Performance ??

Thursday, June 5th, 2008

Are you really seeing the full benefit of your gigabit network infrastructure? Today we’re trying to figure out why it seems the private network interconnect links used by our Oracle RAC are causing problems.

First thing is to verify that the NIC ports appear to be configured properly:

root@backup01
/backup/mysql_snapshots >ethtool eth0
Settings for eth0:
        Supported ports: [ MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: d
        Current message level: 0×000000ff (255)
        Link detected: yes

Then we’ll move on to some basic link testing. My two favorite utilities for testing network throughput at a relatively low level are:

Here’s a quick example of how both are used…. first we’ll use ttcp:

Set up the receiving server…

root@images01
~ >ttcp -r -s -f M

Then run the test from the transmitting server…

root@backup01
~ >ttcp -t -s -l 16384 -n 16384 -f M 10.0.0.7
ttcp-t: buflen=16384, nbuf=16384, align=16384/0, port=5001  tcp  -> 10.0.0.7
ttcp-t: socket
ttcp-t: connect
ttcp-t: 268435456 bytes in 2.26 real seconds = 113.51 MB/sec +++
ttcp-t: 16384 I/O calls, msec/call = 0.14, calls/sec = 7264.61
ttcp-t: 0.0user 0.3sys 0:02real 15% 0i+0d 0maxrss 0+5pf 199+2csw

Now we can use the handy bandwidth units conversion tool found here to see that 113.51 MB/sec equals 0.9522 Gbps (gigabits/sec). Nice - nearly the full 1.0 Gbps we wanted to see! I ran the above test with various values for -l and -n with similar results.

Let’s test again - this time using iperf:

Set up the receiving server…

root@images01
~ >iperf -s

Then run the test from the transmitting server…

root@backup01
~ >iperf -c 10.0.0.7
————————————————————
Client connecting to 10.0.0.7, TCP port 5001
TCP window size: 16.0 KByte (default)
————————————————————
[  3] local 10.0.0.6 port 52323 connected with 10.0.0.7 port 5001
[  3]  0.0-10.0 sec  1.09 GBytes    936 Mbits/sec

Again, nearly the full 1.0 Gbps is being achieved!

However, keep in mind that these tests do not take things into account which happen higher up in the protocol stack - such as when you’re using SCP or FTP to copy files across the network.

For example, I checked the speed of an SCP file transfer between the same 2 servers that I used above. You may be surprised to see what appears to be terrible performance like this:

root@backup01
/backup/mysql_snapshots >scp mysql_replicated.2008-06-05-040005.tgz images01:/home/
mysql_replicated.2008-06-05-040005.tgz                                    100%  334MB  33.4MB/s   00:10

Hmmm…. 33.4MB/s is only 0.2802 Gbps. What happened?! Actually, a lot of extra things happened all of which eat up time but the single biggest one is the added overhead of the SSH encryption used during the SCP copy job.

The point being - when you’re examining network performance, you must remember to take EVERYTHING into account. Just having a gigabit switch and some el-cheapo gigabit network adapters is no guarantee of anything. Here’s a VERY incomplete list of things to consider:

  • How are your adapters connected to the system? 64-bit PCI slot filled with a high-end NIC? A port built into the main board? A $12 “gigabit NIC” you found at the local wholesale supplier?
  • Hard drive speeds in the case of copying files or doing anything else involving read/write operations to disk storage
  • What else is happening concurrently?
  • Using the correct drivers?
  • Are you certain the patch cables are OK?
  • Are you certain the ports are in fact set at 1000Mbit/sec speed?
  • Is there anything strange happening at the switch level? VLAN misconfigured? Routing problem?
  • have you eliminated the possibility of DNS trouble? Tried the IP address(es) directly?

If gigabit still isn’t fast enough, step up to link aggregation per 802.3ad (along with the more costly class of high-end switches you’ll need to take full advantage of it). If your budget is flexible enough check out the newer 10 gigabit options.

There’s plenty of additional information online about these topics - google.com is your friend.

HOWTO: Flash in x86_64 Linux

Tuesday, January 15th, 2008

Adobe has, for whatever reason, never released a 64-bit version of Flash. Thankfully, the 32-bit Firefox works just fine in 64-bit Linux - which means we can use the Flash player in spite of their short-sighted decision.

I know there are dozens of ways to do this. Here’s what works for me quick-and-easy in Fedora 8 (although it should work with most other flavors of Linux too):

1. Download the latest i686 (32-bit) version from http://www.mozilla.com/en-US/

2. “Install” Firefox as follows:

# uncompress and untar it...
tar zxvf firefox-2.0.0.11.tar.gz

# move the resulting firefox directory to /usr/local
mv firefox /usr/local/

# make sure it works...
cd /usr/local/firefox
./firefox

# close the browser for now...

On one of my workstations I didn’t already have the i386 version of compat-libstdc++ installed (which caused Firefox to complain about a missing libstdc++.so.5). Easy fix for that:

yum install compat-libstdc++-33.i386

3. Download the latest Flash player in tar.gz format from: http://www.adobe.com

4. “Install” Flash as follows:

# uncompress and untar it...
tar zxvf install_flash_player_9_linux.tar.gz

# give the resulting directory a shorter, more proper name...
mv install_flash_player_9_linux flash

# move it to /usr/local
mv flash /usr/local

# set up the needed symbolic link for firefox to find and use flash...
cd /usr/local/firefox/plugins/

ln -s /usr/local/flash/libflashplayer.so libflashplayer.so

5. Use your menu editor (in my case the KDE Menu Editor) to make sure the correct firefox gets launched from the menu. If you already had a version of firefox installed, it’s probably already set to use:

/usr/bin/firefox

Just change that to be:

/usr/local/firefox/firefox

6. Done! Launch Firefox and visit a site that uses Flash, such as youtube.com and enjoy.

Note: It is likely that you will need elevated privileges for the directory moving and symbolic linking operations. I simply do the whole thing as root to make it easy.

Lighttpd, 500 errors, and FCGI load balancing… (part 2)

Friday, January 11th, 2008

So I very recently upgraded Lighttpd on 3 production application servers to see what, if any, effect it would have on users seeing error 500 and/or the FCGI processes dying. I would be looking for lines similar to this in the lighttpd error logs:

2008-01-10 22:26:56: (mod_fastcgi.c.2462) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7012

Time for some command line magic to figure out the death count on the FCGI processes in the recent past. I ran this command on each of the servers:

for d in 01 02 03 04 05 06 07 08 09 10; do echo -n "2008-01-$d   "; tail -20000 lighttpd.error.log | grep died | grep -c "2008-01-$d"; done

For the 3 servers, results were:

Server: app01
2008-01-01   39
2008-01-02   115
2008-01-03   571
2008-01-04   119
2008-01-05   98
2008-01-06   101
2008-01-07   882
2008-01-08   100
2008-01-09   52
2008-01-10   4
2008-01-11   7

Server: app02
2008-01-01   27
2008-01-02   21
2008-01-03   38
2008-01-04   136
2008-01-05   114
2008-01-06   48
2008-01-07   62
2008-01-08   138
2008-01-09   29
2008-01-10   55
2008-01-11   7

Server: app03
2008-01-01   19
2008-01-02   26
2008-01-03   236
2008-01-04   51
2008-01-05   28
2008-01-06   38
2008-01-07   49
2008-01-08   300
2008-01-09   59
2008-01-10   6
2008-01-11   3

When I saw the result from app01, I got excited until I looked at 55 failures for app02 on 2008-01-10. However, early indications are that a substantial reduction was made in the failures as a by-product of fully utilizing all the available FCGI processes (even though I haven’t done a thing _yet_ toward figuring out why they’re dying in the first place).

I will keep monitoring this as I continue the search for a permanent solution. I am considering an architecture change to eliminate FCGI and start using the new mod_proxy_core in Lighttpd 1.5 in front of multiple Mongrels or maybe even try out Thin.

Lighttpd, 500 errors, and FCGI load balancing… (part 1)

Wednesday, January 9th, 2008

Today’s challenge is to figure out why some of our site’s visitors are occasionally seeing status code 500 (Internal Server Error). First stop is the Lighttpd error logs from one of the application servers where I found entries like this:

2008-01-09 11:21:43: (mod_fastcgi.c.2396) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7000
2008-01-09 11:21:42: (mod_fastcgi.c.2396) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7000
2008-01-09 11:21:43: (mod_fastcgi.c.3181) response not received, request sent: 1386 on socket: tcp:127.0.0.1:7000 for /dispatch.fcgi , closing connection
2008-01-09 11:21:42: (mod_fastcgi.c.3181) response not received, request sent: 1497 on socket: tcp:127.0.0.1:7000 for /dispatch.fcgi , closing connection
2008-01-09 11:47:11: (mod_fastcgi.c.2396) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7000
2008-01-09 11:47:11: (mod_fastcgi.c.3181) response not received, request sent: 1336 on socket: tcp:127.0.0.1:7000 for /dispatch.fcgi , closing connection
2008-01-09 11:47:11: (mod_fastcgi.c.2396) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7000
2008-01-09 11:47:11: (mod_fastcgi.c.3181) response not received, request sent: 1295 on socket: tcp:127.0.0.1:7000 for /dispatch.fcgi , closing connection
2008-01-09 11:47:12: (mod_fastcgi.c.2689) establishing connection failed: Connection refused socket: tcp:127.0.0.1:7000
2008-01-09 11:47:12: (mod_fastcgi.c.2689) establishing connection failed: Connection refused socket: tcp:127.0.0.1:7000
2008-01-09 11:47:18: (mod_fastcgi.c.2635) fcgi-server re-enabled: tcp:127.0.0.1:7000
2008-01-09 11:47:18: (mod_fastcgi.c.2635) fcgi-server re-enabled: tcp:127.0.0.1:7000

Looks like Lighttpd is occasionally handing requests off to fcgi processes that have died. Simple enough…. but why are they dying?

Next stop is a look at some output from this command:

ps -eFH | grep fcgi | grep -v grep

as shown here:

UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
lighttpd 30189     1  0 20492 23320   3 Jan08 ?        00:00:02   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/script/process/spawner fcgi -p 7000 -i 20 -r 5
lighttpd 30200     1  0 101967 219912 2 Jan08 ?        00:01:01   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30202     1  0 101364 217368 2 Jan08 ?        00:00:20   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30204     1  0 58803 57416   0 Jan08 ?        00:00:02   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30206     1  0 58805 57416   3 Jan08 ?        00:00:03   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30208     1  0 58805 57420   0 Jan08 ?        00:00:03   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30210     1  0 58803 57416   2 Jan08 ?        00:00:02   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30212     1  0 58802 57412   2 Jan08 ?        00:00:03   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30214     1  0 58803 57416   3 Jan08 ?        00:00:03   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30216     1  0 58805 57420   1 Jan08 ?        00:00:03   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30218     1  0 58803 57416   0 Jan08 ?        00:00:03   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30222     1  0 58805 57420   3 Jan08 ?        00:00:03   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30225     1  0 58805 57420   2 Jan08 ?        00:00:02   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30227     1  0 58804 57416   2 Jan08 ?        00:00:03   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30229     1  0 58805 57420   1 Jan08 ?        00:00:02   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30231     1  0 58803 57416   3 Jan08 ?        00:00:03   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 30233     1  0 58805 57416   1 Jan08 ?        00:00:03   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 32106     1  1 103389 225564 3 11:08 ?        00:01:15   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 32164     1  0 100314 213196 3 11:08 ?        00:00:16   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd  5059     1 44 106262 237152 0 11:47 ?        00:17:53   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd  8606     1 14 98805 207184  3 12:26 ?        00:00:08   /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi

Our current production deployment stack is a simple FCGI configuration. Each application server has 20 instances of dispatch.fcgi waiting for Lighttpd to hand off requests to them. They service the request, return the result to Lighttpd, then Lighttpd returns the request to the user’s browser.

In our case, more than half of the dispatch.fcgi processes appear to be untouched according to the memory usage indicated in the RSS column (meaning they have not yet “bloated” their memory consumption beyond the initial amount). Since first deploying the application many months ago I’ve always known this was the case and I always assumed it was due to Lighttpd not getting enough concurrent requests to need to call upon those idle processes.

Timestamp values in the STIME column indicate we have 4 fcgi processes which were recently started by spawner. Spawner (the first process shown in the above list) simply checks to make sure the specified number of processes exist and spawns new processes if needed. In an ideal world, we’d start 20 processes and they would faithfully handle requests forever…. but such is not the case.

The astute observer will have noticed the processes which are being used are all consuming 200+ MB of RAM each! We can probably attribute much of the heavy RAM usage to the typically inefficient manner in which in-memory objects are created when ActiveRecord, for the sake of convenience, loads many columns of data into memory we aren’t going to need at that moment. This is another problem to be discussed and debated (at length) at some other time.

I was running an older version of Lighttpd (1.4.13) on these servers and hadn’t bothered to upgrade because 1) it has been working pretty well, and 2) I’ve had plenty of other things to do. On the off-chance that Lighttpd might be at fault here I decided to have a look at the change logs for the newer versions of Lighttpd. A few things in the change logs for the newer versions caught my eye:

from the 1.4.14 change log:
  * fix http 500 errors (colin.stephen/at/o2.com) #1041 [1663]
  * Added round-robin support to mod_fastcgi [1500]
  * fix cpu hog in certain requests [1473] CVE-2007-1869from the 1.4.16 change log:
  * fixed check on stale errno values, which broke handling of broken fastcgi applications. (#1245)

from the 1.4.18 change log:
  * fixed FastCGI header overrun in mod_fastcgi (reported by mattias@secweb.se)

The first two I listed from 1.4.14 especially interested me…

So I downloaded the latest Lighttpd source (currently version 1.4.18) and compiled/installed it on one of the application servers to see if anything changes for the better. I immediately noticed something very interesting from the ps output as shown here:

UID PID PPID C SZ RSS PSR STIME TTY TIME CMD
lighttpd 11946 1 0 20444 23248 3 Jan08 ? 00:00:02 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/script/process/spawner fcgi -p 7000 -i 20 -r 5
lighttpd 11959 1 0 126376 317356 3 Jan08 ? 00:06:43 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11961 1 0 106143 236516 1 Jan08 ? 00:06:19 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11965 1 0 103744 227076 2 Jan08 ? 00:05:36 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11967 1 0 103759 227156 1 Jan08 ? 00:05:14 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11969 1 0 102879 223588 1 Jan08 ? 00:04:23 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11971 1 0 103456 225780 2 Jan08 ? 00:05:15 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11979 1 0 105246 233092 3 Jan08 ? 00:05:40 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11981 1 0 103516 226188 3 Jan08 ? 00:04:55 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11986 1 0 103541 226244 1 Jan08 ? 00:04:21 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11988 1 0 103077 224392 2 Jan08 ? 00:04:27 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11990 1 0 102421 221764 2 Jan08 ? 00:04:45 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 18924 1 0 122312 301352 1 Jan08 ? 00:11:05 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 12774 1 9 104090 228480 3 10:03 ? 00:18:28 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 12776 1 4 103632 226648 3 10:03 ? 00:07:28 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 11144 1 4 102462 221940 1 11:25 ? 00:04:45 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 14011 1 4 102638 222600 1 11:44 ? 00:03:34 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 14515 1 4 104432 229508 3 11:49 ? 00:03:30 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 17360 1 4 102608 222544 3 12:16 ? 00:02:36 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 21006 1 7 103863 227544 1 12:46 ? 00:01:37 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi
lighttpd 23795 1 8 101678 218672 1 12:59 ? 00:00:49 /usr/bin/ruby /home/lighttpd/theinspirit.com/current/public/dispatch.fcgi

Based on values in the RSS column it seems that Lighttpd is now getting all of the fcgi processes involved. This is what the change log meant by “Added round-robin support to mod_fastcgi”. Now I’ve got 20 bloated processes in memory eating up almost 5 GB of RAM (around 230MB each). It’s a good thing I have 8GB in these servers!

Let’s have another look at the Lighttpd error log:

2008-01-09 11:05:13: (mod_fastcgi.c.2635) fcgi-server re-enabled: tcp:127.0.0.1:7000
2008-01-09 11:05:14: (mod_fastcgi.c.2635) fcgi-server re-enabled: tcp:127.0.0.1:7000
2008-01-09 11:20:15: (log.c.135) server stopped
2008-01-09 11:20:15: (log.c.135) server stopped
2008-01-09 11:20:15: (log.c.135) server stopped
2008-01-09 11:20:15: (log.c.135) server stopped
2008-01-09 11:20:15: (log.c.135) server stopped
2008-01-09 11:20:15: (log.c.135) server stopped
2008-01-09 11:20:15: (log.c.135) server stopped
2008-01-09 11:20:15: (log.c.135) server stopped
2008-01-09 11:20:17: (log.c.75) server started
2008-01-09 11:33:15: (network_openssl.c.133) SSL: 5 -1 104 Connection reset by peer
2008-01-09 11:33:15: (connections.c.603) connection closed: write failed on fd 8
2008-01-09 11:44:38: (mod_fastcgi.c.2462) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7012
2008-01-09 11:44:38: (mod_fastcgi.c.3254) response not received, request sent: 855 on socket: tcp:127.0.0.1:7012 for /dispatch.fcgi , closing connection
2008-01-09 13:11:15: (server.c.1253) NOTE: a request for /dispatch.fcgi timed out after writing 34776 bytes. We waited 360 seconds. If this a problem increase server.max-write-idle
2008-01-09 13:11:27: (server.c.1253) NOTE: a request for /dispatch.fcgi timed out after writing 36064 bytes. We waited 360 seconds. If this a problem increase server.max-write-idle
2008-01-09 13:27:00: (mod_fastcgi.c.2462) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:7018
2008-01-09 13:27:00: (mod_fastcgi.c.3254) response not received, request sent: 1226 on socket: tcp:127.0.0.1:7018 for /dispatch.fcgi , closing connection
2008-01-09 13:28:56: (connections.c.279) SSL: 1 error:140940E5:SSL routines:SSL3_READ_BYTES:ssl handshake failure

We can see where Lighttpd was restarted at 11:20:17, and unfortunately with some continuing instances of “unexpected end-of-file (perhaps the fastcgi process died)”.

Early indications are that we’re probably better off than before due to the now functional round-robin distribution of requests. As a by-product of the improved round-robin request handling I suspect I may have reduced the frequency of 500 errors seen by the users because Lighttpd will now send the request to the next fcgi process in the round-robin pool whereas before it was not using all 20 of the available processes. I will go ahead and install this latest version of Lighttpd on all the application servers.

However, the fcgi processes are still dying….. which means I get to figure out why.

The fun never ends.

Index your data already!

Thursday, November 29th, 2007

So I’m still toiling away building this Ruby-on-Rails based web application and decided to take a break from coding optimization and have a look at the MySQL slow query log files on the database servers.

Although I found several good examples of slowly executing queries, I’ll just talk about this one:

SELECT profiles.`id` AS t0_r0, profiles.`user_id` AS t0_r1, profiles.`type` AS t0_r2, profiles.`screen_name` AS t0_r3, profiles.`postal_code` AS t0_r4, profiles.`updated_at` AS t0_r5, profiles.`created_at` AS t0_r6, profiles.`real_name` AS t0_r7, profiles.`short_name` AS t0_r8, profiles.`city` AS t0_r9, profiles.`state` AS t0_r10, profiles.`hometown` AS t0_r11, profiles.`quotation` AS t0_r12, profiles.`bio` AS t0_r13, profiles.`statements` AS t0_r14, profiles.`achievements` AS t0_r15, profiles.`specialties` AS t0_r16, profiles.`inspirations` AS t0_r17, profiles.`country` AS t0_r18, profiles.`channel_id` AS t0_r19, profiles.`styles` AS t0_r20, profiles.`featured_at` AS t0_r21, profiles.`featured_by_id` AS t0_r22, profiles.`bookmarked_by_count` AS t0_r23, profiles.`group_members` AS t0_r24, profiles.`looking_for` AS t0_r25, profiles.`work_with` AS t0_r26, profiles.`favorite_events` AS t0_r27, profiles.`fan_of` AS t0_r28, profiles.`favorite_talent` AS t0_r29, profiles.`favorite_music` AS t0_r30, profiles.`favorite_films` AS t0_r31, profiles.`favorite_performances` AS t0_r32, profiles.`membership_criteria` AS t0_r33, profiles.`skills` AS t0_r34, profiles.`materials` AS t0_r35, profiles.`exhibitions` AS t0_r36, profiles.`views` AS t0_r37, profiles.`soapbox` AS t0_r38, profiles.`user_activated` AS t0_r39, profiles.`phone_number` AS t0_r40, profiles.`address` AS t0_r41, profiles.`role_cache` AS t0_r42, profiles.`genre_cache` AS t0_r43, profiles.`upload_total_bytes_cache` AS t0_r44, profiles.`upload_item_count_cache` AS t0_r45, profiles.`popularity_factor` AS t0_r46, profiles.`views_today` AS t0_r47, profiles.`portfolio_views_today` AS t0_r48, profiles.`bookmarkings_today` AS t0_r49, profiles.`acknowledgements_today` AS t0_r50, profiles.`popularity_data` AS t0_r51, profiles.`popularity_factor_raw` AS t0_r52, profiles.`display_admin_info` AS t0_r53, profiles.`places` AS t0_r54, profiles.`where` AS t0_r55, profiles.`who` AS t0_r56, profiles.`craziest` AS t0_r57, profiles.`associations` AS t0_r58, profiles.`primary_image_id` AS t0_r59, profiles.`custom_url_name` AS t0_r60, profiles.`custom_url_hits` AS t0_r61, profiles.`web_presence_data` AS t0_r62, profiles.`has_portfolio_image` AS t0_r63, profiles.`has_portfolio_audio` AS t0_r64, profiles.`has_portfolio_remote_video` AS t0_r65, profiles.`has_portfolio_writing` AS t0_r66, profiles.`last_portfolio_item_created_at` AS t0_r67, profiles.`has_portfolio_video` AS t0_r68, profiles.`spotlighted_at` AS t0_r69, profiles.`clan_name` AS t0_r70, genres.`id` AS t1_r0, genres.`channel_id` AS t1_r1, genres.`name` AS t1_r2, genres.`display` AS t1_r3, roles.`id` AS t2_r0, roles.`name` AS t2_r1, roles.`display` AS t2_r2, channels.`id` AS t3_r0, channels.`name` AS t3_r1, channels.`ad_click_id` AS t3_r2, channels.`ad_zone_id` AS t3_r3, channels.`indieclicks_zone_id` AS t3_r4, portfolios.`id` AS t4_r0, portfolios.`profile_id` AS t4_r1, portfolios.`type` AS t4_r2, portfolios.`created_at` AS t4_r3, portfolios.`updated_at` AS t4_r4
FROM profiles
LEFT OUTER JOIN profile_genres ON ( profiles.`id` = profile_genres.`profile_id` )
LEFT OUTER JOIN genres ON ( genres.`id` = profile_genres.`genre_id` )
LEFT OUTER JOIN profile_roles ON ( profiles.`id` = profile_roles.`profile_id` )
LEFT OUTER JOIN roles ON ( roles.`id` = profile_roles.`role_id` )
LEFT OUTER JOIN channels ON channels.id = profiles.channel_id
LEFT OUTER JOIN portfolios ON portfolios.profile_id = profiles.id
AND portfolios.`type` = 'FreePortfolio'
WHERE ((profiles.channel_id =9)
AND (profiles.user_activated =1))
AND (featured_at IS NOT NULL)
AND profiles.id IN ('1051301', '1047701', '805801', '1016601', '566501', '215501', '605501', '301901', '781401', '768001', '912401', '911301')
ORDER BY featured_at DESC

A Rails developer will immediately recognize that as a typical SQL statement created by ActiveRecord to support our insatiable desire for OOPH (Object Oriented Programming Happiness). There IS a potential dark side to the dot notation we’ve come to love.

Slow query log had this to say about it:

# Query_time: 5  Lock_time: 0  Rows_sent: 2847  Rows_examined: 5645986

Notice the Rows_examined value - 5,645,986 rows! It’s a good thing our database servers have a crap-ton of RAM in them along with super-fast RAID disk arrays :-)

I then manually re-executed that same query several times and it consistently needed 1 to 3 seconds to run. Time to EXPLAIN things a little more or rather have MySQL explain it for me. Running EXPLAIN on that same query revealed the need to index the profile_id field in two tables (profile_genres and profile_roles).

After taking a minute to create the new indexes (including making new migration files and whatnot), I ran EXPLAIN again and was pleased to see MySQL needed to look at only 18 rows of data now instead of 5,645,986. Furthermore, the SAME query that previously needed several seconds was now consistently finishing in 0.007 seconds.

How the hell did we let this happen?!?! Simple: when juggling too many balls some inevitably get dropped. Ok, that’s only part of it. Another part is resisting the temptation to do premature optimizations to your systems. Arguably, in this case, the indexes should have been created when those data associations were set up. This is why we use the slow query logging and must periodically check for surprises like this.

Keep in mind that data indexing comes at a price. Indexing can be a double-edged sword in the sense that those indexes must be maintained when writing data to indexed tables. It is quite possible, even easy, to kill database write performance if you go too far and index too many things trying to speed up read performance.

Checking your slow query logs frequently and making cautious indexing changes will help more than most people realize. Especially when your tables grow beyond a few thousand rows.

Lost in a sea of code…

Monday, June 18th, 2007

After leading the charge for 10 months on our ever-growing Ruby on Rails based application, it has long since reached the point where I can easily waste several minutes looking for something either I or one of the other developers has worked on. Here’s a little bash script I wrote to make things faster and easier to find:

#!/bin/sh
# /usr/local/bin/grepp (note the extra p on the end)
# assuming I'm already in the root of my app directory, I only want to look in these
locations="app bin lib"

for location in $locations; do
  echo "-----------------------------------------------------"
  echo "Searching in $location/*"
  grep -ir "$1" $location/* | grep -v svn
done

Super-Simple and saves me a lot of time when I need to quickly dig into it and modify things!

Quick-n-easy archiving of a directory or file

Friday, April 13th, 2007

It happens all the time. You’re about to do something potentially stupid and decide that a quickie backup might be nice to have. I’ve written a simple script that makes this painless.

Suppose you have a directory named ‘whatever’ and you want a quick, recursive, backup of whatever is in it:

   arc whatever

will produce the following file:

   whatever.2007-04-13-150933.tgz

My arc script automatically handles the tar, gzip, and timestamp operations for you!

Put this in your /usr/local/bin/ directory in a file named ‘arc’ and make it executable:

#!/bin/sh

DT=$(date +%Y-%m-%d-%H%M%S)
verbose="v"

if [ "$1" == "-q" ]; then
  shift
  verbose=""
fi

while [ "$1" != "" ]; do
  dir=$1
  dir=${dir///}
  tar czf$verbose $dir.$DT.tgz $dir
  shift
done

exit $?

Give it a first argument of -q to have it quietly execute with no output.

Of course, this script can also be called upon from inside other scripts to do very useful things.