Archive for November, 2007

Index your data already!

Thursday, November 29th, 2007

So I’m still toiling away building this Ruby-on-Rails based web application and decided to take a break from coding optimization and have a look at the MySQL slow query log files on the database servers.

Although I found several good examples of slowly executing queries, I’ll just talk about this one:

SELECT profiles.`id` AS t0_r0, profiles.`user_id` AS t0_r1, profiles.`type` AS t0_r2, profiles.`screen_name` AS t0_r3, profiles.`postal_code` AS t0_r4, profiles.`updated_at` AS t0_r5, profiles.`created_at` AS t0_r6, profiles.`real_name` AS t0_r7, profiles.`short_name` AS t0_r8, profiles.`city` AS t0_r9, profiles.`state` AS t0_r10, profiles.`hometown` AS t0_r11, profiles.`quotation` AS t0_r12, profiles.`bio` AS t0_r13, profiles.`statements` AS t0_r14, profiles.`achievements` AS t0_r15, profiles.`specialties` AS t0_r16, profiles.`inspirations` AS t0_r17, profiles.`country` AS t0_r18, profiles.`channel_id` AS t0_r19, profiles.`styles` AS t0_r20, profiles.`featured_at` AS t0_r21, profiles.`featured_by_id` AS t0_r22, profiles.`bookmarked_by_count` AS t0_r23, profiles.`group_members` AS t0_r24, profiles.`looking_for` AS t0_r25, profiles.`work_with` AS t0_r26, profiles.`favorite_events` AS t0_r27, profiles.`fan_of` AS t0_r28, profiles.`favorite_talent` AS t0_r29, profiles.`favorite_music` AS t0_r30, profiles.`favorite_films` AS t0_r31, profiles.`favorite_performances` AS t0_r32, profiles.`membership_criteria` AS t0_r33, profiles.`skills` AS t0_r34, profiles.`materials` AS t0_r35, profiles.`exhibitions` AS t0_r36, profiles.`views` AS t0_r37, profiles.`soapbox` AS t0_r38, profiles.`user_activated` AS t0_r39, profiles.`phone_number` AS t0_r40, profiles.`address` AS t0_r41, profiles.`role_cache` AS t0_r42, profiles.`genre_cache` AS t0_r43, profiles.`upload_total_bytes_cache` AS t0_r44, profiles.`upload_item_count_cache` AS t0_r45, profiles.`popularity_factor` AS t0_r46, profiles.`views_today` AS t0_r47, profiles.`portfolio_views_today` AS t0_r48, profiles.`bookmarkings_today` AS t0_r49, profiles.`acknowledgements_today` AS t0_r50, profiles.`popularity_data` AS t0_r51, profiles.`popularity_factor_raw` AS t0_r52, profiles.`display_admin_info` AS t0_r53, profiles.`places` AS t0_r54, profiles.`where` AS t0_r55, profiles.`who` AS t0_r56, profiles.`craziest` AS t0_r57, profiles.`associations` AS t0_r58, profiles.`primary_image_id` AS t0_r59, profiles.`custom_url_name` AS t0_r60, profiles.`custom_url_hits` AS t0_r61, profiles.`web_presence_data` AS t0_r62, profiles.`has_portfolio_image` AS t0_r63, profiles.`has_portfolio_audio` AS t0_r64, profiles.`has_portfolio_remote_video` AS t0_r65, profiles.`has_portfolio_writing` AS t0_r66, profiles.`last_portfolio_item_created_at` AS t0_r67, profiles.`has_portfolio_video` AS t0_r68, profiles.`spotlighted_at` AS t0_r69, profiles.`clan_name` AS t0_r70, genres.`id` AS t1_r0, genres.`channel_id` AS t1_r1, genres.`name` AS t1_r2, genres.`display` AS t1_r3, roles.`id` AS t2_r0, roles.`name` AS t2_r1, roles.`display` AS t2_r2, channels.`id` AS t3_r0, channels.`name` AS t3_r1, channels.`ad_click_id` AS t3_r2, channels.`ad_zone_id` AS t3_r3, channels.`indieclicks_zone_id` AS t3_r4, portfolios.`id` AS t4_r0, portfolios.`profile_id` AS t4_r1, portfolios.`type` AS t4_r2, portfolios.`created_at` AS t4_r3, portfolios.`updated_at` AS t4_r4
FROM profiles
LEFT OUTER JOIN profile_genres ON ( profiles.`id` = profile_genres.`profile_id` )
LEFT OUTER JOIN genres ON ( genres.`id` = profile_genres.`genre_id` )
LEFT OUTER JOIN profile_roles ON ( profiles.`id` = profile_roles.`profile_id` )
LEFT OUTER JOIN roles ON ( roles.`id` = profile_roles.`role_id` )
LEFT OUTER JOIN channels ON channels.id = profiles.channel_id
LEFT OUTER JOIN portfolios ON portfolios.profile_id = profiles.id
AND portfolios.`type` = 'FreePortfolio'
WHERE ((profiles.channel_id =9)
AND (profiles.user_activated =1))
AND (featured_at IS NOT NULL)
AND profiles.id IN ('1051301', '1047701', '805801', '1016601', '566501', '215501', '605501', '301901', '781401', '768001', '912401', '911301')
ORDER BY featured_at DESC

A Rails developer will immediately recognize that as a typical SQL statement created by ActiveRecord to support our insatiable desire for OOPH (Object Oriented Programming Happiness). There IS a potential dark side to the dot notation we’ve come to love.

Slow query log had this to say about it:

# Query_time: 5  Lock_time: 0  Rows_sent: 2847  Rows_examined: 5645986

Notice the Rows_examined value – 5,645,986 rows! It’s a good thing our database servers have a crap-ton of RAM in them along with super-fast RAID disk arrays :-)

I then manually re-executed that same query several times and it consistently needed 1 to 3 seconds to run. Time to EXPLAIN things a little more or rather have MySQL explain it for me. Running EXPLAIN on that same query revealed the need to index the profile_id field in two tables (profile_genres and profile_roles).

After taking a minute to create the new indexes (including making new migration files and whatnot), I ran EXPLAIN again and was pleased to see MySQL needed to look at only 18 rows of data now instead of 5,645,986. Furthermore, the SAME query that previously needed several seconds was now consistently finishing in 0.007 seconds.

How the hell did we let this happen?!?! Simple: when juggling too many balls some inevitably get dropped. Ok, that’s only part of it. Another part is resisting the temptation to do premature optimizations to your systems. Arguably, in this case, the indexes should have been created when those data associations were set up. This is why we use the slow query logging and must periodically check for surprises like this.

Keep in mind that data indexing comes at a price. Indexing can be a double-edged sword in the sense that those indexes must be maintained when writing data to indexed tables. It is quite possible, even easy, to kill database write performance if you go too far and index too many things trying to speed up read performance.

Checking your slow query logs frequently and making cautious indexing changes will help more than most people realize. Especially when your tables grow beyond a few thousand rows.

Sunday Ride Report

Sunday, November 11th, 2007

I only have time for the highlights…

  • Perfect weather.
  • A dozen skilled riders all wearing the appropriate safety gear.
  • No crashes.
  • No financial penalties.
  • Countless excellent roads south of SR60. Thanks to Dave for the tour guide service!
  • Surprisingly good food at the “trough” (Golden Corral) in Lake Placid.
  • 400.28 miles for me today.
  • I need a new rear tire. I’m going to try the new dual compound Pilot Road 2.

Concurrent Image Processing – Part One

Wednesday, November 7th, 2007

I’m building a new web site using the Ruby on Rails framework. Our users will be uploading countless image files (eventually there will be many millions of image files!). After each image file is uploaded, and depending on where that image will later be displayed, we must crop and resize it to make various sized thumbnail versions and display versions of the originally uploaded image file. Sounds simple, right? If only… In a perfect world:

  • The image would be optimized with respect to pixel dimensions and quality/compression. It would be no larger than the largest display size we use in the site and would be suitably compressed for display on a web page. This would reduce upload time, conserve filesystem space on the server, ensure the fastest possible page loading time, and reduce our bandwidth costs.
  • GIF, JPG, and PNG formats would be used appropriately.
  • They would all be RGB instead of CMYK. The CMYK vs RGB issues are complicated, but generally speaking CMYK is for print and RGB is for computer screens.

However our world isn’t so perfect because we see a lot of:

  • 8+ megapixel images straight from the user’s camera, each consuming several megabytes of disk space and taking many minutes for the user to upload.
  • Absolutely no attempt at JPG compression.
  • GIF files that most certainly needed to be JPG and vice versa.
  • the occasional CMYK file which doesn’t survive the conversion to RGB.

In reality, it’s just a whole lot easier if we let the user upload whatever they want and we’ll “fix” it on the server side. Understandably, most users have no clue about resizing, cropping, JPG compression, and all the other things we must deal with.

The real fun begins AFTER the file is uploaded. There are two general approaches for handling the server-side processing of the user’s uploaded image.

  1. Serialized processing: In the same process thread that is serving up the web page, we could do all the cropping, resizing, and compressing that is needed in a step-by-step serialized fashion followed by placing the resulting set of images on the image servers. There are 2 main problems with this approach: 1) The user’s browser session is literally put on hold and the browser will appear to be frozen during this processing and 2) During the processing, that connection to the web server is tied up and cannot service other requests thus making the application less scalable. This also happens to be the easiest way to do it, and in fact is how many sites handle such things.
  2. Concurrent processing: With almost no interruption of the user’s browsing session (other than the time it takes for the initial upload), we can place unprocessed uploads into a queue and have a SEPARATE process do all the work. This allows the user to continue using the site while their upload is being processed in the background on a different server that is dedicated to image processing. This processing approach is going to require more time to architect and implement.

For obvious reasons, I wanted to take the concurrent processing approach. To ensure that our application will scale up to millions of users, we are using a cluster of servers. Each of the servers handles a specific set of tasks related to front-end, image serving, image processing, audio serving, audio processing, database cluster, etc. The initial upload goes to the particular application server a particular user is hitting.

So, those are the issues….. what do we do?

Conceptually, the steps are actually simple and I’ll break it down by server type:

  • Application Servers: User uploads a new image (we call them assets) to whichever application server is handling their session. Behind the scenes, this consists of an application server receiving the uploaded file and placing the uploaded file in a pre-determined location on the application server which makes it available for “pick up” by an asset_processor program. The uploaded item is then inserted into an “asset processing queue”.
  • Asset Processing Servers: We have several servers each running multiple instances of an asset_processor.rb script which frequently looks in the “asset processing queue” for work to do. When a new item is found in the queue that isn’t already being processed, the asset_processor timestamps the started_at field for the item, copies the item to a local working directory and begins the processing of the item. When the processing is finished, the final set of files is copied over to the appropriate Asset Servers such that the world can get to them. After processing, the originally uploaded files are removed from the application server they were uploaded to and the new asset is made “active”. During the time period between the initial upload and completion of processing, the asset is considered to be “inactive” which causes our application to serve up placeholder slug images. As soon as the asset is made active by the processing script, the application magically starts serving up the processed images.
  • Asset Servers: we have images, audio, and (eventually) video asset servers. Fully processed assets are stored on these web servers.

Of course, no article about web development using Ruby on Rails is worth reading unless I include at least a little bit of code…

In “Part Two” of this article I will discuss several sections of the code doing some of the above described work.