Archive for October, 2007

Extracting email addresses with Ruby

Tuesday, October 16th, 2007

Today’s challenge was refactoring the code we use for extracting email addresses from a contacts export file. The requirement was to extract addresses from CSV, tab-delimited, or LDIF files, the kind typically exported from a desktop email program’s address book.

The original developer’s approach was to hunt down some RubyGems or plug-ins to parse these files “properly” according to their file type. It turns out there’s plenty of ready-made code available for this. That approach was just fine and would have worked perfectly after we fixed a couple of very minor problems.

Taking a step back and thinking about the problem brought me to the realization that all I really cared about was extracting unique email addresses from any file in which those email addresses are human-readable. Regular expressions are a nearly perfect answer to this problem!

Here’s my solution presented as a one-line standalone program that reads from stdin (standard input). Put this in a file named email_extractor, place that file in your path, and make it executable:

#!/usr/bin/env ruby
# if you don't like my email matching regex, replace it with your own...
STDIN.read.gsub('mail=', '').scan(/[A-Z0-9._=%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i).uniq.sort.each { |email| puts email }

If you’re paying attention, you may be wondering why I have the gsub('mail=', '') in there. I found that LDIF export files contain lines like the one below. Because my regex allows “=” before the “@”, such a line would cause my code to mistakenly return ‘mail=name@domain.com’ as an email address:

dn: mail=name@domain.com
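
To see the effect in isolation, here is a minimal sketch that runs the regex over that LDIF line with and without the gsub step. The names EMAIL_REGEX and ldif_line are made up for this illustration, and the sketch assumes the dot before the top-level domain is meant to match a literal period, so it is escaped.

```ruby
# Illustration only: EMAIL_REGEX and ldif_line are names made up for this sketch.
EMAIL_REGEX = /[A-Z0-9._=%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i

ldif_line = "dn: mail=name@domain.com"

# Without the gsub, "=" is an allowed character, so the "mail=" prefix is swallowed.
without_gsub = ldif_line.scan(EMAIL_REGEX)

# With the gsub, the prefix is removed before scanning.
with_gsub = ldif_line.gsub('mail=', '').scan(EMAIL_REGEX)

puts without_gsub.inspect  # ["mail=name@domain.com"]
puts with_gsub.inspect     # ["name@domain.com"]
```

Another option would be to drop “=” from the character class, but that changes which addresses the regex accepts, so the gsub is the smaller change.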

Then invoke it like this and watch in amazement as every email address in the source file is extracted, de-duplicated, sorted, and printed to stdout:

cat any_file_containing_email_addresses | email_extractor

Of course, you can take my code, wrap it up in a function for inclusion in your program, and revise it as needed. For example, in my case today, I was refactoring some existing code which returned a result in a specific array format. So I ended up with this one-line solution to replace the previous 25 lines of code. It returns the same array structure as the original function did:

def extract_emails_from_file
  @uploaded_file.read.gsub('mail=', '').scan(/[A-Z0-9._=%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i).uniq.sort.map { |email| ['', email] }
rescue
  return []
end

This simple bit of refactoring resulted in a much more flexible email address importing solution because it will work with any file type. Plus it didn’t break any of the tests we’d already written!
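
As a rough check of the “any file type” claim, the sketch below feeds three made-up exports (CSV, tab-delimited, and LDIF) through the same pipeline. The helper name emails_from and the sample data are assumptions for illustration, and StringIO stands in for a real file or upload.

```ruby
require 'stringio'

# Hypothetical helper for this sketch; it mirrors the one-liner but takes any IO-like object.
def emails_from(io)
  io.read.gsub('mail=', '')
    .scan(/[A-Z0-9._=%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i)
    .uniq.sort
end

# Made-up sample exports in three formats.
csv  = "Name,Email\nBob,bob@example.com\nAnn,ann@example.org\n"
tabs = "Name\tEmail\nBob\tbob@example.com\nAnn\tann@example.org\n"
ldif = "dn: mail=bob@example.com\nmail: bob@example.com\n\n" \
       "dn: mail=ann@example.org\nmail: ann@example.org\n"

# Each format should print the same two addresses.
[csv, tabs, ldif].each do |text|
  puts emails_from(StringIO.new(text)).inspect
end
```

Note that uniq compares strings exactly, so the same address in different letter case would be listed twice. Downcasing before uniq is one way to handle that if it matters for your data.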

Credit goes to my friend K. Adam Christensen for suggesting I apply some “Rubyisms” to condense my original 3 lines of code down to one line.

Sunday Ride Report

Sunday, October 14th, 2007

The Sunday ride, for me, usually means getting out of bed at 0615 so that I have time to get ready and be at the designated breakfast meeting place by 0800. Today I was meeting up with Jimmy and crew in Mt. Dora at Highland Cafe, and they were riding up from as far south as Bradenton (in the case of ZX-10 Chris). This meant I could sleep until 0830 and still have plenty of time to meet them for a 1000 breakfast in Mt. Dora.

Just last evening I had studied the weather predictions and was looking forward to the much cooler weather (down in the 60’s) this morning and a full day of sunshine. Imagine my surprise when I looked out the window to see nothing but dark clouds, wet streets, and rain still coming down. A quick check of the weather radar tricked me into believing it was probably going to be bright and clear over toward the planned riding area.

Around 0910 I rolled out of the garage onto the wet pavement into the steady rain. Forty minutes later, as I was nearing the Mt. Dora area, I was bummed out to still be riding in light rain, on wet roads, with clouds as far as I could see in all directions.

GSXR Jimmy, GSXR Ian, ZX10 Chris, and Tuono Jeff rolled into the Highland Cafe parking lot at 1000. Even though they had already had breakfast much earlier, Jimmy was not going to be denied the best grits to be found anywhere. Before we got a table, Jeff had to leave and missed out on the fun we’d have later. Sometime around 1100 we were ready to roll out.

With pavement still wet and clouds hanging low, we charged ahead anyway and traveled several of the Orlando crew’s favorite roads. Jimmy wanted to lead the group on some back roads between 19 and 450, all south of 42, that I’d never been on before. We headed west on 42 from 19 and soon turned left onto these wonderful little back roads I knew nothing about. There was a solid 20+ minutes of great riding back in there that I simply must remember to show Orlando Rick in case he hasn’t seen them before.

From there we did Emeralda off of 452 in both directions, coming back out to 452 and 42. Then a left turn over to 182, at which point regular readers know exactly what happened next. Because of my little 3rd gear wet pavement spin-up incident through here last week, I was not going to attempt my normal pace today on 182. Not to worry though... Chris and Ian way more than made up for it by leaving Jimmy and me far behind. Except for the few miles where Jimmy and I ran our GSXR1000’s as hard as they could possibly go. Jimmy: are we totally f’ing insane??? or what?

At what is usually considered the “north end” of our Ocala loop, Jimmy had a few more surprises in store for me. Today I learned there’s another fantastic 20+ minutes of GOOD roads northwest of the 40/314a intersection where Rick and I usually stop for a beverage. I hope I will remember them all for next time I’m up there with the Orlando riders.

Back down to the intersection of 42 and 182. I turned left to head home while the others went right to begin the _long_ journey back to Tampa and far beyond. I rolled into my garage with around 200 miles logged for the day.

Just before we split up I was a little shocked to see my low fuel light blinking at the 100 mile mark today, but I quickly realized there was good reason for this. We had been running at a very spirited pace for over an hour.

Jimmy: It is clear to me that you’ve invested some serious time into the reconnaissance of the areas we rode today. And many other areas up there too. Good Job!!!

Ian: It was great to see you after what? 5+ years.

Chris: Was it a 500+ mile ride for you today??

Linear Interpolation with Ruby

Monday, October 8th, 2007

Who needs curve-fitting when the much simpler Linear Interpolation will suffice?

# Linear interpolation. Takes two known data points, say (xa,ya) and (xb,yb),
# and the interpolant is given by:
#
#      y = ya + ((x - xa) * (yb - ya) / (xb - xa)) at the point (x,y)
#
# Arguments are:
#
#    ** known_data_points is a hash of key => value pairs where key is
#       the "x" values and value is the "y" values. Example hash:
#         {  0 => 10,
#           10 => 90,
#           95 => 280,
#          100 => 300 }
#
#    ** x is the known point somewhere in the range of "x" values
#
# Solves for y, a point somewhere in the range of supplied y values which
# is relative to the position of the x point in the supplied x values.
#
# Example results of calling the function with the above example hash
# and these known x values:
#     x = 5    => 50
#     x = 10   => 90
#     x = 50   => 179
#
# Thusly, you can supply as FEW or as MANY known data points as is needed to approximate
# any imaginable curve fitting equation. Simply keep in mind that linear interpolation is
# being performed between the two closest known points. You may only need 4 or 5 known data
# points to represent a gentle sloping curve WHEREAS you may need 15 or more known points
# to approximate a more aggressive curve shape.
#
class InterpolationError < StandardError; end

def interpolate(known_data_points, x)
  xs = known_data_points.keys.sort

  # find the nearest known x value at or below the provided x value...
  xmin = xs.reverse.find { |k| k <= x }

  # find the nearest known x value at or above the provided x value...
  xmax = xs.find { |k| k >= x }

  # if supplied argument "x" is outside the range of known "x" values, bail out now!
  raise InterpolationError if xmin.nil? || xmax.nil?

  # if supplied argument "x" falls exactly on a known x data point, simply return
  # the relative y value now! (this also prevents divide by zero errors below)
  return known_data_points[xmin] if xmin == xmax

  ymin = known_data_points[xmin]
  ymax = known_data_points[xmax]

  # finally, interpolate and return the answer!
  # (with integer inputs this is integer division, so the result is rounded down)
  ymin + ((x - xmin) * (ymax - ymin)) / (xmax - xmin)
end
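
The example results in the comment block come out as whole numbers because Ruby divides integers with integer division, which is why x = 50 gives 179 rather than about 179.41. If fractional results are wanted, a floating-point variant is one option. The sketch below is self-contained and uses a made-up name, lerp_float, plus ArgumentError instead of a custom error class; neither is from the original function.

```ruby
# Hypothetical float-based variant; the name lerp_float is an assumption for this sketch.
def lerp_float(known_data_points, x)
  xs = known_data_points.keys.sort

  # Nearest known x values at or below, and at or above, the requested x.
  xa = xs.reverse.find { |k| k <= x }
  xb = xs.find { |k| k >= x }
  raise ArgumentError, "x is outside the known range" if xa.nil? || xb.nil?

  # x sits exactly on a known point, so no interpolation is needed.
  return known_data_points[xa].to_f if xa == xb

  ya = known_data_points[xa]
  yb = known_data_points[xb]

  # Converting the denominator to a float forces floating-point division.
  ya + (x - xa) * (yb - ya) / (xb - xa).to_f
end

points = { 0 => 10, 10 => 90, 95 => 280, 100 => 300 }
puts lerp_float(points, 5)            # 50.0
puts lerp_float(points, 50).round(2)  # 179.41
```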

Sunday Ride Report

Sunday, October 7th, 2007

Left my house in Deltona at 0655 and took the leisurely route to northwest Apopka to meet up with SV1000 Rick Norton at Ken’s Restaurant at 0800. It was already somewhat cloudy. DR650 Steve and BMW Don also joined us for breakfast. For the record, I think Robinson Restaurant just south of Ken’s has much better bacon strips. Ken’s bacon strips are puny little limp strips of mostly fat, whereas Robinson has better-quality, much larger, crispy strips. Problems like this can ruin the whole day for some people.

By around 0900 we were back on the road headed for the northern “Ocala” loop. Steve had other things to do and didn’t ride with us and Don cut his ride short after around 45 minutes (probably trying to avoid the rain). At times it even looked like we might have a rain-free day, but I had just looked at the time-lapse weather radar on my phone and knew that was probably wishful thinking. Since my bike was already dirty from the last time I rode in the rain, I didn’t care.

Rick led the way at a moderately spirited pace as we traveled the usual route up toward the much loved 182nd Ave., where we discovered that the pavement was mostly dry, and then I was motioned into the lead position. I would soon learn the pavement wasn’t as dry as I needed it to be while exiting a particular left-handed curve. I was down in 3rd gear with my TRE (Timing Retard Eliminator) turned on. I gave it a bit too much throttle and felt the rear spin up. As I was easing back out of the throttle I started questioning my desire for more low-mid power from a GSXR-1000.

We continued the planned route up to Hwy 40 where we stopped for a short break. Clouds were moving in fast but we still had hopes of avoiding the rain. I wanted to get Rick’s opinion of my bike with and without the TRE turned on, so I switched it off and we swapped bikes. The rain started very soon after the rest stop. Even though we did stop after a while to turn the TRE on, the pavement from that point forward was rarely dry enough for Rick to open the throttle in the lower gears…so we’ll do that again on dry pavement soon.

I was home by 12:30 with around 150 miles logged for the day. Oh yeah…. almost zero love bugs today until a few minutes before I got home.

ActiveRecord vs. Me

Thursday, October 4th, 2007

Most of the time AR (ActiveRecord) makes Rails developers happy... when they’re not thinking about (or suffering from) the depth of the call-stack involved. Don’t believe me? Fire up ruby-debug and start single-stepping down into the AR call-stack for the grand tour. While you’re at it, check out this nice ruby-debug tutorial at railscasts.com.

For the past many weeks I have been keeping a close eye on a complicated data processing job that runs on the back-end of our site (TalentDatabase.com). Starting with the very first time it ran, I was never happy about how long it took to finish and less happy about the load it put on the application and database servers while running.

I had done everything “correctly” as far as Rails best-practices and AR were concerned…. but it was still unacceptably sloooowwwww.

In a nutshell, we have a table/model containing around 8,000 records and the program makes several iterative passes through the data while calculating the result for each record. From an Object Oriented Programming perspective, it was damn convenient to simply write code like this:

Foo.find(:all).each do |foo|
  foo.datafield = result
  foo.save
end

and then plow through the data doing what needed to be done while happily letting AR take care of all the database interaction. But this was one of those cases where all the extra baggage AR brings along with it was really slowing things down. A LOT.

Fact of the matter is, there was a combination of factors contributing to this problem.

  • AR does a lot of ugly hard work for you by assembling and executing some very scary looking SQL statements against your data server. In my case, the issue was with the UPDATE statements including every single field in the table…which in turn means MySQL must expend even more CPU cycles maintaining indexes. Index maintenance is a GOOD thing, except for the fact that I really only needed to change the value of a single indexed data field instead of forcing MySQL to do index maintenance on the other dozen or so fields that are also indexed in that table.
  • MySQL is certainly not at fault for doing whatever is needed to maintain my data indexes. However, it can bring an entire database-driven web site to a crawl if it’s too busy to quickly service browsing requests.
  • AR simply isn’t well suited for the kind of bulk data processing I was trying to do.
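
The first point can be sketched by comparing the shape of a full-row UPDATE with one that touches only the changed column. This only builds SQL strings and does not connect to a database. The table name foos and the columns other than datafield are hypothetical, and the quoting is deliberately naive; real code should rely on the database driver’s escaping or bound parameters.

```ruby
# Hypothetical row; only "datafield" comes from the Foo example, the rest is made up.
row = { 'id' => 42, 'name' => 'example', 'category' => 'demo', 'datafield' => 7 }

# Naive quoting for illustration only (no escaping of embedded quotes).
quote = lambda { |val| val.is_a?(String) ? "'#{val}'" : val.to_s }

# Roughly the shape of a full-row UPDATE: every column is written back,
# so every indexed column is a candidate for index maintenance.
assignments = row.reject { |col, _| col == 'id' }
                 .map { |col, val| "#{col} = #{quote.call(val)}" }
full_update = "UPDATE foos SET #{assignments.join(', ')} WHERE id = #{row['id']}"

# A hand-written UPDATE that touches only the one column that changed.
narrow_update = "UPDATE foos SET datafield = #{row['datafield']} WHERE id = #{row['id']}"

puts full_update
puts narrow_update
```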

Finally, I’d had enough of it and decided it was time to try it without using AR. I revised the program to use ‘mysql’ directly and rolled my own SQL statements from top to bottom. The results were very pleasing, as follows:

Before, using ActiveRecord:

Finished pass #1 in 2.498 seconds
Finished pass #2 in 533.655 seconds
Finished pass #3 in 515.770 seconds
Processed 7563 records in 1051.923 seconds (7.190 per second)

After, with my own SQL:

Finished pass #1 in 1.206 seconds
Finished pass #2 in 3.426 seconds
Finished pass #3 in 2.112 seconds
Processed 7658 records in 6.744 seconds (1135.468 per second)

Considering that nothing changed other than how I’m accessing the database in this otherwise complicated sequence of calculations, this is a HUGE improvement in execution time: roughly 156 times faster overall (1051.923 seconds down to 6.744 seconds).
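
For a per-pass view, the ratios can be worked out from the timings above with a few lines of Ruby. This is just arithmetic on the reported numbers; keep in mind the two runs processed slightly different record counts (7563 vs. 7658), so the comparison is approximate.

```ruby
# Elapsed seconds per pass, copied from the timing output above.
before = [2.498, 533.655, 515.770]  # using ActiveRecord
after  = [1.206, 3.426, 2.112]      # using hand-written SQL

# Speedup for each pass.
before.zip(after).each_with_index do |(b, a), i|
  puts format("pass #%d: %.1fx faster", i + 1, b / a)
end

# Speedup over all three passes combined.
overall = before.inject(:+) / after.inject(:+)
puts format("overall: %.1fx faster", overall)
```

Passes 2 and 3 account for nearly all of the original run time, so that is where almost all of the gain comes from.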