Extracting email addresses with Ruby
Tuesday, October 16th, 2007
Today’s challenge was refactoring the code we use for extracting email addresses from a contacts export file. The requirement was to extract addresses from CSV, tab-delimited, or LDIF files, the formats typically exported from a desktop email program’s address book.
The original developer’s approach was to hunt down some rubygems or plug-ins with which to parse these files “properly” according to their file type. It turns out there’s plenty of ready-made code available for this. That approach was just fine and would have worked perfectly once we fixed a couple of very minor problems.
Taking a step back and thinking about the problem brought me to the realization that all I really cared about was extracting unique email addresses from any file in which those email addresses are human-readable. Regular Expressions are the nearly perfect answer to this problem!
Here’s my solution, presented as a one-line standalone program that reads from stdin (standard input). Put this in a file named email_extractor, place that file in your path, and make it executable:
#!/usr/bin/env ruby
# if you don't like my email matching regex, replace it with your own…
STDIN.read.gsub('mail=', '').scan(/[A-Z0-9._=%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i).uniq.sort.each { |email| puts email }
If you’re paying attention, you may be wondering why I have the gsub('mail=', '') in there. I found that LDIF export files contain lines like the one below. Because my regex allows “=” in the part before the “@”, such a line would otherwise cause my code to mistakenly return ‘mail=name@domain.com’ as an email address:
dn: mail=name@domain.com
Then invoke it like this and watch in amazement as every email address in the source file is extracted, de-duplicated, sorted, and printed to stdout:
cat any_file_containing_email_addresses | email_extractor
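To see what the gsub('mail=', '') step buys you, here is a small sketch that runs the same pipeline with and without it. The helper name extract_emails, its strip_ldif_prefix option, and the sample data are my own inventions for illustration, and the pattern is written with the dot before the top-level domain escaped:

```ruby
# Same character classes as the one-liner; the dot before the TLD is escaped.
EMAIL_PATTERN = /[A-Z0-9._=%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i

# Hypothetical helper (not from the original code): the one-liner's pipeline,
# with the LDIF cleanup made optional so both behaviours can be compared.
def extract_emails(text, strip_ldif_prefix: true)
  text = text.gsub('mail=', '') if strip_ldif_prefix
  text.scan(EMAIL_PATTERN).uniq.sort
end

# Made-up sample input: one LDIF line plus two CSV rows sharing an address.
SAMPLE = <<~TEXT
  dn: mail=name@domain.com
  "Jane Doe",jane@example.org,555-1234
  "Jane Again",jane@example.org,555-9999
TEXT

p extract_emails(SAMPLE, strip_ldif_prefix: false)
# => ["jane@example.org", "mail=name@domain.com"]
p extract_emails(SAMPLE)
# => ["jane@example.org", "name@domain.com"]
```

Note that the {2,4} limit on the last part of the pattern will truncate longer top-level domains, so swap in your own regex if your contacts need it.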
Of course, you can take my code, wrap it up in a function for inclusion in your program, and revise it as needed. For example, in my case today, I was refactoring some existing code which returned its result in a specific array format. So I ended up with this one-line solution, which replaces the previous 25 lines of code and returns the same array structure as the original function did:
def extract_emails_from_file
  @uploaded_file.read.gsub('mail=', '').scan(/[A-Z0-9._=%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i).uniq.sort.map { |email| ['', email] } rescue return []
end
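If you want to try the method outside of its original home, something like the following should do. This is only a sketch: the ContactImporter class is a made-up host (the real one isn't shown here), a StringIO stands in for the uploaded file, and I've written the rescue at the method level, which catches the same StandardError family as a trailing rescue:

```ruby
require 'stringio'

# Hypothetical host class; anything that responds to #read can play the
# role of the uploaded file.
class ContactImporter
  def initialize(uploaded_file)
    @uploaded_file = uploaded_file
  end

  # Returns [['', email], ...] pairs, or [] if anything goes wrong.
  def extract_emails_from_file
    @uploaded_file.read.gsub('mail=', '')
                  .scan(/[A-Z0-9._=%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i)
                  .uniq.sort.map { |email| ['', email] }
  rescue StandardError
    []
  end
end

# Made-up CSV content with one duplicated address.
csv = StringIO.new(%("Bob","bob@example.com"\n"Al","al@example.com"\n"Bob","bob@example.com"\n))
p ContactImporter.new(csv).extract_emails_from_file
# => [["", "al@example.com"], ["", "bob@example.com"]]

# nil has no #read, so the NoMethodError is rescued and [] comes back.
p ContactImporter.new(nil).extract_emails_from_file
# => []
```

Keep in mind that swallowing every error this way also hides genuine bugs, so you may prefer to rescue something narrower in your own code.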
This simple bit of refactoring resulted in a much more flexible email address importing solution, because it will work with any file type in which the addresses appear as readable text. Plus it didn’t break any of the tests we’d already written!
Credit goes to my friend K. Adam Christensen for suggesting I apply some “Rubyisms” to condense my original 3 lines of code down to one line.