Parsing XML with Ruby

Tháng Hai 17, 2009

Source: http://railstips.org/2008/8/12/parsing-xml-with-ruby

Just for kicks and giggles, I decided to parse xml with each of the main libraries in Ruby (REXML, Hpricot, libxml-ruby), so I could see the differences between them in both API (getting at elements and attributes) and speed. I did two different xml formats. The first, Delicious, uses an attribute based approach, and the second, Twitter, uses a more elemental one. If you look at the xml files linked below, the previous sentence might make more sense.

Note: This is not for scientific and speed purposes but rather to get a feel for each of the libraries and how you traverse xml nodes and such with them.

The XML

Here are the files I used for reference. You’ll have to view source once you click on one of these links to actually see the xml.

  • posts.xml – Uses xml element for object (post) and xml attributes for object attributes
  • timeline.xml – Uses xml element for object (status) and child xml elements for attributes

REXML

Pros: In the standard library
Cons: Slow, I don’t like the name

%w[benchmark pp rexml/document].each { |x| require x }

##################################
# Parsing Delicious API Response #
##################################
xml = File.read('posts.xml')
puts Benchmark.measure {
  doc, posts = REXML::Document.new(xml), []
  doc.elements.each('posts/post') do |p|
    posts << p.attributes
  end
  # pp posts
}

################################
# Parsing Twitter API Response #
################################
xml = File.read('timeline.xml')
puts Benchmark.measure {
  doc, statuses = REXML::Document.new(xml), []
  doc.elements.each('statuses/status') do |s|
    h = {:user => {}}
    %w[created_at id text source truncated in_reply_to_status_id in_reply_to_user_id favorited].each do |a|
      h[a.intern] = s.elements[a].text
    end
    %w[id name screen_name location description profile_image_url url protected followers_count].each do |a|
      h[:user][a.intern] = s.elements['user'].elements[a].text
    end
    statuses << h
  end
  # pp statuses
}

Hpricot

Pros: Cool name, created by _why, faster than REXML, also does HTML, creative API
Cons: Not as fast as libxml-ruby, more of an HTML parser linguistically (ie: uses innerHTML instead of text or content, etc.)

%w[benchmark pp rubygems].each { |x| require x }
gem 'hpricot', '>= 0.6'
require 'hpricot'

##################################
# Parsing Delicious API Response #
##################################
xml = File.read('posts.xml')
puts Benchmark.measure {
  doc, posts = Hpricot::XML(xml), []
  (doc/:post).each do |p|
    posts << p.attributes
  end
  # pp posts
}

################################
# Parsing Twitter API Response #
################################
xml = File.read('timeline.xml')
puts Benchmark.measure {
  doc, statuses = Hpricot::XML(xml), []
  (doc/:status).each do |s|
    h = {:user => {}}
    %w[created_at id text source truncated in_reply_to_status_id in_reply_to_user_id favorited].each do |a|
      h[a.intern] = s.at(a).innerHTML
    end
    %w[id name screen_name location description profile_image_url url protected followers_count].each do |a|
      h[:user][a.intern] = s.at('user').at(a).innerHTML
    end
    statuses << h
  end
  # pp statuses
}

libxml-ruby

Pros: Blistering fast
Cons: Hpricot has cooler name, REXML and Hpricot both feel easier to use out of the box

%w[benchmark pp rubygems].each { |x| require x }
gem 'libxml-ruby', '>= 0.8.3'
require 'xml'

##################################
# Parsing Delicious API Response #
##################################
xml = File.read('posts.xml')
puts Benchmark.measure {
  parser, parser.string = XML::Parser.new, xml
  doc, posts = parser.parse, []
  doc.find('//posts/post').each do |p|
    posts << p.attributes.inject({}) { |h, a| h[a.name] = a.value; h }
  end
  # pp posts
}

################################
# Parsing Twitter API Response #
################################
xml = File.read('timeline.xml')
puts Benchmark.measure {
  parser, parser.string = XML::Parser.new, xml
  doc, statuses = parser.parse, []
  doc.find('//statuses/status').each do |s|
    h = {:user => {}}
    %w[created_at id text source truncated in_reply_to_status_id in_reply_to_user_id favorited].each do |a|
      h[a.intern] = s.find(a).first.content
    end
    %w[id name screen_name location description profile_image_url url protected followers_count].each do |a|
      h[:user][a.intern] = s.find('user').first.find(a).first.content
    end
    statuses << h
  end
  # pp statuses
}

Conclusion

I’ll probably start using libxml-ruby but Hpricot is more fun (and I’ve used it a ton). Oh, if you are curious, this was the output from the scripts above on my machine.

=rexml
delicious     0.020000   0.000000   0.020000 (  0.021139)
twitter       0.940000   0.020000   0.960000 (  0.988666)

=hpricot
delicious     0.010000   0.000000   0.010000 (  0.005548)
twitter       0.250000   0.010000   0.260000 (  0.258320)

=libxml-ruby
delicious     0.000000   0.000000   0.000000 (  0.007829)
twitter       0.030000   0.010000   0.040000 (  0.034040)

The twitter one is slower because of the loops and hashes most likely. I doubt it has much to do with the actual parsing, though it is a larger file and would be a bit slower.

REXML: Processing XML in Ruby

Tháng Hai 12, 2009

REXML (Ruby Electric XML) is the XML processor of choice for Ruby programmers. It comes bundled with the standard Ruby distribution. It’s fast, written in Ruby, and can be used in two ways: tree parsing and stream parsing. In this article, we show some basic constructs on how to use REXML for XML processing. We also introduce the use of Ruby’s interactive debugger irb for exploring XML documents with the help of REXML. We’ll be using a DocBook bibliography file as example XML document. You will learn how to parse the document with the tree parsing API, to access elements and attributes, and to create and insert elements. We’ll also look into the peculiarities of text nodes and entity processing. Finally, we will show an example use of the stream parsing API

Follow the tutorial here:

http://www.xml.com/pub/a/2005/11/09/rexml-processing-xml-in-ruby.html