Parsing XML with Ruby

February 17, 2009

Source: http://railstips.org/2008/8/12/parsing-xml-with-ruby

Just for kicks and giggles, I decided to parse XML with each of the main Ruby libraries (REXML, Hpricot, libxml-ruby) so I could compare both their APIs (getting at elements and attributes) and their speed. I used two different XML formats. The first, Delicious, takes an attribute-based approach; the second, Twitter, takes a more element-based one. The XML files listed below should make that distinction clearer.

Note: This is not a rigorous benchmark. The goal is to get a feel for each library and for how you traverse XML nodes with it.

The XML

Here are the files I used for reference. You’ll have to view source once you click on one of these links to actually see the XML.

  • posts.xml – Uses an XML element for each object (post) and XML attributes for the object’s attributes
  • timeline.xml – Uses an XML element for each object (status) and child XML elements for the object’s attributes
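To make the two shapes concrete, here is a minimal sketch using REXML. The XML snippets are made-up, trimmed-down stand-ins (the real posts.xml and timeline.xml carry many more fields), and the constant and variable names are only for illustration.

```ruby
require 'rexml/document'

# Hypothetical, trimmed-down samples; not the actual API responses.
ATTRIBUTE_STYLE = <<~XML
  <posts user="example">
    <post href="http://example.com/" description="Example" tag="ruby xml"/>
  </posts>
XML

ELEMENT_STYLE = <<~XML
  <statuses>
    <status>
      <id>1</id>
      <text>Hello</text>
      <user>
        <screen_name>example</screen_name>
      </user>
    </status>
  </statuses>
XML

# Attribute style: the data lives on the element itself.
post = REXML::Document.new(ATTRIBUTE_STYLE).elements['posts/post']
post_href = post.attributes['href']

# Element style: the data lives in the text of child elements.
status = REXML::Document.new(ELEMENT_STYLE).elements['statuses/status']
status_text = status.elements['text'].text
screen_name = status.elements['user/screen_name'].text

puts post_href, status_text, screen_name
```

With the attribute style one lookup per object is enough, while the element style needs one lookup per field, which is why the Twitter scripts loop over a list of field names.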

REXML

Pros: In the standard library
Cons: Slow, I don’t like the name

%w[benchmark pp rexml/document].each { |x| require x }

##################################
# Parsing Delicious API Response #
##################################
xml = File.read('posts.xml')
puts Benchmark.measure {
  doc, posts = REXML::Document.new(xml), []
  doc.elements.each('posts/post') do |p|
    posts << p.attributes
  end
  # pp posts
}

################################
# Parsing Twitter API Response #
################################
xml = File.read('timeline.xml')
puts Benchmark.measure {
  doc, statuses = REXML::Document.new(xml), []
  doc.elements.each('statuses/status') do |s|
    h = {:user => {}}
    %w[created_at id text source truncated in_reply_to_status_id in_reply_to_user_id favorited].each do |a|
      h[a.intern] = s.elements[a].text
    end
    %w[id name screen_name location description profile_image_url url protected followers_count].each do |a|
      h[:user][a.intern] = s.elements['user'].elements[a].text
    end
    statuses << h
  end
  # pp statuses
}
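One thing to watch in the Twitter loop above: s.elements[a] returns nil when a child element is missing, so calling .text on it raises NoMethodError. The sketch below is one possible way to guard against that. The helper name fields_from and the sample XML are my own assumptions, not part of the original scripts.

```ruby
require 'rexml/document'

# Hypothetical helper: copies the text of the named child elements into a hash,
# storing nil for any child that is missing from the node.
def fields_from(node, names)
  names.each_with_object({}) do |name, hash|
    child = node.elements[name]
    hash[name.to_sym] = child && child.text
  end
end

# Made-up sample with <source> and <location> left out on purpose.
xml = '<statuses><status><id>1</id><text>Hello</text>' \
      '<user><name>Example</name></user></status></statuses>'

statuses = []
REXML::Document.new(xml).elements.each('statuses/status') do |s|
  h = fields_from(s, %w[id text source])
  h[:user] = fields_from(s.elements['user'], %w[name location])
  statuses << h
end

p statuses
```

Missing fields simply come back as nil instead of aborting the whole run.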

Hpricot

Pros: Cool name, created by _why, faster than REXML, also does HTML, creative API
Cons: Not as fast as libxml-ruby, more of an HTML parser linguistically (i.e., it uses innerHTML instead of text or content, etc.)

%w[benchmark pp rubygems].each { |x| require x }
gem 'hpricot', '>= 0.6'
require 'hpricot'

##################################
# Parsing Delicious API Response #
##################################
xml = File.read('posts.xml')
puts Benchmark.measure {
  doc, posts = Hpricot::XML(xml), []
  (doc/:post).each do |p|
    posts << p.attributes
  end
  # pp posts
}

################################
# Parsing Twitter API Response #
################################
xml = File.read('timeline.xml')
puts Benchmark.measure {
  doc, statuses = Hpricot::XML(xml), []
  (doc/:status).each do |s|
    h = {:user => {}}
    %w[created_at id text source truncated in_reply_to_status_id in_reply_to_user_id favorited].each do |a|
      h[a.intern] = s.at(a).innerHTML
    end
    %w[id name screen_name location description profile_image_url url protected followers_count].each do |a|
      h[:user][a.intern] = s.at('user').at(a).innerHTML
    end
    statuses << h
  end
  # pp statuses
}

libxml-ruby

Pros: Blisteringly fast
Cons: Hpricot has a cooler name, REXML and Hpricot both feel easier to use out of the box

%w[benchmark pp rubygems].each { |x| require x }
gem 'libxml-ruby', '>= 0.8.3'
require 'xml'

##################################
# Parsing Delicious API Response #
##################################
xml = File.read('posts.xml')
puts Benchmark.measure {
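  # Assigned in two steps so parser exists before parser.string= is called,
  # regardless of how Ruby orders multiple assignment.
  parser = XML::Parser.new
  parser.string = xml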
  doc, posts = parser.parse, []
  doc.find('//posts/post').each do |p|
    posts << p.attributes.inject({}) { |h, a| h[a.name] = a.value; h }
  end
  # pp posts
}

################################
# Parsing Twitter API Response #
################################
xml = File.read('timeline.xml')
puts Benchmark.measure {
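  # Assigned in two steps so parser exists before parser.string= is called,
  # regardless of how Ruby orders multiple assignment.
  parser = XML::Parser.new
  parser.string = xml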
  doc, statuses = parser.parse, []
  doc.find('//statuses/status').each do |s|
    h = {:user => {}}
    %w[created_at id text source truncated in_reply_to_status_id in_reply_to_user_id favorited].each do |a|
      h[a.intern] = s.find(a).first.content
    end
    %w[id name screen_name location description profile_image_url url protected followers_count].each do |a|
      h[:user][a.intern] = s.find('user').first.find(a).first.content
    end
    statuses << h
  end
  # pp statuses
}

Conclusion

I’ll probably start using libxml-ruby but Hpricot is more fun (and I’ve used it a ton). Oh, if you are curious, this was the output from the scripts above on my machine.

=rexml
delicious     0.020000   0.000000   0.020000 (  0.021139)
twitter       0.940000   0.020000   0.960000 (  0.988666)

=hpricot
delicious     0.010000   0.000000   0.010000 (  0.005548)
twitter       0.250000   0.010000   0.260000 (  0.258320)

=libxml-ruby
delicious     0.000000   0.000000   0.000000 (  0.007829)
twitter       0.030000   0.010000   0.040000 (  0.034040)

The Twitter one is most likely slower because of the loops and hashes. I doubt it has much to do with the actual parsing, though it is a larger file, so parsing would be a bit slower too.
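One way to check that hunch would be to time the parse and the traversal separately. Here is a rough sketch with REXML and Benchmark.bm, which also prints a label in front of each row. The generated XML is a made-up stand-in for timeline.xml, so the numbers it prints are not comparable to the ones above.

```ruby
require 'benchmark'
require 'rexml/document'

# Made-up stand-in for timeline.xml: 200 statuses with two child elements each.
xml = '<statuses>' +
      (1..200).map { |i| "<status><id>#{i}</id><text>status #{i}</text></status>" }.join +
      '</statuses>'

doc = nil
statuses = []

Benchmark.bm(10) do |bm|
  # Time spent turning the string into a document tree.
  bm.report('parse') { doc = REXML::Document.new(xml) }

  # Time spent walking the tree and building the hashes.
  bm.report('traverse') do
    doc.elements.each('statuses/status') do |s|
      statuses << { id: s.elements['id'].text, text: s.elements['text'].text }
    end
  end
end

puts "#{statuses.size} statuses collected"
```

If the traverse row dominates, the loops and lookups are the bottleneck rather than the parser itself.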
