Parsing Big XML Files with Nokogiri

Parsing XML files is a pretty common problem. There are tons of libraries out there to help accomplish this. At Viget, we typically use Nokogiri for our XML needs.

Recently, I was faced with the challenge of parsing a 60MB, 1.1+ million line XML document into a DOM (Document Object Model -- basically a traversable XML node tree). Nokogiri has a really fast XML parser that will generate a DOM for you. Totally awesome, with one major caveat: the entire DOM lives in memory. The bigger the XML document, the more memory required. My million-liner was using over 2GB of RAM. With most modern setups this isn't necessarily a problem, but it does raise the question of whether we need the entire XML document in memory at once.

XML files can get much, much bigger than the 60MB one I had to deal with, so at some point RAM will bottleneck DOM parsing. If you want to be memory conscious, what are your options? Let's talk.

SAX Parsing

SAX (Simple API for XML) is an alternative parsing strategy that treats the XML as an event-based stream. A SAX parser reads through the document sequentially, triggering events as it encounters opening tags, text, and closing tags. Nokogiri provides a SAX parser. Let's take a look at an example:

SAX in Action

Given an example XML Document ('./test/fixtures/example.xml'):

<?xml version="1.0"?>
<Data>
  <Node>
    <Element Type="Boom">He's on fire!</Element>
  </Node>
  <Node>
    <Element Type="Shaka">Jams it in!</Element>
  </Node>
  <Node>
    <Element Type="Laka">Razzle Dazzle!</Element>
  </Node>
</Data>

Nokogiri's SAX parsing requires us to define a SAX document class that implements the event-handling methods:

class Parser < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    # Handle each element, expecting the name and any attributes
  end

  def characters(string)
    # Any characters between the start and end element expected as a string
  end

  def end_element(name)
    # Given the name of an element once its closing tag is reached
  end
end

To begin SAX parsing, we'd use our Parser like this:

Nokogiri::XML::SAX::Parser.new(Parser.new).parse(File.open('./test/fixtures/example.xml'))

Nokogiri will stream through our XML file, calling start_element, characters, and end_element as it hits each opening tag, text content, and closing tag. For example, say the SAX parser is handling our <Element Type="Boom">He's on fire!</Element> line. Here's what we'd get if we had a binding (binding.pry or byebug) in each method:

# In start_element:
pry(main)> name
=> "Element"
pry(main)> attrs
=> [["Type", "Boom"]]

# In characters:
pry(main)> string
=> "He's on fire!"

# In end_element:
pry(main)> name
=> "Element"

SAX is well-suited to pulling particular strings out of an XML document. If you're more concerned with the attributes or content of specific elements than with the greater context or structure of a particular node/element, look to SAX!

DOM Parsing

XML is a document format for capturing a structured data tree, so the structure is almost always important. When parsing XML via SAX, you lose that tree structure. When XML is parsed via a DOM parser, you get a fully traversable set of objects that mirrors the structure of the original XML file. As I mentioned earlier, Nokogiri's DOM parser is fast and awesome -- it just has to store the DOM in memory.

DOM in Action

The primary DOM parser in Nokogiri is called like this:

Nokogiri::XML(File.open('./test/fixtures/example.xml'))
# or..
Nokogiri::XML.parse(File.open('./test/fixtures/example.xml'))

You'll get a Nokogiri::XML::Document object back -- the DOM representation of the XML file. You can inspect the name, attributes, and child nodes from any point in the DOM. You can also use XPath to search the DOM.

For most use cases, this is probably all you need. If, however, you do need to be concerned about memory, there's a Nokogiri::XML::Reader that will stream the XML node-by-node.

Nokogiri::XML::Reader in Action

The XML reader is slightly more memory intensive than the SAX parser, but the advantage here is that you retain the XML structure of the given node. If we had millions of <Node> nodes but wanted to process each node as soon as it was read, we could extract those nodes one at a time with the reader and hand each one off to some other object for processing:

class NodeHandler < Struct.new(:node)
  def process
    # Node processing logic
  end
end

Nokogiri::XML::Reader(File.open('./test/fixtures/example.xml')).each do |node|
  if node.name == 'Node' && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    NodeHandler.new(
      Nokogiri::XML(node.outer_xml).at('./Node')
    ).process
  end
end

The Reader yields a Nokogiri::XML::Reader object for every node it encounters, so our if check filters for opening tags (Nokogiri::XML::Reader::TYPE_ELEMENT) named 'Node'. Here's what we have inside our if block:

pry(main)> node.name
=> "Node"
pry(main)> node.node_type
=> 1 # This means we're at an opening tag for a node
pry(main)> node.inner_xml
=> "\n    <Element Type=\"Boom\">He's on fire!</Element>\n  "
pry(main)> node.outer_xml
=> "<Node>\n    <Element Type=\"Boom\">He's on fire!</Element>\n  </Node>"

The reader objects represent nodes and let us inspect the surrounding XML content as strings. The outer_xml method gives us an XML string that captures exactly one full <Node> -- just what we need! To do our processing, we use the normal DOM parser to turn that XML string back into an XML DOM, then grab the node we want:

Nokogiri::XML(node.outer_xml).at('./Node')

This returns a Nokogiri::XML::Element-type object with a nice interface:

# Assuming our Nokogiri::XML::Element is stored in a variable named node:
pry(main)> node.name
=> "Node"
pry(main)> element = node.at('./Element')
=> #<Nokogiri::XML::Element:0x3fd5d9a8fcc0 name="Element" attributes=[#<Nokogiri::XML::Attr:0x3fd5d9a8fbf8 name="Type" value="Boom">] children=[#<Nokogiri::XML::Text:0x3fd5d9a8ef78 "He's on fire!">]>
pry(main)> element.name
=> "Element"
pry(main)> element['Type']
=> "Boom"
pry(main)> element.content
=> "He's on fire!"

In Conclusion

There are a number of ways to parse XML in Ruby, but when you're dealing with particularly large XML files, look to Nokogiri's stream-based XML parsers. Depending on your needs, reach for the Nokogiri::XML::SAX::Parser or Nokogiri::XML::Reader for great justice!

Ryan is a developer in Viget's Falls Church, VA, HQ, where he believes in being a liaison for both the technical and non-technical. He builds elegant tools for clients such as Bozzuto and Millitello Capital, as well as internal tools that we use at Viget every day.
