Parsing Big XML Files with Nokogiri
Parsing XML files is a pretty common problem. There are tons of libraries out there to help accomplish this. At Viget, we typically use Nokogiri for our XML needs.
Recently, I was faced with the challenge of parsing a 60MB, 1.1+ million line XML document into a DOM (Document Object Model -- basically a traversable XML node tree). Nokogiri has a really fast XML parser that will generate a DOM for you. Totally awesome -- with one major caveat -- the entire DOM lives in memory. The bigger the XML document, the more memory required. My million-liner was using over 2GB of RAM. With most modern setups, this isn't necessarily a problem, but I do think it raises the question of whether we need to have the entire XML document in memory at once.
XML files can get much, much bigger than the 60MB one I had to deal with, so at some point RAM will bottleneck DOM parsing. If you want to be memory conscious, what are your options? Let's talk.
SAX (Simple API for XML) is an alternative parsing strategy that uses an event-based XML stream. With SAX, parsers move through the document sequentially, triggering events as elements are detected. Nokogiri provides a SAX parser. Let's take a look at an example:
SAX in Action
Given an example XML Document (
    <?xml version="1.0"?>
    <Data>
      <Node>
        <Element Type="Boom">He's on fire!</Element>
      </Node>
      <Node>
        <Element Type="Shaka">Jams it in!</Element>
      </Node>
      <Node>
        <Element Type="Laka">Razzle Dazzle!</Element>
      </Node>
    </Data>
Nokogiri's SAX parsing requires us to define a SAX document parser that defines the event-handling methods:
    class Parser < Nokogiri::XML::SAX::Document
      def start_element(name, attrs = [])
        # Handle each element, expecting the name and any attributes
      end

      def characters(string)
        # Any characters between the start and end element expected as a string
      end

      def end_element(name)
        # Given the name of an element once its closing tag is reached
      end
    end
To begin SAX parsing, we'd use our Parser like this:
Nokogiri will stream our XML file, calling out to start_element, characters, and end_element as it hits the opening tag, content, and closing tag. For example, say the SAX parser is handling our <Element Type="Boom">He's on fire!</Element> line. Here's what we'd get if we had a binding (byebug) in each method:
    # In start_element:
    pry(main)> name
    => "Element"
    pry(main)> attrs
    => [["Type", "Boom"]]

    # In characters:
    pry(main)> string
    => "He's on fire!"

    # In end_element:
    pry(main)> name
    => "Element"
SAX is well-suited to pulling particular strings out of an XML document. If you're more concerned about the attributes or content of specific elements rather than the greater context or structure of a particular node/element, look to SAX!
Because XML is a document format used to capture a structured data tree, it almost always means that the structure is important. When parsing XML via SAX, you lose that tree structure. When XML is parsed via a DOM-parser, you get a fully traversable set of objects that mirrors the structure of the original XML file. As I mentioned earlier, Nokogiri's DOM parser is fast and awesome -- it just has to store the DOM in memory.
DOM in Action
The primary DOM parser in Nokogiri is called like this:
    Nokogiri::XML(File.open('./test/fixtures/example.xml'))
    # or..
    Nokogiri::XML.parse(File.open('./test/fixtures/example.xml'))
You'll get a Nokogiri::XML::Document object back -- the DOM representation of the XML file. You can inspect the name, attributes, and child nodes from any point in the DOM. You can also use xpath to search the DOM.
For most use cases, this is probably all you need. If, however, you do need to be concerned about memory, there's a Nokogiri::XML::Reader that will stream the XML node-by-node.
Nokogiri::XML::Reader in Action
The XML reader is slightly more memory intensive than the SAX parser, but the advantage here is that you retain the XML structure of the given node. If we had millions of <Node> nodes but wanted to process each node as soon as it was read, we could extract those nodes one at a time with the reader and hand each one off to some other object for processing:
    class NodeHandler < Struct.new(:node)
      def process
        # Node processing logic
      end
    end

    Nokogiri::XML::Reader(File.open('./test/fixtures/example.xml')).each do |node|
      if node.name == 'Node' && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
        NodeHandler.new(
          Nokogiri::XML(node.outer_xml).at('./Node')
        ).process
      end
    end
The node items the Reader yields to our block are Nokogiri::XML::Reader objects. The if condition keeps only those that represent an XML opening-type (Nokogiri::XML::Reader::TYPE_ELEMENT) node named 'Node'. Here's what we have inside that condition:
    pry(main)> node.name
    => 'Node'
    pry(main)> node.node_type
    => 1 # This means we're at an opening tag for a node
    pry(main)> node.inner_xml
    => "\n <Element Type=\"Boom\">He's on fire!</Element>\n "
    pry(main)> node.outer_xml
    => "<Node>\n <Element Type=\"Boom\">He's on fire!</Element>\n </Node>"
The reader objects represent nodes and let us inspect the surrounding XML content as strings. The outer_xml method gives us an XML string that captures exactly one full <Node> -- just what we need! To do our processing, we use the normal DOM parser to turn that XML string back into an XML DOM, then grab the node we want:

    Nokogiri::XML(node.outer_xml).at('./Node')
This returns a Nokogiri::XML::Element-type object with a nice interface:
    # Assuming our Nokogiri::XML::Element is stored in a variable named node:
    pry(main)> node.name
    => "Node"
    pry(main)> element = node.at('./Element')
    => #<Nokogiri::XML::Element:0x3fd5d9a8fcc0 name="Element" attributes=[#<Nokogiri::XML::Attr:0x3fd5d9a8fbf8 name="Type" value="Boom">] children=[#<Nokogiri::XML::Text:0x3fd5d9a8ef78 "He's on fire!">]>
    pry(main)> element.name
    => "Element"
    pry(main)> element['Type']
    => "Boom"
    pry(main)> element.content
    => "He's on fire!"
There are a number of ways to parse XML in Ruby, but when you're dealing with particularly large XML files, look to Nokogiri's stream-based XML parsers. Depending on your needs, look to the Nokogiri::XML::Reader for great justice!