Splitting PDFs with HexaPDF

Eli Fatsi, Development Director

Article Categories: #Code, #Back-end Engineering

How to split multi-page PDFs into smaller subsets using HexaPDF

On a recent client project, we built the ability to split large PDF files into smaller subsets based on page numbers. The UI was simple and (hopefully) user-friendly, and the code behind the flow was fairly straightforward, mostly thanks to the Ruby gem HexaPDF. Let's jump into how this thing works!

The UI

Here's a rundown of the steps a user must take to split a PDF into subsets.

Step 1: Upload a PDF. This is saved as a temporary file on the application's server for the duration of the split procedure.

upload

Step 2: Review the PDF. You can scroll horizontally through all of the pages and click on any of them to reveal a zoomed-in version of the page. Assuming you have a plan in mind for splitting your PDF, this view makes it easy to see where the splits need to occur.

inspect

Step 3: Lay out the split instructions. For each split that you're going for, you must provide the start page, end page, and a file name for the impending new document.

chop
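Since those split instructions come straight from a form, it can be worth sanity-checking them before any PDF work happens. Here's a minimal sketch of what that might look like, assuming each split arrives as a hash with filename, start_page, and end_page keys. The split_errors helper and its messages are hypothetical, not part of HexaPDF or our actual project code.

```ruby
# Hypothetical helper, not part of HexaPDF or the original project.
# Returns a list of human-readable problems for one split hash.
def split_errors(split, page_count)
  errors = []
  start_page = split[:start_page]
  end_page = split[:end_page]

  errors << "filename is blank" if split[:filename].to_s.strip.empty?

  # Bail out early: the range checks below only make sense for integers.
  unless start_page.is_a?(Integer) && end_page.is_a?(Integer)
    return errors << "start_page and end_page must be integers"
  end

  errors << "start_page must be at least 1" if start_page < 1
  errors << "end_page must not exceed #{page_count}" if end_page > page_count
  errors << "start_page must not come after end_page" if start_page > end_page
  errors
end

splits = [
  { filename: "New File 1", start_page: 3, end_page: 6 },
  { filename: "New File 2", start_page: 8, end_page: 12 }
]

# Assumes a 9-page source document, purely for illustration.
splits.each do |split|
  problems = split_errors(split, 9)
  puts "#{split[:filename]}: #{problems.empty? ? 'ok' : problems.join(', ')}"
end
```

Returning a list of messages rather than raising makes it easy to show every problem next to the form field it belongs to.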

The Code

As mentioned before, we leaned heavily on HexaPDF for this task. According to its README, this gem is helpful for creating, manipulating, merging, extracting, securing, and optimizing PDF files. In our case we're extracting from one PDF and creating others, and HexaPDF worked like a charm. Let's see how this looks in code!

Quick caveat: I'm going to gloss over some of the code for the sake of brevity (routes/controllers for the UI flow, how PDFs are synced to S3 & the database). Here we'll stick to the interesting bits of subsetting a PDF.

Assuming we have a temporary PDF on hand and some split parameters that look roughly like this:

split_attributes = [
  {
    filename: "New File 1",
    start_page: 3,
    end_page: 6,
  },
  {
    filename: "New File 2",
    start_page: 8,
    end_page: 9,
  }
]

then here's the set of HexaPDF calls to get the job done:

pdf = HexaPDF::Document.open(temporary_pdf.file.path)

split_attributes.each do |split|
  create_pdf_split(pdf, split)
end

And there you have it! Code that splits PDFs in 4 lines! Just kidding, let's unpack that create_pdf_split method call.

def create_pdf_split(pdf, split)
  new_pdf = HexaPDF::Document.new

  pdf.pages.each_with_index do |page, index|
    # Move on if we're outside the start/end range
    next if index + 1 < split[:start_page]
    next if index + 1 > split[:end_page]

    new_pdf.pages << new_pdf.import(page)
  end

  filename = Rails.root.join("tmp", "#{split[:filename]}.pdf").to_s

  # https://github.com/gettalong/hexapdf/issues/30
  # Adding `validate: false` as a workaround for the HexaPDF error:
  # Validation error: Required field BaseFont is not set
  # Write the newly built subset, not the source document
  new_pdf.write(filename, validate: false, optimize: true)
end

As mentioned before, we're leaning on HexaPDF for two purposes: extracting and creating. We use HexaPDF::Document.open and pdf.pages.each_with_index to read through the pages of our starter PDF one by one. We then create a new document for each configured split by using HexaPDF::Document.new and new_pdf.pages << new_pdf.import(page).
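The one fiddly bit in that loop is the off-by-one: the split parameters are 1-based and inclusive, while each_with_index counts from 0. As a rough illustration of the same arithmetic, here's a sketch that turns a split into a 0-based Range. The page_index_range helper is hypothetical, and a plain Array of strings stands in for the PDF's pages so the example runs without the gem.

```ruby
# Hypothetical helper: converts a split's 1-based, inclusive page numbers
# into a 0-based Range suitable for indexing an Array of pages.
def page_index_range(split)
  (split[:start_page] - 1)..(split[:end_page] - 1)
end

# Plain strings stand in for the page objects of a 9-page document.
pages = (1..9).map { |n| "page #{n}" }

split = { filename: "New File 1", start_page: 3, end_page: 6 }
selected = pages[page_index_range(split)]
puts selected.inspect
# prints ["page 3", "page 4", "page 5", "page 6"]
```

This selects the same pages that the two next-if guards let through; it's just a different way of writing the bounds check.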

At the end of the day, this feature came together pretty quickly thanks to the power of HexaPDF combined with its Ruby-esque ease of use.

Finally, an ode to open source maintainers

PDF format issues are a thing. We hypothesize that PDF files are generated from a wide variety of origins (exported from a Google Doc, downloaded from DocuSign, scanned in from any number of scanner manufacturers), and each brings its own flavor of formatting curveballs.

As such, our final code for this feature is sprinkled with a few workarounds (such as the validate: false parameter on the write call) in order to gracefully handle malformed PDF files. Best case, the formatting issue can be dynamically resolved in code. Worst case, we attempt to surface the most useful error that we can to the user (which is usually just: "open & save this PDF somehow and try again" AKA "turn it off and turn it on again").
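To give a feel for that worst case, here's a minimal sketch of wrapping the split in a rescue and handing back a friendly message. Everything in it is hypothetical: the attempt_split name, the result hash, and the message wording are illustrative, not our production code. The error classes are passed in as a parameter so the example runs without the gem; in a real app you would likely pass HexaPDF's own error classes (such as HexaPDF::Error) instead of the StandardError default.

```ruby
# Hypothetical wrapper: runs the block and converts any of the given
# error classes into a result hash instead of letting them propagate.
def attempt_split(error_classes: [StandardError],
                  message: "We couldn't process this PDF. Try opening and re-saving it, then upload it again.")
  yield
  { ok: true }
rescue *error_classes => e
  # Keep the original error text around for logging or debugging.
  { ok: false, message: message, detail: e.message }
end

# Simulate a failure with a plain RuntimeError standing in for a PDF error.
result = attempt_split { raise "Required field BaseFont is not set" }
puts result[:message]
puts result[:detail]
```

Errors outside the given list still propagate, so genuine bugs don't get hidden behind the "try again" message.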

Hats off to the maintainer of the HexaPDF gem - gettalong. Not only did the HexaPDF gem make this feature dead simple to implement, but the author worked closely with me and others in getting to the bottom of malformed PDF issues. The world of open source software is a wonderful world to be a part of, and I greatly appreciate those who contribute their time and talents to this degree.

Eli Fatsi

Eli uses his mathematics degree from Carnegie Mellon to blur the lines between the digital and physical worlds. He codes for Shure, Volunteers of America, and other clients from our Boulder, CO, office.
