Extract Embedded Text from PDFs with Poppler in Ruby

Article Categories: #Code, #Back-end Engineering

Posted on February 10, 2022

Here's how (and why) to use the open-source Poppler library (and its corresponding Ruby gem) to extract text content from your PDF documents.

A recent client request had us adding an archive of magazine issues dating back to the 1980s. Pretty straightforward stuff, with the hiccup that they wanted the magazine content to be searchable. Fortunately, the example PDFs they provided us had embedded text content¹, i.e. the text was selectable. The trick was to figure out how to programmatically extract that content.

Our first attempt involved the pdf-reader gem, which worked admirably, with the caveat that it had trouble with multi-column and art-directed layouts², which made up a lot of the content we were dealing with.

A bit of research uncovered Poppler, “a free software utility library for rendering Portable Document Format (PDF) documents,” which includes text extraction functionality and has a corresponding Ruby library. This worked great and here’s how to do it.

Install Poppler

Poppler installs as a standalone library. On Mac:

brew install poppler

On (Debian-based) Linux:

apt-get install libgirepository1.0-dev libpoppler-glib-dev

In a (Debian-based) Dockerfile:

RUN apt-get update && \
  apt-get install -y libgirepository1.0-dev libpoppler-glib-dev && \
  rm -rf /var/lib/apt/lists/*

Then, in your Gemfile:

gem "poppler"

Use it in your application

Extracting text from a PDF document is super straightforward:

# Open the PDF, then concatenate the text of every page.
document = Poppler::Document.new(path_to_pdf)
document.map { |page| page.get_text }.join
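
For search indexing it can help to keep each page's text separate, so that a hit can point at a page number. The following is a minimal sketch rather than our production code: `page_texts` and `pdf_page_texts` are hypothetical helper names, and the whitespace cleanup is an assumption about what a search index wants. It relies only on the two calls shown above (`Poppler::Document.new` and `get_text`) and on the document being enumerable.

```ruby
# Hypothetical helper: collect cleaned-up text for each page.
# `pages` is anything enumerable whose items respond to `get_text`,
# such as a Poppler::Document.
def page_texts(pages)
  pages.each_with_index.map do |page, index|
    # Collapse runs of whitespace and newlines into single spaces
    # (an assumption about what the search index prefers).
    text = page.get_text.to_s.gsub(/\s+/, " ").strip
    { number: index + 1, text: text }
  end
end

# Hypothetical wrapper: open a PDF from disk and extract per-page text.
def pdf_page_texts(path_to_pdf)
  require "poppler" # loaded lazily so page_texts itself has no dependency
  page_texts(Poppler::Document.new(path_to_pdf))
end
```

Calling `pdf_page_texts(path_to_pdf)` should return an array of hashes with 1-based page numbers, which can then be handed to whatever search backend you use.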

The results are really good, and Poppler understands complex page layouts to an impressive degree. Additionally, the library seems to support a lot more advanced functionality. If you ever need to extract text from a PDF, Poppler is a good choice.

John Popper photo by Gage Skidmore, CC BY-SA 3.0

¹ Note that we’re not talking about extracting text from images/OCR; if you need to take an image-based PDF and add a selectable text layer to it, I recommend OCRmyPDF.
² So for a page like this:
```
+-----------------+---------------------+
| This is a story | my life got flipped |
| all about how   | turned upside-down  |
+-----------------+---------------------+
```
pdf-reader would parse this into “This is a story my life got flipped all about how turned upside-down,” which led to issues when searching for multi-word phrases.
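
The effect on phrase search can be reproduced with a small sketch that needs no PDF at all. The `rows` array is just the example page above typed in by hand, and the two join orders stand in for a row-by-row reading and a column-aware reading; neither is the actual algorithm of either library.

```ruby
# The two-column example page, as rows of [left cell, right cell].
rows = [
  ["This is a story", "my life got flipped"],
  ["all about how",   "turned upside-down"]
]

# Reading straight across each row interleaves the two columns.
row_wise = rows.flatten.join(" ")

# Reading down each column in turn keeps every sentence together.
column_wise = rows.transpose.flatten.join(" ")

phrase = "story all about"
puts row_wise.include?(phrase)    # false
puts column_wise.include?(phrase) # true
```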
