Web Scraping for Vaccine Availability

Danny Brown, Senior Developer

Article Category: #Code

Scraping the web can be powerful, yet difficult to maintain and tricky to get right. However, it can be worth the effort, particularly in our circumstances relating to COVID-19 vaccine rollout.

The problem at hand #

At the beginning of 2021, I was part of the Viget team tasked with creating a website that displayed availabilities/appointments for COVID vaccines in Massachusetts.

As a whole, this project was very complex. In this article, I will zero in on the core feature that drove the functionality of the site: finding appointment availabilities.

This was done through the use of a scraper. A scraper is, in short, a program that takes a snapshot of a web page and extracts data from it. There are different techniques for scraping, but today I will focus on pulling down the HTML and parsing it directly. A simple example would be to scrape a website that posts jobs; the scraper would access the web page, grab the HTML, and parse the job postings. Once we have the job postings parsed, we could send an email with a digest of the various job postings we were able to scrape.
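To make that concrete, here is a bare-bones sketch of that job-board scraper. The URL, the markup it expects, and the idea of emailing a digest are all hypothetical, and it uses the same tools described below:

import requests
from bs4 import BeautifulSoup

def job_digest():
    # Hypothetical job board URL and markup, purely for illustration
    url = "https://example-job-board.com/jobs"
    html = requests.get(url).content
    soup = BeautifulSoup(html, "html.parser")

    # Assume each posting looks like <div class="job-posting"><h2>Title</h2>...</div>
    postings = soup.find_all("div", class_="job-posting")
    titles = [posting.find("h2").get_text(strip=True) for posting in postings]

    # Build a plain-text digest; actually emailing it is left out of this sketch
    return "New job postings:\n" + "\n".join(f"- {title}" for title in titles)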

We did something similar on the vaccine finder; we focused on parsing the HTML of the data we pulled down. This is commonly done through a Python package called Beautiful Soup. Beautiful Soup is a fairly simple HTML and XML parser, with incredibly easy installation and setup.

For the vaccine finder, we started with one site that we would visit (programmatically), grab the HTML on the page, and parse it for vaccine appointment times.

Here is a modified version of some of the base logic behind this scraper:

import requests
from bs4 import BeautifulSoup

def my_scraper():
    url = "https://some-vaccine-website.com"

    print(f"processing {url}")
    result = requests.get(url)

    if result.status_code != 200:
        raise Exception(
            f"Failed to fetch {url} with status {result.status_code}"
        )

    soup = BeautifulSoup(result.content, features="lxml")

    vaccine_sites = soup.find_all("div", class_="special_div_class")

To break down some of the pieces of that example:

1. vaccine_sites = soup.find_all("div", class_="special_div_class")

This is one of the ways to use BeautifulSoup — finding all instances of <div> tags that match a certain class, and parsing them and their children. In this example, an HTML block such as:

<div class="special_div_class">
    <h3>Gillette Stadium Vaccination Center</h3>
    <ul>
        <li>
            March 1st, 2021 at 8:15 AM
        </li>
        <li>
            March 1st, 2021 at 8:30 AM
        </li>
        ...
    </ul>
</div>

would be parsed and put into the vaccine_sites variable. We would then pull out the appointment times for each site using other HTML tags and enter them into our database.
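To sketch that last step, it might look something like the following. The tag names come from the simplified HTML above, and save_appointments is a hypothetical stand-in for our database code:

for site in vaccine_sites:
    # The <h3> holds the location name; each <li> holds one appointment time
    location_name = site.find("h3").get_text(strip=True)
    appointment_times = [li.get_text(strip=True) for li in site.find_all("li")]

    # Hypothetical helper that writes the results to our database
    save_appointments(location_name, appointment_times)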

2. soup = BeautifulSoup(result.content, features="lxml")

BeautifulSoup comes with an out-of-the-box parser; however, I'd recommend using lxml if you plan on using the package. While lxml introduces an extra dependency, it's faster than the built-in parser, and that speed mattered here. If you remember the early months of 2021, the COVID vaccination appointment ecosystem was comparable to trying to get a PS5 from an online retailer at console launch: appointments were filling up almost as fast as they were made available, so having accurate, up-to-date numbers was incredibly important.
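Swapping parsers is just a change to the features argument on the constructor, so it's easy to try both; the only catch is that lxml has to be installed as its own package alongside Beautiful Soup.

# Python's built-in parser: no extra dependency, but slower
soup = BeautifulSoup(result.content, features="html.parser")

# lxml parser: requires the separate lxml package, noticeably faster
soup = BeautifulSoup(result.content, features="lxml")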

Accurate appointment numbers have a separate bottleneck though, beyond scraper speed. The scraper only runs when it's called — that means there must be some mechanism for calling the scraper code at an interval to get up-to-date availabilities.

We achieved this by wrapping the scrapers in tasks and running them in Celery.

Here is some example code for running a task every five minutes:

from celery import Celery
from celery.schedules import crontab

app = Celery()
app.conf.beat_schedule["my_scraper"] = {
    # this would look for a method in scrapers.py called my_scraper
    "task": "scrapers.my_scraper",
    "schedule": crontab(minute="*/5"),
}
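
For that schedule entry to resolve, the scraper also needs to be registered as a Celery task. A minimal sketch of what scrapers.py could contain (the real tasks were more involved) is:

# scrapers.py
from celery import shared_task

@shared_task(name="scrapers.my_scraper")
def my_scraper():
    # fetch, parse, and store appointments, as shown earlier
    ...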

Awesome! Now we have a scraper that runs every 5 minutes to check for new appointments. We can now blissfully ride off into the sunset, knowing that the job is done...except that's not the case.

Here are a few things that went wrong during the development and rollout of the vaccine finder:

HTML structure and style changes #

Let's say the vaccine appointment site has been getting demolished by traffic from residents trying to schedule their vaccine appointments. One of the developers of that website finds an optimization that makes the site faster and more responsive, and it happens to change the structure of the HTML on the page. On the vaccine finder, we hit a wide variety of these HTML changes; a simple example was when the div class was renamed as part of a style update. In reality, larger, more structural changes happened too, but they required similarly simple, if tedious, fixes.

<div class="special_div_class_v2">
    <h3>Gillette Stadium Vaccination Center</h3>
    <ul>
        <li>
            March 1st, 2021 at 8:15 AM
        </li>
        <li>
            March 1st, 2021 at 8:30 AM
        </li>
        ...
    </ul>
</div>

soup.find_all("div", class_="special_div_class")  # This will completely fail to get any appointments now

The fix is rather simple: just change the class parameter in soup.find_all to the new class name, as shown below. The sinister part of this problem is that it happened very frequently.
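With this example, that one-line fix would be (assuming the class rename was the only change):

vaccine_sites = soup.find_all("div", class_="special_div_class_v2")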

Scrapers are fragile. While this example scraper is particularly fragile because it relies on a div's class staying consistent, fragility in scrapers can be mitigated but is nearly impossible to eliminate entirely. It is the nature of most scrapers that issues like this will arise.
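One small way to soften this particular kind of fragility (a sketch, not necessarily what we shipped) is to match the class with a pattern instead of an exact string, since Beautiful Soup accepts a compiled regular expression for the class_ argument:

import re

# Matches special_div_class, special_div_class_v2, and other variations
vaccine_sites = soup.find_all("div", class_=re.compile(r"^special_div_class"))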

Introducing some required JavaScript onto the page #

Using the same fake website as an example, let's say the developers add another change that requires JavaScript to run before appointments are visible. This becomes an issue with our implementation, since the Python requests library does not execute JavaScript. How would we get around this? There are multiple answers to this question, but we decided to use Selenium WebDriver. Using Selenium allowed us to wait for JavaScript-rendered content to load and to interact with the page. Here is a snippet of what that could look like:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def my_scraper():
    url = "https://some-vaccine-website.com"

    # Run Chrome headlessly so the scraper can run on a server
    options = webdriver.ChromeOptions()
    options.headless = True

    browser = webdriver.Chrome(options=options)
    browser.get(url)

    # Wait up to 30 seconds for the JavaScript-rendered element to appear
    WebDriverWait(browser, 30).until(
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, "some_css_selector")
        )
    )

    soup = BeautifulSoup(browser.page_source, features="lxml")
    browser.quit()

    # parsing what we pulled down
    # ...

This allowed us to keep finding vaccine availabilities even after the site started requiring JavaScript to display its appointments.

In general, the vaccine finder was an extremely rewarding application to work on. It was the first time that something I worked on made an obvious difference (at least to me). While web scraping is a powerful tool, dealing with the fragility of the scrapers can feel like an unending task, since we are at the mercy of the sites we're scraping to stay (at least partially) static. Ultimately, in this case, the extra work was completely worth it, given the nature of the solution we were providing.

Danny Brown

Danny is a senior developer in the Falls Church, VA, office. He loves learning new technology and finding the right tool for each job.
