Natural Language Processing of Articles

Aakash Tandel, Former Data Scientist

Article Categories: #Code, #Data & Analytics


An overview of natural language processing techniques ranging from text preprocessing to word embeddings with Word2Vec.

In a previous article, I talked about scraping article data from our site. I pulled the title, author name, hashtags, date, and text from each of our posts. My goal was to analyze the text data using topic modeling and word embeddings, in an attempt to learn more about the type of content we are producing.

Natural language processing is a large field of study and the techniques I walk through in this article are just the tip of the iceberg. I will go over preprocessing text data, vectorizers, topic modeling, and word embeddings at a high level. Hopefully, this article gets the wheels turning in your mind about how you too can analyze your large bodies of text and gain understanding with natural language processing.

Unlike using Bayesian models to forecast customer lifetime value, these natural language processing techniques may have less-obvious use cases for business-minded readers. But modern tech companies use these techniques for sentiment analysis, programmatic document classification, document retrieval, automatic summarization of text, chatbot training, and more. How you use these natural language processing techniques is entirely up to you and your use case.

So, let’s begin!

Preprocessing

The first step in applying topic modeling, word embedding, or other natural language processing techniques is to clean and preprocess our text data, because computers don’t read human language all that well.


Text or character encoding - the way your computer stores characters like z or 8 as bits - can be annoying to deal with. For a refresher on character encoding, I recommend (re-)reading What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text. In order to keep Python happy, I keep all of my text data in the universal encoding format UTF-8. If you run across UnicodeDecodeErrors, there could be an issue with your character encoding.

Text data needs to be prepped before it can be fed into topic modeling algorithms or vector space models like Doc2Vec. Your method of preprocessing text data can differ based on the specific use case, but I tend to default to the same steps. I began preprocessing my scraped text data by lowercasing all of the text, removing the punctuation, and getting rid of words with fewer than two characters.
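Those first cleanup steps can be sketched in a few lines of plain Python (the function name here is my own, not from the original project):

```python
import string

def clean_text(raw_text):
    """Lowercase, strip punctuation, and drop one-character words."""
    lowered = raw_text.lower()
    # Remove every punctuation character in one pass.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Keep only words with at least two characters.
    return " ".join(word for word in no_punct.split() if len(word) >= 2)

print(clean_text("Scraping Viget's articles, one at a time!"))
# -> scraping vigets articles one at time
```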

Then, I began the process of tokenization. Tokenization “splits longer strings of text into smaller pieces, or tokens.” All of my article text was contained in the first entry of a list, in a Pandas series. Tokenization turned that list of one item into a list of hundreds of items. Each of our articles was turned into a laundry list of words.

Next, I removed all of the stopwords. Stopwords are common words, such as and or but, that don’t have a lot of meaning as stand-alone words. Packages like Natural Language Toolkit (commonly referred to as NLTK), Gensim, and SpaCy make removing stopwords easy.
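A minimal sketch of tokenization and stopword removal, using a tiny hand-rolled stopword set (NLTK, Gensim, and SpaCy ship far more complete lists):

```python
# A tiny illustrative stopword set; real libraries ship hundreds of entries.
STOPWORDS = {"and", "but", "the", "a", "an", "of", "to", "is", "in"}

def tokenize(text):
    """Split a longer string of text into smaller pieces, or tokens."""
    return text.split()

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenize("the design of the site is clean and fast")
print(remove_stopwords(tokens))
# -> ['design', 'site', 'clean', 'fast']
```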

Many people stem and lemmatize text data. Say you have the words play, playing, and played. We want our computer to realize these three words are essentially the same. A topic model should put articles with these three words in the same topic. Stemming is when we reduce a word to the root; raccoons becomes raccoon and juggling becomes juggl. Personally, I am not a big fan of stemming. I have stemmed plenty of text documents and ended up with illegible words. Lemmatization “usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.” I have found lemmatization to be a useful preprocessing step. Once again, NLTK, Gensim, and SpaCy are here to make lemmatization easy.
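To illustrate the idea, here is a toy lemmatizer backed by a hard-coded lookup table; a real lemmatizer, such as NLTK's WordNetLemmatizer, relies on a full vocabulary and morphological analysis rather than a dictionary like this one:

```python
# Toy lookup table standing in for a real vocabulary-backed lemmatizer.
LEMMAS = {
    "play": "play", "playing": "play", "played": "play",
    "raccoons": "raccoon", "juggling": "juggle",
}

def lemmatize(token):
    """Return the dictionary form (lemma) of a token, if we know it."""
    return LEMMAS.get(token, token)

print([lemmatize(t) for t in ["play", "playing", "played"]])
# -> ['play', 'play', 'play']
```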

My preprocessing did not end with lemmatization. I needed to match article titles with page titles from Google Analytics because I wanted to tie web analytics data to the text data. In Google Analytics, each URL and page title has an associated pageview count. In many instances, the page title from Google Analytics did not match the article title from my web scraped data. I solved this problem by removing punctuation and lowercasing all of the page titles. Then I used fuzzy matching - check out the Python library FuzzyWuzzy - to match Google Analytics page titles with article titles.
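The matching step can be sketched with the standard library's difflib, which FuzzyWuzzy builds on; the titles below are made-up examples, not real Google Analytics data:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Rough 0-100 similarity score between two strings, FuzzyWuzzy-style."""
    return round(SequenceMatcher(None, a, b).ratio() * 100)

def best_match(title, candidates):
    """Pick the candidate page title closest to the scraped article title."""
    return max(candidates, key=lambda c: similarity(title, c))

ga_titles = ["natural language processing of articles",
             "scraping article data"]
print(best_match("natural language processing of our articles", ga_titles))
# -> natural language processing of articles
```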

At the end of my preprocessing, I had a neat and organized Pandas dataframe containing the original web scraped title, the full article, the author name, the date the article was published on, the original hashtags, the tokenized article, the tokenized hashtags, the tokenized title, and the pageviews the article recorded in Google Analytics. I could have continued to preprocess and clean my text data, but at this point I decided to move forward with the more interesting stuff — topic modeling and word embedding.


Vectorizers

At this point, I still hadn’t actually translated the human text language — in my case, tokenized article text — into computer language. There are three vectorizers within Scikit-Learn that are commonly used for natural language processing: the Count Vectorizer, the Hashing Vectorizer, and the Term Frequency-Inverse Document Frequency (TF-IDF) Vectorizer. All three of these vectorizers are based on a bag-of-words model, which converts our tokens into mathematical representations based on their frequency in a document.

The CountVectorizer is the simplest vectorizer. It counts the number of times a token occurs in a document or body of text and assigns that value to the token. For example, if dog showed up 12 times in one of our articles, then the dog token would be given the value of 12 for that article.

The HashingVectorizer uses a hashing function to vectorize a token, turning words into numbers and documents into matrices. The hashing function still relies on term frequency, just like the CountVectorizer. I don’t usually use the Scikit-Learn HashingVectorizer because the tokens are not legible after the hashing occurs. That said, hashing is beneficial when you have memory issues, because the HashingVectorizer doesn’t require you to “store a vocabulary dictionary in memory.”

The TfidfVectorizer weighs how frequently a term occurs within one document (in our case, how frequently a word occurs in a given article) AND how frequently that term occurs within the entire corpus (in our case, how frequently a word occurs across our entire set of article texts). I thought the two-pronged approach of TF-IDF was the way to go, so I used Scikit-Learn’s TfidfVectorizer to translate my text data into computer-readable vectors.
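Assuming scikit-learn is installed, comparing the count-based and TF-IDF approaches on a toy two-document corpus looks like this:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the dog chased the ball",        # toy stand-ins for article texts
          "the designer reviewed the site"]

# Raw term counts: one row per document, one column per vocabulary word.
counts = CountVectorizer().fit_transform(corpus)
print(counts.shape)

# TF-IDF: "the" appears in both documents, so IDF down-weights it
# relative to words unique to one document, like "dog" or "designer".
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(corpus)
print(sorted(tfidf.vocabulary_))
```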

Topic Modeling

After that work, I began topic modeling. Topic modeling ended up yielding no useful results for me on this project, but I still want to share the process (it’s about the journey sometimes, right?). Exactly what is topic modeling? “Topic modeling provides us with methods to organize, understand, and summarize large collections of textual information.” The aim of topic modeling is to figure out what topical patterns exist in documents.

In our articles we have the following sentences:

  • Article 1: My tool of choice for Continuous Integration lately is CircleCI 2.0. I've really enjoyed it. I don't think I went out of my way to pick them, honestly, but a couple of co-workers before me set up projects that I work on with CircleCI; I think they made a good choice.
  • Article 2: Welcome to your JUNIOR EMPLOYEE. Before getting started, read this manual carefully to understand your junior’s features and capabilities. Retain for future reference.
  • Article 3: In design school, everyone was on the same page. People knew about existing design trends and understood that every designer has their own process.

Maybe our topic modeling will spit out the following information:

  • Topic 1: 20% coworkers, 30% junior, 50% people
  • Topic 2: 20% integration, 30% trend, 50% reference

As humans, we can see that Topic 1 may have to do with office dynamics and helping people work together. Topic 2 may be about technical specifications. That would make sense with the following output from our topic model:

  • Article 1: 25% Topic 1, 75% Topic 2
  • Article 2: 100% Topic 1
  • Article 3: 50% Topic 1, 50% Topic 2

I used the Python library Gensim to conduct my topic modeling. The topic model I used was Latent Dirichlet Allocation, or LDA. “In the LDA model, each document is viewed as a mixture of topics that are present in the corpus. The model proposes that each word in the document is attributable to one of the document’s topics.”

There are a few alternatives to LDA. Check out Latent Semantic Indexing (LSI) and Hierarchical Dirichlet Process (HDP) if you have an aversion to LDA. There are pros and cons to all three types of topic models, but I won’t get into that here.

I have hit difficulties using LDA at two points.

Point 1: choosing the number of topics LDA should use. The number of topics has a big impact on what is contained in those topics. You are in control of the number of topics, but the model is in charge of the categorization. Let’s look at an example.

  • Option 1: Say you have chosen to use two topics for your LDA model. Your LDA model will then build two topics. In Topic 1, you have “apple” and “chimpanzee” and in Topic 2 “Porsche” and “Mykonos.” I have difficulty explaining or naming Topic 1 and Topic 2.
  • Option 2: Say you have chosen to use thirty-two topics for your LDA model. In Topic 12, you have “Porsche Boxster” and in Topic 18, you have “Porsche Cayman.” In this case, we probably don’t need topics to be as granular as the specific type of Porsche.
  • Best Scenario: Selecting the appropriate number of topics could put all of our cars in one topic, our food in one topic, our animals in one topic, and our locations in another.

The number of topics is the first important parameter you need to choose when using LDA. I wish I could tell you to always pick five or twelve topics, but I cannot. Experiment with the number of topics to find interpretable topics that you can definitively name.

Point 2: interpreting the results of those topics. The topics are not conveniently labeled “Topic about Politics” or “Topic about Hardware.” In some cases, the topic is easily discernible, and in other cases, it isn’t. It is up to you, the data scientist, to give meaning to the topics created by your LDA model.

After I used LDA on our articles, some of the topics made sense. For example, one topic contained design, image, site, page, color, sing, code, work, designer, video, look, text, and scene. If I were to guess, this topic contains articles related to the design side of Viget’s capabilities. Some of the topics were more cryptic. I couldn’t easily identify a topic from site, time, developer, data, project, viget, page, google, code, and craft. All of these words related to aspects of our work at Viget, but there was no clear demarcation of what the topic could be.

I passed my topic model into pyLDAvis in order to see my topics in a data visualization. Even with the data visualization, I was unable to decipher every single one of my topics. Maybe I needed to preprocess my data more. Maybe I should have used the CountVectorizer instead of TfidfVectorizer, or maybe I should have used HDP instead of LDA. Regardless, the ambiguity in my topics revealed the difficulty of working with text data. Maybe topic modeling wasn’t the right move in the first place.

Word Embeddings

Bag-of-words models don’t account for the association between words. “Vector space models (VSMs) represent (embed) words in a continuous vector space where semantically similar words are mapped to nearby points ('are embedded nearby each other').” I experimented with two VSMs, Word2Vec and Doc2Vec. Though the details of Word2Vec are outside of the scope of this blog post, understand that Word2Vec is a combination of two models, a Continuous Bag-of-Words model and a Skip-Gram model. The Doc2Vec algorithm is similarly built by two models and “represents each document by a dense vector which is trained to predict words in the document.”

I used Gensim’s Doc2Vec to see if there was a better way to analyze my text data. Doc2Vec said the most similar words to “data” were:

  • Database
  • Analysis
  • Collected
  • Visualization
  • API
  • Retool
  • Datasets
  • Tracking
  • Alphabetically
  • Recognizes

That made sense! Similar words to Mitch and Paul (two members of the Data & Analytics team) were words like stanley, analyst, decker, and albert. Stanley Black & Decker is one of our team’s clients, and Albert is another member of the Data & Analytics team. When I asked Doc2Vec to find the most similar words to Mitch and Paul and words that were dissimilar to client, it returned daniel, koch, greg, and irish. Mitch’s last name is Daniels and Paul’s is Koch. Doc2Vec removed words related to Stanley Black & Decker from the list.

Additionally, the Doc2Vec model gave me a list of the most similar documents when I passed it a specific document. Word embeddings, by capturing the semantic relationships between words and phrases, were surprisingly successful at lumping together text data from our articles.

I used Word2Vec and Doc2Vec to convert our words and documents into vectors, and those vectors could be used to conduct sentiment analysis, compare the similarity between documents, perform information retrieval, or automatically summarize bodies of text. We can even use vector representations to power chatbots. I will be the first to admit that my experience in applying word embeddings to deployable products is wholly experimental. My next immediate natural language processing goal is to use VSMs to automatically summarize each of our articles (stay tuned for that article). If successful, this technique could be applied to creating meta tags and descriptions for our clients with large content sites.


At Viget, we will continue experimenting with topic modeling and word embeddings algorithms like Word2Vec and Doc2Vec. Analyzing text data can be a difficult endeavor, but the potential benefits are massive.
