Natural Language Processing

An introduction to natural language processing (NLP) and its powerful Python libraries

Natural language processing (NLP) is a significant subfield of machine learning that deals with the interactions between machines (computers) and human (natural) languages. Natural languages are not limited to speech and conversation; they can be written or signed as well. Nowadays, NLP is broadly involved in our daily lives: we cannot live without machine translation; weather forecast scripts are automatically generated; we find voice search convenient; we get answers to questions quickly thanks to intelligent question answering systems; and speech-to-text technology helps students with special needs.

This post is taken from the book Python Machine Learning By Example, written by Yuxi (Hayden) Liu and published by Packt Publishing. In this book, you will learn the fundamentals of machine learning and master the art of building your own machine learning systems through an example-based, practical guide.

In this post, you will learn about NLP and its powerful Python libraries.

Understanding a language is difficult, but would it be easier to automatically translate texts from one language to another? In my first ever programming course, the lab booklet had the algorithm for coarse machine translation. This type of translation involves consulting dictionaries and generating new text. A more practically feasible approach is to gather texts that have already been translated by humans and train a computer program on them. In 1954, scientists claimed in the Georgetown–IBM experiment that machine translation would be solved within three to five years.

Unfortunately, a machine translation system that can beat human translators doesn't exist yet, but machine translation has evolved greatly since the introduction of deep learning.

Conversational agents or chatbots are another hot topic in NLP. The fact that computers are able to have a conversation with us has reshaped the way businesses are run. In 2016, Microsoft's AI chatbot Tay was unleashed to mimic a teenage girl and converse with users on Twitter in real time. She learned how to speak from what users posted and commented on Twitter. However, she was overwhelmed by tweets from trolls, automatically learned their bad behavior, and started to post inappropriate things on her feed. She ended up being shut down within 24 hours.

An important use case for NLP, at a much lower level than the previous cases, is part-of-speech tagging. A part of speech (POS) is a grammatical word category, such as noun or verb. Part-of-speech tagging attempts to determine the appropriate tag for each word in a sentence or a larger document. The following table gives examples of English POS:

Part of speech    Examples
Noun              David, machine
Pronoun           Them, her
Adjective         Awesome, amazing
Verb              Read, write
Adverb            Very, quite
Preposition       Out, at
Conjunction       And, but
Interjection      Unfortunately, luckily
Article           A, the

Touring powerful NLP libraries in Python

Having seen a short list of real-world NLP applications, we will now tour the essential stack of Python NLP libraries. These packages handle a wide range of NLP tasks, including those mentioned above as well as others such as sentiment analysis, text classification, named entity recognition, and many more.

The most famous NLP libraries in Python include the Natural Language Toolkit (NLTK), Gensim, and TextBlob. The scikit-learn library also has NLP-related features. NLTK was originally developed for educational purposes and is now widely used in industry as well. There is a saying that you can't talk about NLP without mentioning NLTK: it is the most famous and leading platform for building Python-based NLP applications. We can install it simply by running the pip install -U nltk command in a terminal.

NLTK comes with over 50 large and well-structured text datasets, which are called corpora in NLP. Corpora can be used as dictionaries for checking word occurrences and as training pools for learning and validating models. Some useful and interesting corpora include the Web text corpus, Twitter samples, the Shakespeare corpus sample, Sentiment Polarity, the Names corpus (which contains lists of popular names and which we will explore shortly), WordNet, and the Reuters benchmark corpus. The full list can be found on the NLTK data page. Before using any of these corpus resources, we first need to download them by running the following lines in a Python interpreter:

>>> import nltk

>>> nltk.download()

A new window will pop up and ask us which package or specific corpus to download:


Installing the whole package is strongly recommended since it contains all the important corpora needed for our current study and future research. Once the package is installed, we can look at its Names corpus. First, import the corpus:

>>> from nltk.corpus import names

The first ten names in the list can be displayed with the following:

>>> print(names.words()[:10])

['Abagael', 'Abagail', 'Abbe', 'Abbey', 'Abbi', 'Abbie',

'Abby', 'Abigael', 'Abigail', 'Abigale']

There are in total 7,944 names:

>>> print(len(names.words()))

7944
Other corpora are also fun to explore.

Besides its easy-to-use and abundant pool of corpora, more importantly, NLTK is equipped for many NLP and text analysis tasks, including the following:

  • Tokenization: Given a text sequence, tokenization is the task of breaking it into fragments, usually separated by whitespace. Certain characters are often removed along the way, such as punctuation marks, digits, and emoticons. These fragments are the so-called tokens used for further processing. Tokens composed of one word are also called unigrams in computational linguistics; bigrams are composed of two consecutive words, trigrams of three consecutive words, and n-grams of n consecutive words. Here is an example of tokenization:


  • POS tagging: We can apply an off-the-shelf tagger or combine multiple NLTK taggers to customize the tagging process. It is easy to directly use the built-in tagging function pos_tag, as in pos_tag(input_tokens). But behind the scenes, it is actually a prediction from a prebuilt supervised learning model, trained on a large corpus of correctly tagged words.
  • Named entity recognition: Given a text sequence, the task of named entity recognition is to locate and identify words or phrases that belong to definite categories, such as names of persons, companies, and locations. We will briefly mention it again in the next chapter.
  • Stemming and lemmatization: Stemming is the process of reverting an inflected or derived word to its root form. For instance, machine is the stem of machines, and learn is the stem of learning and learned. Lemmatization is a more cautious version of stemming: it considers the POS of the word when finding the root. We will discuss these two text preprocessing techniques in further detail shortly. For now, let's quickly look at how they are implemented in NLTK:

First, import one of the built-in stemming algorithms (PorterStemmer here; LancasterStemmer and SnowballStemmer are the other two), and initialize a stemmer:

>>> from nltk.stem.porter import PorterStemmer

>>> porter_stemmer = PorterStemmer()

Stem machines and learning:

>>> porter_stemmer.stem('machines')

'machin'

>>> porter_stemmer.stem('learning')

'learn'
Note that stemming sometimes involves chopping off letters if necessary, as we can see with machin here.

Now import a lemmatization algorithm based on the built-in WordNet corpus, and initialize a lemmatizer:

>>> from nltk.stem import WordNetLemmatizer

>>> lemmatizer = WordNetLemmatizer()

Similarly, lemmatize machines and learning:

>>> lemmatizer.lemmatize('machines')

'machine'

>>> lemmatizer.lemmatize('learning')

'learning'
Why is learning unchanged? It turns out that this algorithm only lemmatizes nouns by default.

Gensim, developed by Radim Rehurek, has gained popularity in recent years. It was initially designed in 2008 to generate a list of articles similar to a given article, hence the name of this library (generate similar → Gensim). It was later drastically improved by Radim Rehurek in terms of its efficiency and scalability. Again, we can easily install it via pip by running pip install --upgrade gensim in a terminal. Just make sure the dependencies NumPy and SciPy are already installed.

Gensim is famous for its powerful semantic and topic modeling algorithms. Topic modeling is a typical text-mining task of discovering the hidden semantic structures in a document. Semantic structure, in plain English, is the distribution of word occurrences. Topic modeling is clearly an unsupervised learning task: what we need to do is feed in plain text and let the model figure out the abstract topics.

In addition to these robust semantic modeling methods, Gensim also provides the following functionalities:

  • Similarity querying, which retrieves objects that are similar to the given query object
  • Word vectorization, which is an innovative way to represent words while preserving word co-occurrence features
  • Distributed computing, which makes it feasible to efficiently learn from millions of documents

TextBlob is a relatively new library built on top of NLTK. It simplifies NLP and text analysis with easy-to-use built-in functions and methods, as well as wrappers around common tasks. We can install TextBlob by running the pip install -U textblob command in a terminal.

Additionally, TextBlob has some useful features that are not currently available in NLTK, such as spell checking and correction, language detection, and translation.

Last but not least, scikit-learn provides all the text processing features we need, such as tokenization, in addition to its comprehensive machine learning functionality. Plus, it comes with a built-in loader for the 20 newsgroups dataset.

In this post, we learned about NLP and its powerful Python libraries. To learn more, including click-through prediction with tree-based algorithms in Python, read our book Python Machine Learning By Example by Packt Publishing.


© copyright 2017 All Rights Reserved.

A Product of HunterTech Ventures