Preparing a Book Index Using Python

You have probably come across some of those large textbooks and noticed the index at the end. With a hard copy, it is nice to have such an index so you can navigate to the desired page quickly. I have recently published a very short book, and when it came to setting up the index, the task seemed daunting even though the book is so short. The book doesn’t have an index yet anyway.

If you have been following my articles, you will have noticed that I mainly write about Python and how it can help us solve different problems in a simple manner. So let’s see how we can prepare a book index using Python.

Without further ado, let’s get started.

What Is a Book Index?

I’m pretty sure that most of you know what a book index is, but I just want to quickly clarify this concept.

A book index is simply a collection of words and/or phrases that are considered important to the book, along with their locations in the book. The index does not contain every word or phrase in the book. The reason for that is shown in the next section.

What Makes a Good Book Index?

What if you had an index through which you could find the location of every word or phrase in the book? Wouldn’t that be the index of choice? Wrong!

The index of choice, or what would be considered a good index, is one that points to the important words and phrases in the book. You might be wondering why. Let’s take an example. Say that we have a book that consists only of the following sentence:

My book is short

What would happen if we tried to index each word and phrase in that very short sentence, assuming that the location is the word number in the sentence? This is the index that we would have in this case:
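A minimal sketch of how such an exhaustive index could be generated, assuming that a phrase is any run of consecutive words and that its location is the number of its first word (the helper name `exhaustive_index` is made up for this illustration):

```python
def exhaustive_index(sentence):
    """Return (phrase, location) pairs for every word and phrase.

    A phrase is any run of consecutive words; its location is the
    1-based number of its first word in the sentence.
    """
    words = sentence.split()
    entries = []
    # Go from single words up to the whole sentence.
    for length in range(1, len(words) + 1):
        for start in range(len(words) - length + 1):
            phrase = " ".join(words[start:start + length])
            entries.append((phrase, start + 1))
    return entries


index = exhaustive_index("My book is short")
for phrase, location in index:
    print(phrase, location)
print(len(index), "entries for a book of 4 words")
```

The four-word sentence produces ten entries: four single words, three two-word phrases, two three-word phrases, and the sentence itself.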

From the example above, we can see that such an index would be larger than the book itself! So a good index is one that contains only the words and phrases considered important to the reader.

Setup

Natural Language Toolkit (NLTK)

In this tutorial, we will be using the Natural Language Toolkit (NLTK) library, which is used to work with human language data. As mentioned in the documentation, NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”

I’m currently writing this tutorial on an Ubuntu machine, so the steps for installing NLTK in this section apply to Ubuntu. But don’t worry, you can find the steps for installing NLTK on other operating systems on the NLTK website.

In order to install NLTK, I’m going to use pip. If you don’t have pip installed already, you can use the following command in your terminal to install it:

sudo easy_install3 pip

To make sure you have pip installed, type the following command:

pip --version

You should get something similar to the following:

pip 8.1.2 from /usr/local/lib/python3.5/dist-packages/pip-8.1.2-py3.5.egg (python 3.5)

Now, to install NLTK, simply run the following command in your terminal:

sudo pip install -U nltk

You can test the NLTK installation by typing python in your terminal and then importing nltk. If you get ImportError: No module named nltk, this thread might help you out.
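The same check can also be scripted. A minimal sketch (the helper name `nltk_version` is hypothetical) that reports the installed version, or None when the import fails:

```python
def nltk_version():
    """Return the installed NLTK version string, or None if NLTK is missing."""
    try:
        import nltk
    except ImportError:
        return None
    return nltk.__version__


version = nltk_version()
if version is None:
    print("NLTK is not installed; run: sudo pip install -U nltk")
else:
    print("NLTK", version, "is installed")
```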

Test File

At this point, we need a test file (book) to use for creating a book index. I’ll grab this book: The Rate of Change of the Rate of Change by the EFF. You can download the text file of the book from Dropbox. You can of course use any book of your choice; you just need something to experiment with in this tutorial.

Program

Let’s start with the interesting part of this tutorial: the program that will help us form the book index. The first thing we want to do is find the word frequencies in the book. I have shown how we can do that in another tutorial, but here I want to show you how we can do it using the NLTK library.

This can be done as follows:
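A minimal sketch of this step, assuming the book has been saved as a plain-text file and that NLTK's tokenizer data is already available. The helper names and the file name `book.txt` mentioned afterwards are placeholders of mine, not something prescribed by NLTK:

```python
from collections import Counter


def count_words(tokens):
    """Count how many times each token occurs."""
    frequencies = Counter()
    for token in tokens:
        frequencies[token] += 1
    return frequencies


def book_word_frequencies(path):
    """Tokenize the text file at `path` with NLTK and count its tokens."""
    # Imported here so that count_words() stays usable without NLTK.
    # word_tokenize also needs NLTK's tokenizer data (see nltk.download()).
    from nltk import word_tokenize

    with open(path, encoding="utf-8") as book:
        text = book.read()
    return count_words(word_tokenize(text))


# Tiny stand-in for a tokenized book, so the counting step runs as-is.
sample_tokens = "the rate of change of the rate of change".split()
frequencies = count_words(sample_tokens)
print(frequencies)
```

For a real book, calling `book_word_frequencies("book.txt")` would return the same kind of Counter for the whole file.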

When you run the program, you get a very long list of words and their frequencies.

Before moving further, let’s analyze the above code a bit. It uses Counter, a dictionary subclass from Python’s collections module, to get the word frequencies in the book (how many times each word occurs in the book).

word_tokenize, on the other hand, splits text into its constituent tokens: words and punctuation marks. Let’s take a simple example to see how word_tokenize actually works:

The output of the above script is as follows:

['My', 'name', 'is', 'Abder', '.', 'I', 'like', 'Python', '.', 'It', "'s", 'a', 'pretty', 'nice', 'programming', 'language']

We then loop through the words and find the frequency of occurrence of each word.

What about phrases (combinations of words)? Phrases whose words occur together often are called collocations. The simplest candidates are bigrams, that is, pairs of adjacent words. Similarly, trigrams are sequences of three adjacent words, and so forth (i.e. n-grams).
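To make the idea of n-grams concrete, here is a small plain-Python sketch that slides a window of n words over a token list. The helper `ngrams` below is written for illustration; NLTK ships a similar helper as `nltk.util.ngrams`:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["My", "book", "is", "short"]
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A list of four tokens yields three bigrams and two trigrams.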

Let’s say we want to extract the bigrams from our book. We can do that as follows:
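A minimal sketch using NLTK's `BigramCollocationFinder`. The short word list is only a stand-in of mine for the tokenized book; a real run would pass the tokens produced by word_tokenize instead:

```python
from nltk.collocations import BigramCollocationFinder

# Stand-in for the tokenized book.
words = "the rate of change of the rate of change".split()

finder = BigramCollocationFinder.from_words(words)
# Drop every bigram that occurs fewer than 2 times.
finder.apply_freq_filter(2)

# ngram_fd maps each remaining bigram to its count.
for bigram, count in finder.ngram_fd.items():
    print(bigram, count)
```

In this sample, "change of" and "of the" occur only once each, so they are filtered out and three bigrams remain.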

The number 2 passed to apply_freq_filter() tells the finder to ignore all bigrams that occur fewer than two times in the book.

If we want to find the 30 most frequent bigrams in the book, we can use the following code statement:
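One way to do this, sketched under the assumption that "most occurring" means ranking by raw count, is to ask the finder's frequency distribution for its most common entries. The sample words are again a stand-in for the tokenized book. The finder also has an `nbest()` method that takes a scoring measure such as `BigramAssocMeasures.pmi`, but that ranks bigrams by association strength rather than by how often they occur:

```python
from nltk.collocations import BigramCollocationFinder

# Stand-in for the tokenized book.
words = ("the rate of change of the rate of change is the rate "
         "of change").split()

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)

# most_common(30) returns up to 30 (bigram, count) pairs,
# most frequent first.
top_bigrams = finder.ngram_fd.most_common(30)
for bigram, count in top_bigrams:
    print(" ".join(bigram), count)
```

With a sample this small, fewer than 30 bigrams survive the filter, so the list is simply shorter.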

Finally, if we would like to find the location, which in our case is where the word or phrase occurs in the text of the book (not the page number), we can do the following:
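A plain-Python sketch that reports locations as word numbers, in the same sense as the short sentence example earlier. The helper names are hypothetical. Searching the raw text with `str.find()` would instead give a character offset, which is a different kind of location:

```python
def word_locations(tokens, word):
    """Return the 1-based word numbers at which `word` occurs."""
    return [i + 1 for i, token in enumerate(tokens) if token == word]


def phrase_locations(tokens, phrase):
    """Return the 1-based word numbers at which `phrase` starts."""
    target = phrase.split()
    size = len(target)
    return [i + 1 for i in range(len(tokens) - size + 1)
            if tokens[i:i + size] == target]


tokens = "My book is short but my book is mine".split()
print(word_locations(tokens, "book"))       # [2, 7]
print(phrase_locations(tokens, "book is"))  # [2, 7]
```

The comparison is case-sensitive, so "My" and "my" count as different words here.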

The above statements return a location within the text itself, similar to the word numbers we saw in our short sentence example at the beginning of the tutorial.

Putting It All Together

Let’s put what we have learned into a single Python script. The following script will read our book and return the word frequencies, along with the 30 most frequent bigrams in the book, in addition to the location of a word and a phrase in the book:

Conclusion

As we have seen in this tutorial, building an index can be daunting even for a short text. Also, a fully automated way of building the optimal index for a book might not be feasible.

We were able to approach this issue using Python and the NLTK library, picking candidate words and phrases for the book index based on their frequency of occurrence in the book, which we used as a rough measure of importance.

There is, of course, more you can do with NLTK, as shown in the library’s documentation. You can also refer to the book Natural Language Processing with Python if you would like to go deeper into this library.