You have probably come across some of those large textbooks and noticed the index
at the end. With a hard copy, it is nice to have such an index to navigate to the
desired page quickly. I have recently published a very short book, and when it came to
setting the index, the task seemed daunting even though the book is very short. The
book doesn’t have an index yet anyway.
If you have been following my articles, you will
notice that I mainly write about Python and how it can help us in solving
different issues in a simple manner. So let’s see how we can set a book index
using Python. Without further ado, let’s get started.
What Is a Book Index?
I am sure that most of you know what a book index is, but I just want to quickly
clarify this concept.
A book index is simply a collection of words and/or phrases that are considered
important to the book, along with their locations in the book. The index does
not contain every word/phrase in the book. The reason for that is
shown in the next section.
What Makes a Good Book Index?
What if you had an index through which you can find the location of each word or phrase
in the book? Wouldn’t that be considered the index of choice? Wrong!
The index of choice, or what would be considered a good index, is one that points to the
important words and phrases in the book. You might be questioning the reason
for that. Let’s take an example. Say that we have a book that consists only of
the following sentence:
My book is short
What would happen if we tried to index each word and phrase in that very short sentence,
assuming that the location is the word number in the sentence? This is the
index that we would have in this case:
my book is short: 1
my book is: 1
my book: 1
my: 1
book is short: 2
book is: 2
book: 2
is short: 3
is: 3
short: 4
From the example above, we can see that such an index would be larger than the book
itself! So a good index would be one that contains only the words and
phrases considered important to the reader.
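To see how quickly an exhaustive index grows, here is a minimal sketch in plain Python that lists every word and contiguous phrase of the example sentence together with the word number where it starts. The function name `exhaustive_index` is made up for this illustration and is not part of any library.

```python
def exhaustive_index(sentence):
    """Map every contiguous word sequence to the word number where it starts."""
    words = sentence.lower().split()
    index = {}
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            phrase = ' '.join(words[start:end])
            # Keep only the first location; positions are 1-based word numbers.
            index.setdefault(phrase, start + 1)
    return index


entries = exhaustive_index('My book is short')
for phrase, location in entries.items():
    print(f'{phrase}: {location}')
print(len(entries), 'index entries for a 4-word book')
```

A sentence of n words produces n(n+1)/2 such entries, so the index outgrows the text very quickly.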
Natural Language Toolkit (NLTK)
In this tutorial, we will be using the Natural Language Toolkit (NLTK) library, which is
used to work with human language data. As mentioned in the documentation, NLTK
has been called “a wonderful tool for teaching, and working in, computational
linguistics using Python,” and “an amazing library to play with natural
language.”
I am writing this tutorial from my Ubuntu machine, and the steps for installing NLTK
in this section will be relevant to the Ubuntu Operating System. But don’t
worry, you can find the steps for installing NLTK on other Operating Systems on the NLTK website.
In order to install NLTK, I’m going to use pip. If you don’t have pip
installed already, you can use the following command in your terminal to
install pip:
sudo easy_install3 pip
To make sure you have pip installed, type the following command:
pip --version
You should get something similar to the following:
pip 8.1.2 from /usr/local/lib/python3.5/dist-packages/pip-8.1.2-py3.5.egg (python
Now, to install NLTK, simply run the following command in your terminal:
sudo pip install -U nltk
You can test the nltk installation by typing
python, and then importing nltk in your terminal. If you get
ImportError: No module named nltk, this thread might help you.
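If you prefer a check that does not fail with a traceback, the following small sketch asks Python whether the nltk package can be found before importing it. The variable name `nltk_available` is only for illustration.

```python
import importlib.util

# find_spec returns None when the package cannot be found on the import path.
nltk_available = importlib.util.find_spec('nltk') is not None

if nltk_available:
    import nltk
    print('NLTK version:', nltk.__version__)
else:
    print('NLTK is not installed for this interpreter')
```

If the second message appears even though you ran the install command, pip may have installed NLTK for a different Python interpreter than the one you are running.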
At this point, we need a test file
(book) to use for creating a book index. I’ll grab this book: The Rate of Change of the Rate of Change by the EFF. You can download the text file of the book from Dropbox. You can of course use
any book of your choice; you just need something to experiment with in this
tutorial.
Let’s start with the interesting part of this tutorial, the program that will help us form
the book index. The first thing we want to do is find the word frequency in the
book. I have shown how we can do that in another tutorial, but I want
to show you how we can do that using the NLTK library.
This can be done as follows:
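import collections
import nltk
# Imported here because the bigram steps later in the tutorial use it.
from nltk.collocations import BigramCollocationFinder

frequencies = collections.Counter()
with open('bigd10.txt') as book:
    read_book = book.read()
# word_tokenize needs NLTK's tokenizer data; if it is missing, NLTK raises
# a LookupError that names the resource to download.
words = nltk.word_tokenize(read_book)
for w in words:
    frequencies[w] += 1
print(frequencies)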
When you run the program, you will notice that we will have a very long list of words
and their frequencies.
Before moving further, let’s analyze the above code a bit. In the following line:
frequencies = collections.Counter()
We create a Counter, a dictionary subclass from the collections module, and use it
to keep track of the word frequencies
in the book (how many times each word occurred in the book).
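As a side note, a Counter can also count a whole list in one step, and its `most_common()` method returns the most frequent items first. Here is a minimal sketch that uses a short hand-written word list (an assumption, standing in for the tokens of the book) so that it runs without the book file:

```python
import collections

# A tiny stand-in for the token list returned by nltk.word_tokenize.
words = ['my', 'book', 'is', 'short', 'and', 'my', 'book', 'is', 'new']

# Passing the list to Counter does the same counting as an explicit loop.
frequencies = collections.Counter(words)

# most_common(n) returns the n most frequent items as (word, count) pairs.
for word, count in frequencies.most_common(3):
    print(word, count)
```

For a whole book, printing only the top entries this way is much easier to read than printing the entire Counter.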
word_tokenize, on the other hand, splits the text into its
constituent tokens (words and punctuation marks). Let’s take a simple example to see how
word_tokenize actually works:
from nltk.tokenize import word_tokenize

# Double quotes are needed because the sentence contains an apostrophe.
sentence = "My name is Abder. I like Python. It's a pretty nice programming language"
print(word_tokenize(sentence))
The output of the above script is as follows:
['My', 'name', 'is', 'Abder', '.', 'I', 'like', 'Python', '.', 'It', "'s",
'a', 'pretty', 'nice', 'programming', 'language']
We then loop through the words and find the frequency of occurrence of each word.
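Notice that the tokens include punctuation such as '.' and fragments such as "'s", which are not useful index entries. One optional refinement, which is not part of the program in this tutorial, is to keep only alphabetic tokens and lowercase them before counting. The sketch below hard-codes the token list from the example so that it runs without NLTK:

```python
import collections

# Tokens as produced by word_tokenize for the example sentence.
tokens = ['My', 'name', 'is', 'Abder', '.', 'I', 'like', 'Python', '.',
          'It', "'s", 'a', 'pretty', 'nice', 'programming', 'language']

# str.isalpha() is False for '.', and for "'s" because of the apostrophe.
clean_tokens = [t.lower() for t in tokens if t.isalpha()]

frequencies = collections.Counter(clean_tokens)
print(clean_tokens)
print(len(tokens), 'tokens before filtering,', len(clean_tokens), 'after')
```

Lowercasing merges entries such as 'Computer' and 'computer', which may or may not be what you want for proper nouns.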
What about phrases (combinations of words)? Phrases whose words occur together
often are called collocations. The simplest case is bigrams,
that is, pairs of adjacent words. Similarly, trigrams are combinations
of three words, and so forth (i.e. n-grams).
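To make the idea of n-grams concrete, here is a small sketch in plain Python that builds them from a list of tokens. The helper name `ngrams` is my own for this illustration; NLTK ships its own helpers for the same job.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip over n shifted copies of the list; it stops at the shortest one.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ['my', 'book', 'is', 'short']
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A text of k tokens has k - n + 1 n-grams, so the four-word sentence yields three bigrams and two trigrams.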
Let’s say we want to extract the bigrams from our book. We can do that as follows:
bigram = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)
The number 2 in the
apply_freq_filter() function tells the finder to ignore all bigrams
that occur fewer than two times in the book.
If we want to find the
30 best-scoring bigrams in the book, ranked here by pointwise mutual information (PMI) rather than by raw count, we can use
the following code statement:
print (finder.nbest(bigram.pmi, 30))
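Since the ranking above depends on PMI, it may help to see what that score measures. The following is a rough sketch of the usual PMI definition in plain Python, not NLTK's actual implementation. The function name `bigram_pmi`, the parameter `min_freq`, and the sample sentence are all assumptions made for this illustration.

```python
import collections
import math


def bigram_pmi(tokens, min_freq=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    word_counts = collections.Counter(tokens)
    pair_counts = collections.Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scores = {}
    for (first, second), count in pair_counts.items():
        if count < min_freq:
            continue  # same idea as apply_freq_filter(2)
        # PMI = log2( P(first, second) / (P(first) * P(second)) ),
        # with probabilities estimated from raw counts.
        scores[(first, second)] = math.log2(
            count * total / (word_counts[first] * word_counts[second]))
    return scores


tokens = 'the rate of change of the rate of change is high'.split()
scores = bigram_pmi(tokens)
for pair, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(pair, round(score, 2))
```

PMI rewards pairs whose words appear together more often than their individual frequencies would suggest, which is why the frequency filter matters: without it, pairs of rare words that occur together only once would get the highest scores.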
If we would like to find the location, which in our case is where the word or
phrase occurs in the book (not the page number), we can do the following:
print(read_book.index('computer'))
print(read_book.index('Assisted Reporting'))
The above statements return the character offset of the first occurrence of the word or
phrase in the text, which plays the same role as the word number in our short
sentence example at the beginning of the tutorial.
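Keep in mind that `index()` only reports the first occurrence and raises a ValueError if the text is not found. An index entry usually lists every place a term appears, so here is a minimal sketch that collects all character offsets with the standard `re` module. The helper name `find_all` and the sample text are assumptions for this illustration.

```python
import re


def find_all(text, term):
    """Return the character offset of every occurrence of term in text."""
    # re.escape treats the term literally; \b keeps 'computer' from
    # matching inside 'computers'.
    pattern = r'\b' + re.escape(term) + r'\b'
    return [match.start() for match in re.finditer(pattern, text)]


sample = 'A computer helps. Computer Assisted Reporting needs a computer.'
print(find_all(sample, 'computer'))
print(find_all(sample, 'Assisted Reporting'))
```

The search is case-sensitive, so 'Computer' at the start of the second sentence is not reported for 'computer'. Passing `re.IGNORECASE` to `re.finditer` would change that.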
Putting It All Together
Let’s put what we have learned into a single Python script. The following script will read
our book and return the word frequencies, along with the 30 best-scoring
bigrams in the book, in addition to the location of a word and a phrase in the
book:
import collections
import nltk
from nltk.collocations import BigramCollocationFinder

# Count how often each token occurs in the book.
frequencies = collections.Counter()
with open('bigd10.txt') as book:
    read_book = book.read()
words = nltk.word_tokenize(read_book)
for w in words:
    frequencies[w] += 1

# Collect bigrams, ignoring those that occur fewer than two times.
bigram = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)

print('These are the words and their frequency of occurrence in the book:')
print(frequencies)
print('#################################################################')
print('These are the 30 best-scoring bigrams (by PMI) in the book:')
print(finder.nbest(bigram.pmi, 30))
# Character offset of the first occurrence of a word and of a phrase.
print(read_book.index('computer'))
print(read_book.index('Assisted Reporting'))
As we have seen in this tutorial, even a short text can be very daunting when it
comes to building an index for that text. Also, a fully automated way of building
the optimum index for the book might not be feasible.
We were able to tackle this issue using Python and the NLTK
library, where we could pick the best words and phrases for the book index
based on their frequency of occurrence (i.e. importance) in the book.
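As a rough sketch of that last step, the snippet below picks the most frequent words of a text and lists where each one occurs, which is the raw material for index entries. To stay self-contained it uses a simple regular expression instead of NLTK for tokenizing, and the names `build_index`, `STOP_WORDS`, and `top_n`, as well as the sample text, are assumptions made for this illustration.

```python
import collections
import re

# A tiny, hand-picked stop list; an assumption for this sketch only.
STOP_WORDS = {'the', 'a', 'of', 'is', 'and', 'in', 'to'}


def build_index(text, top_n=3):
    """Map the top_n most frequent non-stop words to their character offsets."""
    locations = collections.defaultdict(list)
    for match in re.finditer(r'[A-Za-z]+', text):
        word = match.group().lower()
        if word not in STOP_WORDS:
            locations[word].append(match.start())
    # Rank candidate terms by how often they occur.
    ranked = sorted(locations, key=lambda w: len(locations[w]), reverse=True)
    # Return the chosen terms in alphabetical order, as a printed index would.
    return {word: locations[word] for word in sorted(ranked[:top_n])}


text = 'The index of a book lists words. A good index lists important words in the book.'
for term, offsets in build_index(text).items():
    print(f'{term}: {offsets}')
```

Frequency alone is a crude measure of importance, so a human pass over the candidate terms would still be needed before calling the result a finished index.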
There is, of course, more you can do with NLTK, as shown in the library’s documentation.
You can also refer to the book Natural Language Processing with Python if you would like to go deeper in