We use the term len to get the length of something, which we'll apply here to the book of Genesis:

    >>> len(text3)
    44764

So Genesis has 44,764 words and punctuation symbols, or "tokens." A token is the technical name for a sequence of characters, such as a word or a punctuation symbol, that we treat as a unit. When we count the number of tokens in a text, say, the phrase to be or not to be, we are counting occurrences of these sequences. Thus, in our example phrase there are two occurrences of to, two of be, and one each of or and not, making six tokens in all. But there are only four distinct vocabulary items in this phrase.

How many distinct words does the book of Genesis contain? To work this out in Python, we have to pose the question slightly differently. The vocabulary of a text is just the set of tokens that it uses, since in a set, all duplicates are collapsed together.
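The difference between counting tokens and counting vocabulary items can be tried on the example phrase itself. This is a minimal sketch that uses a plain Python list as a stand-in for a loaded text; the variable names are illustrative only.

```python
# A tiny stand-in for a tokenized text: a plain list of tokens.
tokens = "to be or not to be".split()

# len() counts every token, including repeats.
token_count = len(tokens)

# set() collapses duplicates, leaving only the distinct vocabulary items.
vocabulary = sorted(set(tokens))

print(token_count)      # 6
print(vocabulary)       # ['be', 'not', 'or', 'to']
print(len(vocabulary))  # 4
```

The same two steps, len of the token list and len of its set, should carry over to a full text such as text3.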

Note: You can also plot the frequency of word usage through time using m/ngrams

Now, just for fun, let's try generating some random text in the various styles we have just seen. To do this, we type the name of the text followed by the term generate. (We need to include the parentheses, but there's nothing that goes between them.)

    >>> text3.generate()
    In the beginning of his brother is a hairy man , whose top may reach
    unto heaven ; and ye shall sow the land of Egypt there was no bread in
    all . So shall thy wages be ? And they made their father ; and Isaac
    was old , and kissed him : and Laban with his cattle in the midst of
    the hands of Esau thy first born , and Phichol the chief butler unto
    his son Isaac , she

Note: The generate method may not be available in every version of NLTK.

1.4 Counting Vocabulary

The most obvious fact about texts that emerges from the preceding examples is that they differ in the vocabulary they use. In this section we will see how to use the computer to count the words in a text in a variety of useful ways. As before, you will jump right in and experiment with the Python interpreter, even though you may not have studied Python systematically yet. Test your understanding by modifying the examples, and trying the exercises at the end of the chapter.

Let's begin by finding out the length of a text from start to finish, in terms of the words and punctuation symbols that appear.

You might like to try more words (e.g., liberty, constitution) and different texts. Can you predict the dispersion of a word before you view it? As before, take care to get the quotes, commas, brackets and parentheses exactly right.

    >>> text4.dispersion_plot(["democracy", "freedom", "duties", "America"])

Figure 1.2: Lexical Dispersion Plot for Words in Presidential Inaugural Addresses: This can be used to investigate changes in language use over time.

Important: You need to have Python's NumPy and Matplotlib packages installed in order to produce the graphical plots used in this book. Please see http://nltk.org/ for installation instructions.

The term common_contexts allows us to examine just the contexts that are shared by two or more words, such as monstrous and very. We have to enclose these words by square brackets as well as parentheses, and separate them with a comma:

    >>> text2.common_contexts(["monstrous", "very"])
    a_pretty is_pretty am_glad be_glad a_lucky

Your Turn: Pick another pair of words and compare their usage in two different texts, using similar and common_contexts.

It is one thing to automatically detect that a particular word occurs in a text, and to display some words that appear in the same context. However, we can also determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot. Each stripe represents an instance of a word, and each row represents the entire text. In Figure 1.2 we see some striking patterns of word usage over the last 220 years (in an artificial text constructed by joining the texts of the Inaugural Address Corpus end-to-end). You can produce this plot by applying the term dispersion_plot to text4.
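The positional information behind a dispersion plot is simply the list of offsets at which a word occurs. The sketch below computes those offsets for a short invented token list; word_offsets is a hypothetical helper written for this illustration, not an NLTK function, and the sample sentence is made up.

```python
def word_offsets(tokens, target):
    """Return the positions (counted from 0) at which `target` occurs."""
    wanted = target.lower()
    return [i for i, tok in enumerate(tokens) if tok.lower() == wanted]

# Invented sample text; a real run would use a much longer token list.
tokens = "we the people hold that freedom and democracy protect freedom".split()

print(word_offsets(tokens, "freedom"))    # [5, 9]
print(word_offsets(tokens, "democracy"))  # [7]
```

Plotting one row of such offsets per word, with the x-axis running over the whole text, gives the striped picture described above.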

(Note that this corpus is uncensored!) Once you've spent a little while examining these texts, we hope you have a new sense of the richness and diversity of language. In the next chapter you will learn how to access a broader range of text, including text in languages other than English.

A concordance permits us to see words in context. For example, we saw that monstrous occurred in contexts such as the _ pictures and a _ size. What other words appear in a similar range of contexts? We can find out by appending the term similar to the name of the text in question, then inserting the relevant word in parentheses:

    >>> text1.similar("monstrous")
    mean part maddens doleful gamesome subtly uncommon careful untoward
    exasperate loving passing mouldy christian few true mystifying
    imperial modifies

Austen uses this word quite differently from Melville; for her, monstrous has positive connotations, and sometimes functions as an intensifier like the word very.
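One way to picture "words that appear in a similar range of contexts" is to record, for every word, the pairs of neighbours it occurs between, and then look for other words that share at least one such pair. The sketch below does this for a small invented token list. It is an illustration of the idea only and makes no claim about how NLTK computes or ranks its results; contexts_by_word and similar_words are hypothetical names.

```python
from collections import defaultdict

def contexts_by_word(tokens):
    """Map each word to the set of (previous, next) pairs it appears between."""
    contexts = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((prev, nxt))
    return contexts

def similar_words(tokens, target):
    """Return words that share at least one (previous, next) context with `target`."""
    contexts = contexts_by_word(tokens)
    target_contexts = contexts.get(target, set())
    return sorted(
        word for word, ctx in contexts.items()
        if word != target and ctx & target_contexts
    )

# Invented sample built around the contexts "the _ pictures" and "a _ size".
tokens = ("the monstrous pictures and the strange pictures "
          "and a monstrous size and a great size").split()

print(similar_words(tokens, "monstrous"))  # ['great', 'strange']
```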

                     CHAPTER 55 Of the monstrous Pictures of Whales. I shall ere l
     ing Scenes. In connexion with the monstrous pictures of whales, I am strongly
    ere to enter upon those still more monstrous stories of them which are to be fo
    ght have been rummaged out of this monstrous cabinet there is no telling. But
    of Whale - bones ; for Whales of a monstrous size are oftentimes cast up dead u

The first time you use a concordance on a particular text, it takes a few extra seconds to build an index so that subsequent searches are fast.

Your Turn: Try searching for other words; to save re-typing, you might be able to use up-arrow, Ctrl-up-arrow or Alt-p to access the previous command and modify the word being searched.
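To see what a concordance view involves, here is a minimal sketch that lists each occurrence of a word with a few tokens of context on either side. It works on a plain token list and is not NLTK's implementation; the function name, the width parameter and the sample sentence are assumptions made for this example.

```python
def concordance(tokens, target, width=3):
    """Return one line per occurrence of `target`, with `width` tokens of context."""
    wanted = target.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == wanted:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the target words line up in a column.
            lines.append(f"{left:>20} {tok} {right}")
    return lines

tokens = "one was of a most monstrous size and a monstrous fable".split()
for line in concordance(tokens, "monstrous"):
    print(line)
```

Each printed line keeps the searched word in the same column, which is what makes the contexts easy to scan by eye.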

You can also try searches on some of the other texts we have included. For example, search Sense and Sensibility for the word affection, using text2.concordance("affection"). Search the book of Genesis to find out how long some people lived, using text3.concordance("lived"). You could look at text4, the Inaugural Address Corpus, to see examples of English going back to 1789, and search for words like nation, terror, god to see how these words have been used differently over time. We've also included text5, the NPS Chat Corpus: search this for unconventional words like im, ur, lol.

Take care to get spelling and punctuation right, and remember that you don't type the >>>.

    >>> from nltk.book import *
    *** Introductory Examples for the NLTK Book ***
    Loading text1, ..., text9 and sent1, ..., sent9
    Type the name of the text or sentence to view it.
    Type: 'texts()' or 'sents()' to list the materials.
    text1: Moby Dick by Herman Melville 1851
    text2: Sense and Sensibility by Jane Austen 1811
    text3: The Book of Genesis
    text4: Inaugural Address Corpus
    text5: Chat Corpus
    text6: Monty Python and the Holy Grail
    text7: Wall Street Journal
    text8: Personals Corpus
    text9: The Man Who Was Thursday by G . K . Chesterton 1908

Any time we want to find out about these texts, we just have to enter their names at the Python prompt:

    >>> text1
    <Text: Moby Dick by Herman Melville 1851>
    >>> text2
    <Text: Sense and Sensibility by Jane Austen 1811>

Now that we can use the Python interpreter and have some data to work with, we're ready to get started.

1.3 Searching Text

There are many ways to examine the context of a text apart from simply reading it.

A concordance view shows us every occurrence of a given word, together with some context. Here we look up the word monstrous in Moby Dick by entering text1 followed by a period, then the term concordance, and then placing "monstrous" in parentheses:

    >>> text1.concordance("monstrous")
    Displaying 11 of 11 matches:
     ong the former, one was of a most monstrous size. This came towards us,
     on of the psalms. " touching that monstrous bulk of the whale or ork we have r
    ll over with a heathenish array of monstrous clubs and spears. Some were thick
     d as you gazed, and wondered what monstrous cannibal and savage could ever hav
    that has survived the flood ; most monstrous and most mountainous! That Himmal
    they might scout at Moby Dick as a monstrous fable, or still worse and more de
    th of Radney.

It consists of about 30 compressed files requiring about 100Mb disk space. The full collection of data (i.e., all in the downloader) is nearly ten times this size (at the time of writing) and continues to expand.

Once the data is downloaded to your machine, you can load some of it using the Python interpreter. The first step is to type a special command at the Python prompt which tells the interpreter to load some texts for us to explore: from nltk.book import *. This says "from NLTK's book module, load all items." The book module contains all the data you will need as you read this chapter. After printing a welcome message, it loads the text of several books (this will take a few seconds). Here's the command again, together with the output that you will see.

In Python, it doesn't make sense to end an instruction with a plus sign. The Python interpreter indicates the line where the problem occurred (line 1 of <stdin>, which stands for "standard input").

Now that we can use the Python interpreter, we're ready to start working with language data.

1.2 Getting Started with NLTK

Before going further you should install NLTK 3.0, downloadable for free from http://nltk.org/. Follow the instructions there to download the version required for your platform.

Once you've installed NLTK, start up the Python interpreter as before, and install the data required for the book by typing the following two commands at the Python prompt, then selecting the book collection as shown in Figure 1.1.

    >>> import nltk
    >>> nltk.download()

Figure 1.1: Downloading the NLTK Book Collection: browse the available packages using nltk.download(). The Collections tab on the downloader shows how the packages are grouped into sets, and you should select the line labeled book to obtain all data required for the examples and exercises in this book.

The >>> prompt indicates that the Python interpreter is now waiting for input. When copying examples from this book, don't type the ">>>" yourself. Now, let's begin by using Python as a calculator. Once the interpreter has finished calculating the answer and displaying it, the prompt reappears. This means the Python interpreter is waiting for another instruction.

Your Turn: Enter a few more expressions of your own. You can use asterisk for multiplication and slash for division, and parentheses for bracketing expressions.

The preceding examples demonstrate how you can work interactively with the Python interpreter, experimenting with various expressions in the language to see what they do. Now let's try a nonsensical expression to see how the interpreter handles it:

    >>> 1 +
      File "<stdin>", line 1
        1 +
          ^
    SyntaxError: invalid syntax

This produced a syntax error.
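The calculator-style use of Python, and the way an incomplete expression is rejected, can also be tried from a script. The following is a minimal sketch: the expressions are illustrative, and is_valid_expression is a hypothetical helper that uses the built-in compile function to test whether Python can parse a string as an expression.

```python
# Multiplication binds tighter than addition and subtraction,
# and parentheses override that ordering.
total = 1 + 5 * 2 - 3
grouped = (1 + 5) * 2 - 3

def is_valid_expression(source):
    """Return True if Python can parse `source` as a single expression."""
    try:
        compile(source, "<stdin>", "eval")
    except SyntaxError:
        return False
    return True

print(total)                                 # 8
print(grouped)                               # 9
print(is_valid_expression("1 + 5 * 2 - 3"))  # True
print(is_valid_expression("1 +"))            # False
```

Catching SyntaxError here plays the role of the error message the interactive interpreter prints for an expression that ends in a plus sign.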

Under Unix you can run Python from the shell by typing idle (if this is not installed, try typing python). The interpreter will print a blurb about your Python version; simply check that you are running Python 3.2 or later (here it is for 3.4.2):

    Python 3.4.2 (default, Oct 15 2014, 22:01:37)
    [GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>>

Note: If you are unable to run the Python interpreter, you probably don't have Python installed correctly. Please visit http://python.org/ for detailed instructions. NLTK 3.0 works for Python 2.6 and 2.7.
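The version check can also be done from inside Python rather than by reading the startup blurb. This is a small sketch using the standard sys module; the minimum of (3, 2) is an assumption chosen for illustration and should be adjusted to whatever your NLTK release requires, and meets_minimum is a hypothetical helper.

```python
import sys

# Assumption for illustration: treat 3.2 as the minimum supported version.
MINIMUM_VERSION = (3, 2)

def meets_minimum(version_info, minimum=MINIMUM_VERSION):
    """Compare the (major, minor) part of a version tuple against a minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)

print(sys.version)                      # the same blurb the interpreter shows at startup
print(meets_minimum(sys.version_info))  # whether this interpreter meets the minimum
```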

We're all very familiar with text, since we read and write it every day. Here we will treat text as raw data for the programs we write, programs that manipulate and analyze it in a variety of interesting ways. But before we can do this, we have to get started with the Python interpreter.

1.1 Getting Started with Python

One of the friendly things about Python is that it allows you to type directly into the interactive interpreter, the program that will be running your Python programs. You can access the Python interpreter using a simple graphical interface called the Interactive DeveLopment Environment (IDLE). On a Mac you can find this under Applications > MacPython, and on Windows under All Programs > Python.

