B651: Homework 2

Homework 2: working with raw text

Due Monday, January 30, 11:59pm.

How to submit the assignment

Please turn in this homework by attaching your Python code and a short writeup (either plain text or PDF) in the "Assignments 2" section on the Oncourse site. The writeup should include: (a) instructions for running your program; (b) a sample transcript of running it, with some commentary about the output and what it means; and (c) answers to the problems listed below.

What to do

The Gutenberg corpus that comes with NLTK includes the following texts or sets of texts with at least 250,000 words each:

the King James version of the Bible
three novels by Jane Austen
three novels or collections of short stories by G. K. Chesterton

If we consider the team of 47 translators responsible for the King James Bible one "author", we can use these texts to study the problem of distinguishing authors by properties of the texts they write.

Using the NLTK tools described in Chapter 1 and Sections 2.1 and 2.2, figure out three different ways to distinguish the three authors. Some possibilities: the relative frequency of particular words, of proper names (you can detect capital letters by importing the string module and using the constant string.uppercase, which is named string.ascii_uppercase in Python 3), or of words with particular suffixes (for example, -ism). Hint: Austen's novels are overwhelmingly concerned with romance, while Chesterton is more interested in religion. Note that the corpus reader methods raw(), words(), sents(), and paras() take a fileids argument, which can be a single fileid (a string) or a list of fileids. It defaults to None, in which case all files are included.
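As a rough illustration of the three kinds of measurements suggested above, the sketch below computes them over plain token lists. The function names are hypothetical, and the choices made here (counting only alphabetic tokens, skipping the first token of each sentence when looking for capitals, reporting word frequency per 1,000 tokens) are assumptions rather than requirements of the assignment. Note that the capital-letter count is only a proxy for proper names: it also picks up words such as the pronoun "I".

```python
from collections import Counter


def relative_frequency(tokens, targets):
    """Occurrences of each target word per 1,000 alphabetic tokens (case-insensitive)."""
    words = [t.lower() for t in tokens if t.isalpha()]
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return {w: 1000.0 * counts[w.lower()] / total for w in targets}


def capitalized_rate(sents):
    """Fraction of alphabetic tokens, excluding each sentence's first token,
    that begin with an uppercase letter (a crude proxy for proper names)."""
    capitalized = total = 0
    for sent in sents:
        for tok in sent[1:]:  # skip the sentence-initial token
            if tok.isalpha():
                total += 1
                if tok[0].isupper():
                    capitalized += 1
    return capitalized / total if total else 0.0


def suffix_rate(tokens, suffix):
    """Fraction of alphabetic tokens that end in `suffix` and are longer than it."""
    words = [t.lower() for t in tokens if t.isalpha()]
    hits = sum(1 for w in words if w.endswith(suffix) and len(w) > len(suffix))
    return hits / len(words) if words else 0.0
```

With NLTK installed and the Gutenberg corpus downloaded, the inputs could come from calls such as nltk.corpus.gutenberg.words(fileids=[...]) and nltk.corpus.gutenberg.sents(fileids=[...]), one call per author. The exact file names are listed by nltk.corpus.gutenberg.fileids().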

One approach to disambiguating words is to consider the other words that they co-occur with. Consider the two main senses of the English word bank: (1) financial institution and (2) shore of a lake or river. We might expect bank in sense (1) to co-occur with words such as money, and bank in sense (2) to co-occur with words such as river.

The word bank occurs 54 times, in 48 different paragraphs, in the Brown corpus. Using these examples and your own knowledge of the senses, come up with a set of 10-20 words, each of which we might expect to co-occur in the same paragraph with one or the other, but not both, of the two senses. This gives two lists of words, one per sense, which separate the paragraphs containing bank into three sets: those containing one or more words from the first list only, those containing one or more words from the second list only, and those containing words from neither list or from both lists. The idea is that these contexts should separate the occurrences of bank into instances of sense (1), instances of sense (2), and miscellaneous cases.
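The three-way split described above can be sketched as follows. This is a minimal illustration, not a reference solution: the function names are hypothetical, the cue lists are short placeholders built around the two examples given earlier (money and river), and matching only the exact lowercased token "bank" is an assumption. A paragraph is taken to be a list of sentences, each a list of tokens, which is the shape returned by the paras() corpus reader method.

```python
# Placeholder cue lists; a real answer would use 10-20 words in total.
SENSE1_CUES = {"money", "loan", "account"}   # financial institution
SENSE2_CUES = {"river", "shore", "water"}    # shore of a lake or river


def paragraph_words(para):
    """Set of lowercased tokens in a paragraph (a list of token lists)."""
    return {tok.lower() for sent in para for tok in sent}


def paragraphs_containing(paras, target="bank"):
    """Paragraphs with at least one token equal to `target`, ignoring case."""
    return [p for p in paras if target in paragraph_words(p)]


def classify_paragraph(para, sense1_cues=SENSE1_CUES, sense2_cues=SENSE2_CUES):
    """Return 'sense1', 'sense2', or 'misc' (cues from neither list or from both)."""
    words = paragraph_words(para)
    has1 = bool(words & sense1_cues)
    has2 = bool(words & sense2_cues)
    if has1 and not has2:
        return "sense1"
    if has2 and not has1:
        return "sense2"
    return "misc"
```

With NLTK and the Brown corpus available, something like paragraphs_containing(nltk.corpus.brown.paras()) would collect the paragraphs to inspect while choosing the cue words.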

Then use the Gutenberg corpus to test your method. You will probably want to leave out the older texts in the corpus (the Bible and the texts by Milton, Blake, and Shakespeare) because sense (1) of bank is relatively recent. There are 57 paragraphs containing bank in the remaining texts. Report on how successful this naive method is at separating the two senses and the miscellaneous cases in this corpus. (You can consider a random subset of 20 of the 57 cases.)
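One possible way to set up this test is sketched below: drop the older texts, draw a reproducible random subset of cases, and compare the method's labels with labels assigned by hand. All function names are hypothetical. Filtering by file-name prefix assumes that the Gutenberg file IDs begin with the author or work name (for example, bible-kjv.txt or milton-paradise.txt); check the output of nltk.corpus.gutenberg.fileids() before relying on that.

```python
import random


def modern_fileids(fileids, excluded_prefixes=("bible", "milton", "blake", "shakespeare")):
    """Keep file IDs that do not start with any of the excluded prefixes."""
    return [f for f in fileids if not f.startswith(excluded_prefixes)]


def sample_cases(cases, k=20, seed=0):
    """Return k cases chosen at random (all of them if there are k or fewer).
    A fixed seed makes the subset reproducible for the writeup."""
    cases = list(cases)
    if len(cases) <= k:
        return cases
    return random.Random(seed).sample(cases, k)


def agreement(predicted, gold):
    """Fraction of cases where the predicted label equals the hand-assigned label."""
    if not gold:
        return 0.0
    matches = sum(1 for p, g in zip(predicted, gold) if p == g)
    return matches / len(gold)
```

The hand-assigned labels still have to come from reading each sampled paragraph, and a single agreement figure hides which sense the method gets wrong, so a per-category breakdown is worth reporting as well.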
