Due Monday, March 19, 11:59pm.
As discussed in classes, using regular expressions and a corpus with part-of-speech tags, it is possible to perform some shallow parsing, which may be suitable for some purposes. For example, we could find many simple noun phrases in English sentences by looking for ("chunking on") sequences of zero or one determiner (the, etc.), followed by zero or more adjectives (small, etc.), and a noun. We could find more complex noun phrases by chunking on prepositional phrases (in the street, etc.) and relative clauses (that we ate, etc.) and then chunking on the noun phrases which have these patterns after the noun.
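The simple English pattern just described can be written directly as an NLTK chunk grammar. This is only an illustrative sketch using a hand-tagged example sentence (Penn Treebank tags), not part of the assignment data:

```python
import nltk

# A minimal NP chunk grammar matching the pattern above:
# an optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)

# A hand-tagged example sentence (Penn Treebank tags).
sentence = [("the", "DT"), ("small", "JJ"), ("dog", "NN"),
            ("saw", "VBD"), ("a", "DT"), ("cat", "NN")]

tree = cp.parse(sentence)
print(tree)  # chunks "the small dog" and "a cat" as NP subtrees
```

Everything between the braces is a tag pattern; `?` and `*` have their usual regular-expression meanings, applied to tags rather than characters.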
NLTK makes it easy to create chunking grammars. Review Sections 7.1, 7.2, 7.3, and 7.6 of the chapter on extracting information from text in the NLTK book. These sections contain more information about chunking and about how to use NLTK to create a chunking grammar.
Download these two files, which include some tagged Swahili sentences and a little code to help you.
The tags in sw_sents are NN (noun), JJ (adjective), DT (determiner), VB (verb, non-relative), and VBR (relative verb).
hw4.py includes a function read(), which reads in sentences from a file like the included sw_sents, returning a list of tuple pairs, one for each sentence. The first element of each pair is a list of word/tag tuples, one for each word in the sentence; the second element is the sentence's English gloss.
sents() takes a list like that returned by read() and returns a list of lists of word/tag tuples, suitable for handing to a chunking grammar.
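The data shapes just described can be illustrated with invented data (the Swahili words and glosses below are made up, and sents_sketch only mimics what sents() is described as doing; the real helpers live in hw4.py):

```python
# Invented data in the shape described for read()'s return value:
# a list of (tagged-sentence, English-gloss) pairs.
pairs = [
    ([("mtoto", "NN"), ("anacheza", "VB")], "the child is playing"),
]

def sents_sketch(pairs):
    """Mimic the described sents(): keep only the word/tag lists,
    dropping the English glosses."""
    return [tagged for tagged, gloss in pairs]

print(sents_sketch(pairs))
# [[('mtoto', 'NN'), ('anacheza', 'VB')]]
```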
Your job is to write a chunking grammar for Swahili noun phrases and hand it to nltk.RegexpParser(). You do not need to include a rule for entire sentences. Then use the parser's parse() method to parse the sentences in sw_sents, handling as many as possible. (My grammar correctly chunked the noun phrases in all but the last sentence, which has a relative clause embedded in another relative clause.)