Homework 5: probabilistic context-free grammars

Due Monday, Apr. 16, 11:59pm.

In this assignment you'll write a program that learns the probabilities in an probabilistic context-free grammar (PCFG) in an unsupervised fashion using the expectation-maximization algorithm. You'll start with a small corpus and an unweighted grammar which assigns multiple parses to many of the sentences. Using the frequencies of the occurrences of different rules in the grammar across the corpus, you'll adjust the probabilities on the rules until they converge. Finally you'll compare the results from your trained PCFG to a PCFG trained in a supervised fashion on a treebank.

How to submit the assignment

Please turn in this homework by attaching your Python code and a short writeup (either plain text or PDF) in the "Assignments 2" section on the Oncourse site. The writeup should include: (a) instructions for running your program, (b) a sample transcript of running it, and (c) answers to the questions described in 3 below.

Background

A probabilistic context-free grammar has a probability associated with each rule. The probabilities for all of the rules with the same left-hand side must sum to 1.0. The probability assigned to a parse returned by a parser using the grammar is the product of the probabilities of the rules that are used in the parse.
`P(T) = prod_{i=1}^n P(R_i|L_i)`
where T is the parse tree, i indexes the rules used to expand each of the n non-terminal nodes in the parse tree, and R_i and L_i are the right- and left-hand sides of rule i.

If we have a treebank of parsed sentences, we can learn the probabilities for each rule in a PCFG by counting the times a rule L → R is used and dividing this by the total of the counts for all rules with L as left-hand side.

If we don't have a treebank but we have a non-probabilistic parser, we can use expectation maximization (EM) to estimate the rule probabilities.

Start with probabilities distributed equally among the rules with the same left-hand side.
Run the parser on the sentences in the corpus, and compute a probability for each parse, using the current rule probabilities.
Re-estimate the rule probabilities
1. For each rule, count the number of times it is used in parsing the corpus, weighting the counts by the probabilities of the parses.
2. Convert the counts to probabilities by dividing by the total weighted counts for all rules with the same left-hand side.
Repeat 2 and 3 until the rule probabilities converge.

Note that this application of EM corresponds very closely to its application to estimating the translation model parameters for statistical machine translation.

What you have to do

Download these modules and text files. You will be using various NLTK classes and functions related to context-free grammars and parsing; see Chapter 8, especially 8.3, 8.4, and 8.8, in the NLTK Book for information on how CFGs and parsing are handled.

You will work with a short corpus consisting of the first 324 sentences of the story Sophie's World that have less than 16 words, from the SMULTRON treebank. The XML file smultron_en_sophie.xml contains the original treebank data (including sentences with 16 or more words as well). The file short.cfg contains the rules (productions) for a non-probabilistic context-free grammar extracted from the SMULTRON treebank. Note that, as with many treebank grammars, there is a very large number of part-of-speech categories. For this assignment, it's not necessary that you be familiar with these. Note also that "sentences" in this corpus may actually be fragments of sentences, so the top-level category is TOP rather than S.

Write the code that estimates the rule weights using EM. The sentences to be parsed are in the file short_tagged_sentences; the unweighted grammar is in short.cfg. The parser to use is nltk.parse.pchart.InsideChartParser. Since there is no lexicon, you will be treating POS tags rather than words as terminals when you parse. InsideChartParser allows you to set the beam size for the search and the number of parses returned. Try a value like 800 for the beam size and a value like 30 for the number of parses returned. You will need various functions and classes from the NLTK grammar module.
Now parse the sentences with your weighted grammar, using the probabilistic parser nltk.parse.viterbi.ViterbiParser. Compare your results with the corresponding trees from the treebank, which are in the file short_parsed_sentences, using the code in parseval.py to do the tree comparison. Also compare your results to the output of the parser using the grammar with probabilities learned directly from the treebank; that grammar is in the file observed_from_treebank.pcfg.
Answer these questions.
1. How do the grammars compare in terms of their precision and recall?
2. What are the differences between the top parses that you get with the two grammars? Are there any differences in ranking of likelihood of rules that might cause these? (For example, one grammar might think that NP → DET ADJ N is more likely than NP → DET N, but another grammar might disagree.)
3. It is possible for the EM approach to reach a set of probabilities that fails to parse some of the sentences, given the parser with beam search. How can this happen?

IU | INFO | CSCI