# Homework 5: probabilistic context-free grammars

Due Monday, Apr. 16, 11:59pm.

In this assignment you'll write a program that learns the probabilities in an probabilistic context-free grammar (PCFG) in an unsupervised fashion using the expectation-maximization algorithm. You'll start with a small corpus and an unweighted grammar which assigns multiple parses to many of the sentences. Using the frequencies of the occurrences of different rules in the grammar across the corpus, you'll adjust the probabilities on the rules until they converge. Finally you'll compare the results from your trained PCFG to a PCFG trained in a supervised fashion on a treebank.

### How to submit the assignment

Please turn in this homework by attaching your Python code and a short writeup (either plain text or PDF) in the "Assignments 2" section on the Oncourse site. The writeup should include: (a) instructions for running your program, (b) a sample transcript of running it, and (c) answers to the questions described in 3 below.

## Background

A probabilistic context-free grammar has a probability associated with each rule. The probabilities for all of the rules with the same left-hand side must sum to 1.0. The probability assigned to a parse returned by a parser using the grammar is the product of the probabilities of the rules that are used in the parse.
`P(T) = prod_{i=1}^n P(R_i|L_i)`
where T is the parse tree, i indexes the rules used to expand each of the n non-terminal nodes in the parse tree, and Ri and Li are the right- and left-hand sides of rule i.

If we have a treebank of parsed sentences, we can learn the probabilities for each rule in a PCFG by counting the times a rule LR is used and dividing this by the total of the counts for all rules with L as left-hand side.

If we don't have a treebank but we have a non-probabilistic parser, we can use expectation maximization (EM) to estimate the rule probabilities.

1. Start with probabilities distributed equally among the rules with the same left-hand side.
2. Run the parser on the sentences in the corpus, and compute a probability for each parse, using the current rule probabilities.
3. Re-estimate the rule probabilities
1. For each rule, count the number of times it is used in parsing the corpus, weighting the counts by the probabilities of the parses.
2. Convert the counts to probabilities by dividing by the total weighted counts for all rules with the same left-hand side.
4. Repeat 2 and 3 until the rule probabilities converge.

Note that this application of EM corresponds very closely to its application to estimating the translation model parameters for statistical machine translation.

## What you have to do

Download these modules and text files. You will be using various NLTK classes and functions related to context-free grammars and parsing; see Chapter 8, especially 8.3, 8.4, and 8.8, in the NLTK Book for information on how CFGs and parsing are handled.

You will work with a short corpus consisting of the first 324 sentences of the story Sophie's World that have less than 16 words, from the SMULTRON treebank. The XML file `smultron_en_sophie.xml` contains the original treebank data (including sentences with 16 or more words as well). The file `short.cfg` contains the rules (productions) for a non-probabilistic context-free grammar extracted from the SMULTRON treebank. Note that, as with many treebank grammars, there is a very large number of part-of-speech categories. For this assignment, it's not necessary that you be familiar with these. Note also that "sentences" in this corpus may actually be fragments of sentences, so the top-level category is `TOP` rather than `S`.

1. Write the code that estimates the rule weights using EM. The sentences to be parsed are in the file `short_tagged_sentences`; the unweighted grammar is in `short.cfg`. The parser to use is `nltk.parse.pchart.InsideChartParser`. Since there is no lexicon, you will be treating POS tags rather than words as terminals when you parse. `InsideChartParser` allows you to set the beam size for the search and the number of parses returned. Try a value like 800 for the beam size and a value like 30 for the number of parses returned. You will need various functions and classes from the NLTK grammar module.
2. Now parse the sentences with your weighted grammar, using the probabilistic parser `nltk.parse.viterbi.ViterbiParser`. Compare your results with the corresponding trees from the treebank, which are in the file `short_parsed_sentences`, using the code in `parseval.py` to do the tree comparison. Also compare your results to the output of the parser using the grammar with probabilities learned directly from the treebank; that grammar is in the file `observed_from_treebank.pcfg`.