Homework 3: statistical semantics

Due Wednesday, February 22, 11:59pm.

How HAL works

HAL (Hyperspace Analogue to Language, Lund & Burgess, 1996) is an example of a semantic space model of word meaning. The model starts with a large corpus of text in some language. For each of the words in the corpus that occurs more than some minimal number of times, it records the frequencies that other words in the corpus tend to occur before and after the word within a given window size. It does this by storing a vector of co-occurrences for each word that is as long as the number of different words. Together these co-occurrence vectors make up the rows of a square matrix of co-occurrences. For each appearance of a word in the corpus, these co-occurrences are updated. Only the context before the word is recorded because the context after the word will appear in the column in the matrix that corresponds to that word. The amount that is added for each co-occurrence decreases with the distance; it is the window size minus the distance between the words (where adjacent words have a distance of 0), and it is 0 for distances greater than the window size.

The table below illustrates the values that would be recorded if the entire corpus consisted of the sentence it's the end of the glazed donut drought, if the window size were 5, and if the minimum occurrence frequency of the recorded words were set to 1.

          its    the    end    of    glazed donut  drought
its        -      0      0      0      0      0      0
the        7      -      4      5      0      0      0
end        4      5      -      0      0      0      0
of         3      4      5      -      0      0      0
glazed     1      7      3      4      -      0      0
donut      0      5      2      3      5      -      0
drought    0      3      1      2      4      5      -

Next a "meaning" vector is created for each word by concatenating its row vector and its column vector, thus including both the preceding and following contexts of the words. For example, in the example above, the meaning vector for the becomes [7, 0, 4, 5, 0, 0, 0, 0, 0, 5, 4, 7, 5, 3]

The meaning vectors can now be compared. For this we have to use a distance measure that is suitable for comparing vectors of different lengths. Cosine similarity is such a measure. Given our somewhat limited corpus of 8 word tokens and 7 word types, we don't get very interesting results. For example, the similarity between end and its, 0.542, is greater than the similarity between end and donut, 0.326.

Implementing HAL

To implement and test HAL, we need some data. For this, we will use the Brown corpus, which is accessible in NLTK.

To use the Brown corpus in NLTK, import it:

>>> from nltk.corpus import brown

Then you can use the corpus method words to get the words in the corpus as a long list of 1,161,192 tokens (punctuation is treated as words). For more on using the corpora in NLTK, see Chapter 2 of the NLTK book.

The list you will have consists of word tokens; you need to extract word types from this and then extract only the most frequent types from this. I suggest working only with words that occur at least 100 times. There should be 1092 of these.

Next for each of the word types, you need to create a meaning vector. In my version, I created a class for the whole matrix and a class for each word (though you don't have to do it this way). The following initializes the matrix with a row and column for each word. It also records where each word is, so that words' rows and vectors can be looked up by their strings and the system will know which words are not in the matrix at all (like 'python' in this example). (This assumes that words is bound to the list of frequent word types.

>>> memory = Memory(words)
>>> memory.get_position('make')
273
>>> memory.get_position('python')
-1

Next we need to record the co-occurrence statistics for the text, that is, the whole original list of word tokens, yielding something like what appears in the table above. For each of the frequent word types, you find each of its occurrences in the text, and then within the window of size WINDOW before it, you record its co-occurrences with other frequent word types, weighted by their distance from the word. I used a WINDOW of 5. Unless you implement this in a clever way, it will take a very long time to go through all the words, possibly hours, so you should start with a much smaller text while debugging your program.

>>> memory.record(text)
>>> memory.get_word('make')
[0, 1, 1, 1, 2, 0, 0, 12, 0, 2, 0, 20, 0, 0, 0, 0, 0, 267, 5, 33,
20, 0, 1, 0, 1, 0, 27, 8, 1, 43, 0, 2, 22, 0, 0, 12, 0, 0, 23, 0,
0, 0, 0, 0, 13, 0, 3, 0, 4, 4, 15, 4, 0, 4, 0, 0, 0, 0, 0, 1, ...]

Finally we need to make the meaning vector for each word by concatenating its row vector and column vector in the matrix

>>> memory.make_meanings()
>>> memory.get_word('make').meaning
[0, 1, 1, 1, 2, 0, 0, 12, 0, 2, 0, 20, 0, 0, 0, 0, 0, 267, 5, 33,
20, 0, 1, 0, 1, 0, 27, 8, 1, 43, 0, 2, 22, 0, 0, 12, 0, 0, 23, 0,
0, 0, 0, 0, 13, 0, 3, 0, 4, 4, 15, 4, 0, 4, 0, 0, 0, 0, 0, 1,
...,
0, 0, 6, 0, 0, 5, 0, 17, 3, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 10, 111,
0, 0, 0, 50, 0, 0, 0, 8, 0, 0, 6, 0, 3, 0, 0, 0, 4, 0, 0, 1, 0, 0,
7, 8, 1, 6, 0, 2]

Now we can compare the meaning vectors for words to see how well the algorithm works. We would expect word pairs like 'man'/'woman' and 'a'/'the' to be relatively similar, for example. Here is a function you can use that returns the cosine similarity between two vectors.

def cosine(vec1, vec2):
    '''Cosine similarity between two vectors (sequences).'''
    len1 = math.sqrt(sum([x**2 for x in vec1]))
    len2 = math.sqrt(sum([x**2 for x in vec2]))
    dot = sum([x*y for x,y in zip(vec1, vec2)])
    return dot / (len1 * len2)

Of course how successful the compare function is depends on the amount of data that went into the computation.

>>> mem.compare('man', 'woman')
0.95833361171489606
>>> mem.compare('man', 'believe')
0.43955269305711103
>>> mem.compare('think', 'believe')
0.82563278902519588

Select 20 pairs of words to test, show how your program responds, and comment on the results.

IU | INFO | CSCI