Outline of MDL

Goldsmith (2001): an example (1)

Corpus:
bake, bakes, baked, baking, vote, votes, voted, voting, list, lists, listed, listing, salt, salts, salted, salting, wait, waits, waited, waiting, mark, marks, marked, marking, pull, pulls, pulled, pulling

Length of the model (all logarithms are base 2):

List lengths
Formula: `lambda(T) + lambda(F) + lambda(Sigma)`
Value: `log 28 = 4.81`

Stems
Entries: bake, bakes, baked, baking, vote, votes, voted, voting, list, lists, listed, listing, salt, salts, salted, salting, wait, waits, waited, waiting, mark, marks, marked, marking, pull, pulls, pulled, pulling
Formula: `sum_t (log 26 * text(length)(t) + log ( [[W]] / [[t]] ) )`
Value: `150 * log 26 + 28 * log 28 = 839.67`

Suffixes
Entries: NONE
Formula: `sum_f (log 26 * text(length)(f) + log ( [[W_A]] / [[f]] ))`
Value: `0`

Signatures
Entries: {pointers to all stems} {NONE}
Formula: `sum_sigma log ( [[W]] / [[sigma]] ) + sum_sigma [ lambda(t_sigma) + lambda(f_sigma) + sum_t^(t_sigma) log ( [[W]] / [[t]] ) + sum_f^(f_sigma) log ( [[sigma]] / [[f in sigma]] )]`
Value: `29 * log 28 = 139.41`

Total length of the model: `983.89`

Length of the compressed corpus:
Formula: `sum_(w in W) [w] [log([[W]]/[[sigma_w]]) + log([[sigma_w]] / [[t_w]]) + log([[sigma_w]] / [[f_w in sigma_w]])]`
Value: `28 * log 28 = 134.61`
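The figures for this unanalyzed model can be checked with a few lines of Python. This is only a sketch under one reading of the formulas: it assumes base-2 logarithms, that `lambda(X)` is `log` of the number of items in `X`, and that every word occurs once in the corpus. The variable names are illustrative and do not come from Goldsmith (2001).

```python
import math

corpus = ("bake bakes baked baking vote votes voted voting "
          "list lists listed listing salt salts salted salting "
          "wait waits waited waiting mark marks marked marking "
          "pull pulls pulled pulling").split()

n_words = len(corpus)                      # [[W]] = 28
n_letters = sum(len(w) for w in corpus)    # 150 letters in total

# List lengths: 28 stems, 1 suffix (NONE), 1 signature (assumed lambda = log of size).
list_lengths = math.log2(n_words) + math.log2(1) + math.log2(1)

# Stems: every word is its own stem and occurs once, so [[W]] / [[t]] = 28.
stems = sum(math.log2(26) * len(w) + math.log2(n_words / 1) for w in corpus)

# Suffixes: only NONE, which has no letters and is used by every word.
suffixes = math.log2(26) * 0 + math.log2(n_words / n_words)

# Signatures: a single signature holding 28 stem pointers and 1 suffix pointer.
signatures = (math.log2(n_words / n_words)        # pointer to the signature
              + math.log2(n_words) + math.log2(1) # lambda(t_sigma) + lambda(f_sigma)
              + n_words * math.log2(n_words / 1)  # one pointer per stem
              + math.log2(n_words / n_words))     # pointer to NONE

model_length = list_lengths + stems + suffixes + signatures

# Compressed corpus: signature, stem and suffix pointer for each word.
corpus_length = sum(
    math.log2(n_words / n_words) + math.log2(n_words / 1) + math.log2(n_words / n_words)
    for _ in corpus
)

print(f"stems         {stems:.2f}")          # 839.67
print(f"signatures    {signatures:.2f}")     # 139.41
print(f"model         {model_length:.2f}")   # 983.89
print(f"corpus        {corpus_length:.2f}")  # 134.61
```

With no morphological analysis, almost all of the cost sits in the stem list, because each of the 150 letters is spelled out at `log 26` bits.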

Goldsmith (2001): an example (2)

Corpus:
bake, bakes, baked, baking, vote, votes, voted, voting, list, lists, listed, listing, salt, salts, salted, salting, wait, waits, waited, waiting, mark, marks, marked, marking, pull, pulls, pulled, pulling

Length of the model:

List lengths
Formula: `lambda(T) + lambda(F) + lambda(Sigma)`
Value: `log 7 + log 6 + log 2 = 6.39`

Stems
Entries: bak, vot, list, salt, wait, mark, pull
Formula: `sum_t (log 26 * text(length)(t) + log ( [[W]] / [[t]] ) )`
Value: `26 * log 26 + 7 * log 7 = 141.86`

Suffixes
Entries: NONE, e, es, s, ed, ing
Formula: `sum_f (log 26 * text(length)(f) + log ( [[W_A]] / [[f]] ))`
Value: `9 * log 26 + 2 * log 10.3 + 2 * log 3.4 + log 4.6 = 54.76`

Signatures
Signature 1: stems {*bak, *vot}, suffixes {*e, *es, *ed, *ing}
Signature 2: stems {*list, *salt, *wait, *mark, *pull}, suffixes {NONE, *s, *ed, *ing}
Formula: `sum_sigma log ( [[W]] / [[sigma]] ) + sum_sigma [ lambda(t_sigma) + lambda(f_sigma) + sum_t^(t_sigma) log ( [[W]] / [[t]] ) + sum_f^(f_sigma) log ( [[sigma]] / [[f in sigma]] )]`
Value: `log 14 + log 5.6 + log 2 + log 4 + 2 * log 3.5 + 4 * log 4 + log 5 + log 4 + 5 * log 1.4 + 4 * log 4 = 35.66`

Total length of the model: `6.39 + 141.86 + 54.76 + 35.66 = 238.67`

Length of the compressed corpus:
Formula: `sum_(w in W) [w] [log([[W]]/[[sigma_w]]) + log([[sigma_w]] / [[t_w]]) + log([[sigma_w]] / [[f_w in sigma_w]])]`
Value: `8 * (log 3.5 + log 2 + log 4) + 20 * (log 1.4 + log 5 + log 4) = 134.61`
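The following Python sketch shows how the two signatures and the compressed-corpus length above can be recovered from the corpus once the seven stems are given. It assumes base-2 logarithms and one occurrence per word. The helper `split_word` and the variable names are hypothetical, introduced only for illustration, and are not Goldsmith's (2001) own procedure.

```python
import math
from collections import defaultdict

corpus = ("bake bakes baked baking vote votes voted voting "
          "list lists listed listing salt salts salted salting "
          "wait waits waited waiting mark marks marked marking "
          "pull pulls pulled pulling").split()
stems = ["bak", "vot", "list", "salt", "wait", "mark", "pull"]


def split_word(word):
    """Split a word into (stem, suffix); an empty suffix is written NONE."""
    for stem in stems:
        if word.startswith(stem):
            return stem, word[len(stem):] or "NONE"
    raise ValueError(f"no stem found for {word!r}")


# Collect the set of suffixes that each stem appears with.
suffixes_of = defaultdict(set)
for word in corpus:
    stem, suffix = split_word(word)
    suffixes_of[stem].add(suffix)

# A signature groups the stems that share exactly the same suffix set.
signatures = defaultdict(list)
for stem, suffix_set in suffixes_of.items():
    signatures[frozenset(suffix_set)].append(stem)

n_words = len(corpus)  # [[W]] = 28

# Stem list: log 26 bits per letter plus a pointer of log([[W]] / [[t]]) bits.
stem_cost = sum(
    math.log2(26) * len(t) + math.log2(n_words / len(suffixes_of[t]))
    for t in stems
)

# Compressed corpus: signature, stem and suffix pointer for every word.
corpus_length = 0.0
for suffix_set, sig_stems in signatures.items():
    n_sigma = len(sig_stems) * len(suffix_set)  # words covered by the signature
    n_stem = len(suffix_set)                    # occurrences of each stem
    n_suffix = len(sig_stems)                   # occurrences of each suffix in the signature
    bits = (math.log2(n_words / n_sigma)
            + math.log2(n_sigma / n_stem)
            + math.log2(n_sigma / n_suffix))
    corpus_length += n_sigma * bits

for suffix_set, sig_stems in signatures.items():
    print(sorted(sig_stems), sorted(suffix_set))
print(f"stems   {stem_cost:.2f}")      # 141.86
print(f"corpus  {corpus_length:.2f}")  # 134.61
```

The compressed corpus costs the same as in the unanalyzed model, because the three pointers for each word still add up to `log 28` bits. The saving comes entirely from the shorter model.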