Skip to content

How does 'srilm gen num ' work?

Werkov edited this page Oct 12, 2011 · 2 revisions

Idea was that it chooses the most probable word for given context whene generating sentence.

Actual implementation is different. LM.cc lines 1097–1117 (LM::generateWord()) are most important. The conditional probabilities are put in a row on [0,1] interval and the word whose interval contains randomly generated number (from [0, 1)) is chosen as continuation.

I did simple calculations and the number of tested words depends on those conditional probs. distribution. $N$ is the size of vocabulary, $p_i$ is probability of $i$-th word conditioned by the current context, $X$ is random generated number for decision, $M(X)$ is number of tested words till we find desired "sum". $$ P[M(X) = m] = P[\sum_{i=1}^{m-1} p_i \le X < \sum_{i=1}^{m} p_i] = p_m,, $$ where we use the fact that $X$ is uniformly distributed on [0,1). Therefore $$ \mathbb{E}M(X) = \sum_{m=1}^N P[M(X) = m] m = \sum_{m=1}^N p_m m $$

Conclusion: this is not what we want for "nice" implementation of getAllPossibilities.

Clone this wiki locally