Where Do the Probabilities Come From?

Take a sample of English text, and calculate how often different letters occur in it.
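
As a minimal sketch of this counting step, the Python below tallies letter frequencies in a short string. The function name letter_frequencies and the toy sample sentence are assumptions made for illustration; a meaningful estimate would need a much larger sample of English text.

```python
from collections import Counter

def letter_frequencies(text):
    """Map each letter a-z found in text to its relative frequency."""
    # Keep only the 26 unaccented letters, ignoring case, digits and punctuation.
    letters = [ch for ch in text.lower() if "a" <= ch <= "z"]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {letter: count / total for letter, count in counts.items()}

# Toy sample (an assumption for illustration, far too small to be representative).
sample = "the cat sat on the mat"
freqs = letter_frequencies(sample)

# Print the letters from most to least frequent.
for letter, p in sorted(freqs.items(), key=lambda item: -item[1]):
    print(f"{letter}: {p:.3f}")
```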

  • Break the stream of random letters into “words” by treating the space as just another letter with its own probability; the result looks more realistic if the distribution of word lengths is forced to agree with what it is in English
  • Generate sentences by choosing words at random, each with the same probability with which it appears in the corpus
  • There are about 40,000 reasonably commonly used words in English, so by counting them in a large corpus of text we can estimate how common each word is
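
The sampling steps in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the tiny corpus string and the names symbol_weights, random_letters and random_sentence are hypothetical, and the sketch does not force the word-length distribution to match English; it only treats the space as one more symbol.

```python
import random
from collections import Counter

# Toy corpus (an assumption for illustration; a real estimate needs far more text).
corpus = "the cat sat on the mat and the dog sat on the log"

def symbol_weights(text):
    """Count letters a-z and spaces, treating the space as just another letter."""
    counts = Counter(ch for ch in text.lower() if ch == " " or "a" <= ch <= "z")
    symbols = list(counts)
    return symbols, [counts[s] for s in symbols]

def random_letters(text, n, rng):
    """Draw n symbols independently, each with its frequency in text."""
    symbols, weights = symbol_weights(text)
    return "".join(rng.choices(symbols, weights=weights, k=n))

def random_sentence(text, n_words, rng):
    """Draw n_words whole words independently, each with its corpus frequency."""
    counts = Counter(text.lower().split())
    words = list(counts)
    weights = [counts[w] for w in words]
    return " ".join(rng.choices(words, weights=weights, k=n_words))

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
print(random_letters(corpus, 40, rng))
print(random_sentence(corpus, 8, rng))
```

Because each symbol or word is drawn independently of its neighbours, the output has realistic frequencies but no grammar or meaning.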
