The idea of using this algorithm came from Kernighan and Pike's *The Practice of Programming*, chapter 3.
Running the Markov algorithm with prefix length n=2 on the text of L. Frank Baum's The Wonderful Wizard of Oz might result in something like the following.
Dorothy lived in the doorway and looked upon her right foot and said: "Oz has sent for you." So the Wizard in the basket. "Good-bye!" "Good-bye!" shouted everyone, and all eyes were turned upward to where the Great Oz." "Oh, indeed!" exclaimed the man. "He has more brains than he needs." "And I am something terrible." "But, I don't care who kills her. But until she is melted," explained the Scarecrow. He had been cut. "There," said he; "now you have earned the brains you will oil the joints of my legs, I shall know everything."
As you can see, the generated text makes little sense, but it echoes the grammar, sentence structure, and dialogue of the original text. It approximates natural language in structure, if not in sense. The generated text also shares statistical properties with the original: every three-word phrase in the output appears somewhere in the input.
The algorithm builds statistics about how frequently words follow particular prefixes (groups of preceding words). For example, with prefix length n=1, each generated word is chosen at random according to the probability with which it follows the previous word in the input text. With n=2, the next generated word is based on the previous two words, and so on. Small values of n tend to produce gibberish; large values tend to reproduce the input exactly. One could imagine running the algorithm with n=0, which would produce a completely random ordering of words, but your implementation is not expected to handle that case.
This idea of "chains" puts a few serious constraints on the generated text. For example, the output will always begin with the first n words of the input. Similarly, if allowed to run to completion, the algorithm will always end on the same word as the input. The cause of this behavior will become clear later.
If words w[1], w[2], ..., w[n], w[n+1] occur consecutively in the input, then w[n+1] is said to be a suffix of the prefix w[1], w[2], ..., w[n]. Remember, n is called the prefix length.
If the same prefix-suffix pair exists x times in the input, then that suffix is said to have multiplicity x. Multiplicity of the suffix is important during text generation because suffixes are chosen according to their probability of following a given prefix as observed from the input. A suffix with multiplicity 2 has twice the probability of being chosen as a suffix with multiplicity 1.
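One way to picture this bookkeeping (a sketch only; the data structures for your program are discussed later and in the overview document) is a table mapping each prefix to every suffix observed after it, where a suffix of multiplicity x simply appears x times:

```cpp
#include <map>
#include <string>
#include <vector>

// Sketch: a prefix is its n words joined by single spaces, and a suffix of
// multiplicity x appears x times in the vector, so a uniform random pick
// from the vector is automatically weighted by multiplicity.
using Prefix = std::string;
using SuffixTable = std::map<Prefix, std::vector<std::string>>;
```

Storing a repeated suffix once per occurrence, rather than once with a count, keeps the weighted random choice trivial at the cost of some extra memory.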
Examine this passage from the New International Version of the Bible. It is our input. Notice the recurrence of certain phrases, coupled with the uniqueness of others.
Blessed are the poor in spirit,
for theirs is the kingdom of heaven.
Blessed are those who mourn,
for they will be comforted.
Blessed are the meek,
for they will inherit the earth.
Blessed are those who hunger and thirst for righteousness,
for they will be filled.
Blessed are the merciful,
for they will be shown mercy.
Blessed are the pure in heart,
for they will see God.
Blessed are the peacemakers,
for they will be called sons of God.
Matthew 5:3-9
Let's build a table of prefixes and suffixes for n=2. To keep the table from becoming very large, we will consider only the prefixes that appear in the first and last sentences, but we will list every suffix in the passage that follows those prefixes.
It is clear that any generated text must start with "Blessed are", since those are the first n words of the input. However, to build our table, every word must have a prefix, and every n-word phrase must have at least one suffix. The prefixes for the first "Blessed" and "are" spill off the front of the paragraph; similarly, the suffix of "of God." spills off the end. Rather than make our algorithm's flow control more complex to handle these boundary cases, we shall introduce artificial data.
Let NONWORD represent some word that is not allowed to occur in the input. We then treat any word that is needed but falls outside the passage as NONWORD. With that convention, the first two prefixes from this passage are "NONWORD NONWORD" and "NONWORD Blessed". Given prefix length n, we will need n artificial prefixes: the first with n NONWORDs, the second with n-1, and so on. Similarly, we shall say the suffix of "of God." is NONWORD. The end result is that to begin text generation, we use a prefix of n NONWORDs, and the algorithm terminates when NONWORD is generated.
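In code, the sentinel might look like the following sketch (using the empty string as NONWORD is an assumption of this sketch; any token that can never be read from the input will do):

```cpp
#include <deque>
#include <string>

// Assumption: the empty string can never be read from the input as a word,
// so it is safe to use as the NONWORD sentinel.
const std::string NONWORD = "";

// The starting prefix is n copies of NONWORD; generation stops as soon as
// NONWORD is chosen as a suffix (it is recorded after the last real word).
std::deque<std::string> initialPrefix(int n) {
    return std::deque<std::string>(n, NONWORD);
}
```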
Now we can build the table.
Prefix | Suffixes |
---|---|
NONWORD NONWORD | Blessed |
NONWORD Blessed | are |
Blessed are | the (multiplicity 5)<br>those (multiplicity 2) |
are the | poor<br>meek,<br>merciful,<br>pure<br>peacemakers, |
the poor | in |
poor in | spirit, |
in spirit, | for |
spirit, for | theirs |
for theirs | is |
theirs is | the |
is the | kingdom |
the kingdom | of |
kingdom of | heaven. |
of heaven. | Blessed |
the peacemakers, | for |
peacemakers, for | they |
for they | will |
they will | be (multiplicity 4)<br>inherit<br>see |
will be | comforted.<br>filled.<br>shown<br>called |
be called | sons |
called sons | of |
sons of | God. |
of God. | NONWORD |
You can see that some prefixes have several branches, while others lead to an unbroken chain. For example, "the poor in spirit, for theirs is the kingdom of heaven." is a chain with no branching, whereas the prefix "Blessed are" has two branches and "are the" has five. Notice also that "the" follows "Blessed are" more often than "those" does. You should also now see why the first n words of the output are always the same as the first n words of the input.
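During generation, the suffix must be chosen in proportion to its multiplicity. If you decide to store each distinct suffix once together with a count (one possible design; the sketch below uses names of my own invention), the weighted choice could look like this:

```cpp
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

// Sketch: choose a suffix in proportion to its multiplicity when each
// distinct suffix is stored once along with its count.
// Assumes the list of suffixes is non-empty.
std::string chooseSuffix(const std::vector<std::pair<std::string, int>>& suffixes) {
    int total = 0;
    for (std::size_t i = 0; i < suffixes.size(); ++i)
        total += suffixes[i].second;             // sum of all multiplicities
    int r = std::rand() % total;                 // 0 <= r < total
    for (std::size_t i = 0; i < suffixes.size(); ++i) {
        r -= suffixes[i].second;
        if (r < 0) return suffixes[i].first;     // landed in this suffix's bucket
    }
    return suffixes.back().first;                // unreachable when counts are positive
}
```

For the prefix "they will" above, this would pick "be" with probability 4/6 and "inherit" or "see" with probability 1/6 each.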
Now that we've defined prefixes, suffixes, and NONWORD, we can give pseudo-code for the Markov Chain algorithm. Note that there are no special cases in the flow control; the NONWORD data takes care of that.
    // build data structures
    w[1]...w[n] = NONWORD;                               // beginning of text
    while( input remains ) {
        w[n+1] = next word;
        add w[n+1] to the group of suffixes of w[1]...w[n];
        w[1]...w[n] = w[2]...w[n+1];
    }
    add NONWORD to the group of suffixes of w[1]...w[n]; // end of text

    // output new text
    w[1]...w[n] = NONWORD;
    w[n+1] = the one suffix associated with a prefix that is all NONWORDs;
    while( w[n+1] != NONWORD ) {
        add w[n+1] to the result;
        w[1]...w[n] = w[2]...w[n+1];
        w[n+1] = a randomly chosen suffix of w[1]...w[n];
        /* note: The random choice should be weighted based on
           the multiplicity of each suffix. */
    }
The pseudo-code makes the procedural requirements of the algorithm fairly clear. It should also be clear that you will need several different data structures: at minimum, something to hold the current n-word prefix and something to map each prefix to its group of suffixes.
Some of these design decisions have been dictated in the overview document.
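For concreteness, here is one possible C++ translation of the pseudo-code above. It is only a sketch built on assumptions of my own (whitespace-delimited words, a deque for the prefix window, a map keyed by the space-joined prefix, and multiplicity encoded by storing repeated suffixes); where the overview document dictates different structures, follow it instead.

```cpp
#include <cstdlib>
#include <ctime>
#include <deque>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Sketch only: NONWORD is the empty string, which operator>> can never
// produce, so it cannot collide with a real word from the input.
const std::string NONWORD = "";

// Join the n words of a prefix into a single string key for the table.
std::string key(const std::deque<std::string>& prefix) {
    std::string k;
    for (std::size_t i = 0; i < prefix.size(); ++i) {
        if (i > 0) k += ' ';
        k += prefix[i];
    }
    return k;
}

int main(int argc, char* argv[]) {
    const int n = (argc > 1) ? std::atoi(argv[1]) : 2;   // prefix length
    std::srand(static_cast<unsigned>(std::time(0)));

    // Build data structures: map each prefix to every suffix seen after it;
    // a suffix of multiplicity x simply appears x times in its vector.
    std::map<std::string, std::vector<std::string>> table;
    std::deque<std::string> prefix(n, NONWORD);          // w[1]...w[n] = NONWORD
    std::string word;
    while (std::cin >> word) {                           // while input remains
        table[key(prefix)].push_back(word);              // record suffix w[n+1]
        prefix.pop_front();                              // shift the window:
        prefix.push_back(word);                          // w[1]...w[n] = w[2]...w[n+1]
    }
    table[key(prefix)].push_back(NONWORD);               // end-of-text sentinel

    // Output new text: start from the all-NONWORD prefix and stop as soon
    // as the NONWORD sentinel is chosen as the next word.
    prefix.assign(n, NONWORD);
    for (;;) {
        const std::vector<std::string>& suffixes = table[key(prefix)];
        // A uniform pick from the vector is already weighted by multiplicity.
        std::string next = suffixes[std::rand() % suffixes.size()];
        if (next == NONWORD) break;
        std::cout << next << ' ';
        prefix.pop_front();
        prefix.push_back(next);
    }
    std::cout << '\n';
    return 0;
}
```

Because a suffix with multiplicity x appears x times in its vector, choosing an index uniformly at random gives exactly the weighting the pseudo-code asks for.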