Friday, January 4, 2013

Text generation using Markov chains.

Introduction

If you are already familiar with finite automata, then understanding a Markov chain will not be as hard. If you are not familiar with one, you can think of finite automata as a flowchart, with states and paths leading from one state to another based on some decision being made. An example could be a room with a light switch; there are the two states of having the light on or off. Using the light switch will transition you between the two states.

The difference between standard finite automata and Markov chains is that instead of having some definite outcome determining each path (eg. yes or no), they are probabilities. Take for example a coin flip, there is a 50% (0.5) chance that it could either be heads or tails.

An Example

Let's explore Markov chains a little deeper with respect to natural languages and take a look at some sample text that will form the corpus of our analysis.

I like turtles. I like rabbits. I don't like snails.

This would render a Markov chain as follows:

The above chain means that, among the sentences:

  • 100% (1.0) will start with I.
  • This will be followed by like for 66% (0.66) of the time and don't for 33% (0.33) of the time.
  • The word don't will always (100% / 1.0) be followed by like.
  • Then lastly, turtles, rabbits and snails will all follow like 33% (0.33) of the time.

The nice thing about Markov chains, with respect to natural language modelling, is that they form word dependencies without any knowledge of the language's syntax or semantics. The chain gets created purely based on statistical knowledge extracted from the corpus.

Text Generation

Random Selection

Using the Markov chain example from before and taking the weight of probabilities into account, we can randomly traverse the chain to generate new sentences like:

I don't like turtles. I like snails.

Although, our small corpus does not make for very interesting or new text. We could make it more interesting by adding to it or switching to a much larger corpus altogether. In my case, I decided to take all of Shakespear's sonnets and generate the text from that. You can see some example code that uses it as the corpus below.

The Code

The Markov chains will be built up for each new sentence it comes across. The critical parts to the MarkovChain object are its:

  • Cardinalities: The cardinality of the sample space for a given state. In our original example, the cardinality of like is 3 since snails, rabbits and turtles follow out of the like state. The cardinality of I would be 2 since don't and like follow.
  • Occurrences: The edge valued occurrences between the different states. For example, a transition from the I to like state occurred twice. The probability can then be easily calculated by taking the number of occurrences on a transition and divide it by the cardinality of the state's sample space. Note, there is also a special start state transition that can be used at the beginning of each new sentence generation.

Using these two parts, a random walk can be made across a chain path, starting from the special start state. Each walk will generate a new sentence based on probabilities of the transitions along the path.



Additional Resources