15 March, 2011

I have been annoying my friends with IRC bots again.

The first one was a straight Markov-chain-based nonsense generator. It recorded the links between words and used them to build new sentences. For example, if you taught it “the cat sat on the mat” and “the dog sat down”, it knew that the word “the” could be followed by “cat”, “mat” and “dog”:

[Image: the resulting Markov chain. “This probably wasn't the application Markov had in mind.”]

[Image: one possible output. “Sitting on someone else's mat has consequences.”]
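A minimal sketch of that idea in Python (this is an illustration, not the bot's actual code): record each word's possible successors, then walk the table picking a random successor at every step.

```python
import random
from collections import defaultdict

def train(chain, sentence):
    """Record which words can follow each word."""
    words = sentence.split()
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)

def generate(chain, start, max_words=10):
    """Walk the chain, picking a random successor each step."""
    word = start
    out = [word]
    while word in chain and len(out) < max_words:
        word = random.choice(chain[word])
        out.append(word)
    return " ".join(out)

chain = defaultdict(list)
train(chain, "the cat sat on the mat")
train(chain, "the dog sat down")

print(sorted(set(chain["the"])))  # ['cat', 'dog', 'mat']
```

Feeding it the two example sentences gives exactly the links above: “the” can be followed by “cat”, “dog” or “mat”, so `generate(chain, "the")` can happily produce “the dog sat on the mat”.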

The problem is that, as the dictionary of words grows, the output becomes less and less satisfying. The word “the” could now be followed by thousands of words, most of which are things the poor cat can’t sit on. Because the bot only ever uses this one-word link, the sentences become more nonsensical and ungrammatical as the dictionary grows.

[Image: another possible output. “The cat takes its revenge.”]

Version 2 is not a Markov chain*. Instead it stores longer snippets of sentences and uses them to score candidates for the next word. So now, if the bot has generated “sat on the…” and needs the next word:

  • the hundreds of words that can follow “the” each have a small chance of being selected
  • the words that can follow “on the” have a higher chance of being selected (and these words are more likely to be objects that one might be “on”)
  • the few words that can follow “sat on the” have an even higher chance of being selected (and these words are more likely to be objects that can be sat upon)

… and so on.
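That weighting scheme could be sketched like this (again, an illustration rather than the bot's actual code; it follows the score described later in the comments, where each matching snippet contributes its word probability multiplied by the snippet's length):

```python
from collections import defaultdict

def train(ngrams, sentence, max_order=3):
    """Record every snippet of up to max_order words and the word that follows it."""
    words = sentence.split()
    for n in range(1, max_order + 1):
        for i in range(len(words) - n):
            ngrams[tuple(words[i:i + n])].append(words[i + n])

def score_candidates(ngrams, context):
    """Score next-word candidates; longer matching snippets contribute more."""
    scores = defaultdict(float)
    for n in range(1, len(context) + 1):
        followers = ngrams.get(tuple(context[-n:]), [])
        for word in set(followers):
            # word probability after this snippet, weighted by snippet length
            scores[word] += (followers.count(word) / len(followers)) * n
    return scores

ngrams = defaultdict(list)
train(ngrams, "the cat sat on the mat")
train(ngrams, "the dog sat down")

scores = score_candidates(ngrams, ["sat", "on", "the"])
# "mat" outscores "cat" and "dog": it follows "the", "on the" and "sat on the"
```

Here “cat”, “dog” and “mat” all score for following “the”, but only “mat” also scores for following “on the” and “sat on the”, so it dominates, which is exactly the repetition problem described next.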

The problem is that, once you’ve generated “the cat sat on the…”, the chance of choosing “mat” is very high. In other words, the bot is increasingly likely to repeat verbatim something that someone once said. I expect this effect to diminish as the dictionary grows, though.

[Image: bot output. “Order is restored.”]


It’s currently sitting in the IRC channel I frequent and has learned about 400 words (mostly on the subjects of network failures, flying light aircraft and the bot itself). Time will tell whether version 2 is an improvement.

[Image: “Nonsense generators tell you more about the human brain than about the data.” Even bots have a sense of humour.]

*  Presumably there’s a name for what it is in Maths, but I don’t know it.



  1. > the subjects of network failures, flying light aircraft

    My bad.

    • Much more interesting than the reclining habits of household pets, if you ask me.

  2. I first saw this in The Practice of Programming. Go there, they have code. The keyword to search for is n-gram. It’s not clear to me whether you have probabilities on those edges; if not, that would be the first improvement I’d try.

    (Your version 2 is a Markov chain of order n-1.)

    • Thanks for the link, I had a feeling you’d know more about it 🙂

      Would this still be considered a Markov chain, or a combination of many chains of different orders? (What is that n there?)

      Yes, the edges have probabilities built from the link frequencies in the output. I use them in the score calculation: score = sum over matching snippets of (word probability × snippet length).

      • A Markov chain of order n uses at most n previous words to decide the output. A combination of Markov chains of orders m and n is a Markov chain of order max(m,n), so I guess the answer to your first question is … “yes”. 🙂

        Sorry for the question about probabilities: I didn’t read the post carefully the first time.

        More (brainstorming) ideas:
        1. Train the bot on a log of the IRC channel, if available.
        2. After you train it, I’d try to use only the longest n-gram available.
        3. Look at this other nonsense generator.
