Calvin and Markov
July 6, 2015 5:55 AM

Calvin and Markov digests Calvin and Hobbes strips and generates endless new comics with random, semi-coherent dialogue using Markov chains. Here are some details about how I built it.

This is very much in the spirit of (and based on the code from) my Garkov project from a few years back, but better implemented. It renders images you can easily download or share if you want to.

Currently it's working off about 14 months of marked-up strip transcripts (out of ten years of total potential material; it'd take another 25-30 hours of work to prep the remaining text), and is selecting from 36 different empty-word-balloon strip templates (out of about 3,000 total strips Watterson published). I may supplement both of those over time; interested folks could in principle help out with either, though marking up more of the transcripts would be the easier of the two to help with.
Role: Creator, coder, etc
posted by cortex (22 comments total) 6 users marked this as a favorite
This project was posted to MetaFilter by the man of twists and turns on July 6, 2015: Calvin and Markov

An excellent illustration of ironic process theory by way of The Game.
posted by jedicus at 6:39 AM on July 6, 2015

Not to nitpick a brilliant idea beautifully executed, but it really should be called Markov and Hobbes.
posted by Kattullus at 7:23 AM on July 6, 2015

I went back and forth! I also considered going reductive with "Calkov", to run with the Garkov heritage. But ultimately I just settled on one.

The argument I'll make for this landing spot is that it catches the looks-legit-but-wait flow of the thing: it's not immediately clear that something's wrong with a strip, just as it's not immediately clear something's up with "Calvin and...".
posted by cortex at 7:26 AM on July 6, 2015 [1 favorite]

I like this a lot. I've been itching for a reread of the whole strip ever since you started talking about it on Twitter, so maybe I'll help out with transcription along the way. Is there any more to the input format than the example in the blog post, or other quirks? I take it Spaceman Spiff is considered a separate character from Calvin, for instance. Does he or Susie get the "S:" designator?
posted by Hargrimm at 7:31 AM on July 6, 2015

Very well done. Particularly appreciate the writeup of how you did it.

Any idea how hard it would be to bundle up the transcription task to run on Amazon's Mechanical Turk?
posted by Nelson at 7:31 AM on July 6, 2015

Is there any more to the input format than the example in the blog post, or other quirks?

Not much more. I have a few little details that I should just put into a blog post along with maybe a per-year series of files for markup that willing folks could take dibs on in the comments.

Any idea how hard it would be to bundle up the transcription task to run on Amazon's Mechanical Turk?

I've never actually tried to put MT to work, so I really don't know. If the notional blog post turns out to be a good jumping off point for recommendations there, I can think about it.
posted by cortex at 8:10 AM on July 6, 2015

I'd be interested in assisting with transcription, if you like.
posted by Chrysostom at 9:20 AM on July 6, 2015

Very slick – the ones I've looked at were seamless. ImageMagick is such a hassle that I've shied away from this kind of project, so much respect for gritting your way through that!
posted by ignignokt at 10:02 AM on July 6, 2015

For those interested in helping with transcription, here's a rundown on how I'm approaching it.

Simplest thing is probably to just send me an email and I can mail you a chunk of text to work on as you're able.
posted by cortex at 10:32 AM on July 6, 2015

Upon first glance, I worried whether a comic with much more meaning (and humor) could be Markov-ized as well as you did to Garfield. Maybe a little less LOL, but still very good. But I would've recommended you do Dilbert next (and still do).
posted by oneswellfoop at 12:38 PM on July 6, 2015

This is fantastic. Here's my favourite so far, it's kind of beautiful.
posted by oulipian at 1:09 PM on July 6, 2015 [1 favorite]

The writeup page is pretty fantastic.
posted by joseph conrad is fully awesome at 3:35 PM on July 6, 2015

Do you have the new version of the transcripts available anywhere?
posted by jjwiseman at 4:01 PM on July 6, 2015

Not yet, but I'll put it up sometime soon; I've got a few folks who've gotten chunks to work on so I figured I'd update the master file when those are back and stick that up.
posted by cortex at 4:14 PM on July 6, 2015

Ah, that explains it.
posted by churl at 7:08 PM on July 6, 2015

Can someone explain how a Markov chain is being used to choose sentences? This will betray my ignorance, but how would Markov results contrast with, say, random selection? What criteria are used? I guess in the context of genetics, it seems that there's a goal in mind (estimating phasing etc.), but here I'm sort of at a loss.
posted by lzd at 11:24 PM on July 13, 2015

At the moment, each sentence is generated as its own thing divorced from any larger strip context; for each word balloon in a given strip definition, I switch to the computed Markov frequency table for that character and say "give me a random utterance from this character" until I get back something that will fit in the word balloon (or, in cases where several tries fail, a placeholder like "...", which usually only comes up for very small balloons).
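
A minimal sketch of that loop, in Python (an assumption; the thread doesn't say what the project is written in). It builds a first-order word table per character, asks for random utterances, and falls back to "..." after a fixed number of tries. The names `build_table`, `random_utterance`, and `fill_balloon` are hypothetical, and the "fits in the balloon" check is reduced to a character count as a stand-in for whatever the real rendering test is.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # sentinel tokens marking utterance boundaries

def build_table(utterances):
    """Map each word to the list of words seen right after it (order-1 chain)."""
    table = defaultdict(list)
    for line in utterances:
        words = [START] + line.split() + [END]
        for cur, nxt in zip(words, words[1:]):
            table[cur].append(nxt)
    return table

def random_utterance(table, rng, max_words=30):
    """Walk the chain from START until END (or a length cap) is reached."""
    word, out = START, []
    while len(out) < max_words:
        # Repeated entries in the list make frequent followers more likely.
        word = rng.choice(table[word])
        if word == END:
            break
        out.append(word)
    return " ".join(out)

def fill_balloon(table, max_chars, rng, tries=20):
    """Retry until an utterance fits; give up with a placeholder."""
    for _ in range(tries):
        text = random_utterance(table, rng)
        if text and len(text) <= max_chars:  # stand-in for a real fit test
            return text
    return "..."
```

On the contrast with plain random selection: picking a whole transcript line at random can only ever repeat a line that already exists, whereas the chain picks each next word based on the current one, so it can splice the start of one line onto the end of another wherever they share a word.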

So there's no mechanism for enforcing topical coherence from utterance to utterance. One thing my code does support but which I'm not using yet here is the ability to seed a generated sentence from a specific word; instead of asking for a totally random utterance, I could say "give me a random utterance containing the word 'Hobbes'" and the code would make an attempt to start with that seed word and build a sentence forward and backward. There's no guarantee it'd succeed at that (maybe the possible sentences containing the word are all too long to fit the balloon, say), so the fallback would be to shrug and get a truly random one instead.
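
One way the seed-word idea could look, again as a hedged Python sketch rather than the project's actual code: keep a forward table and a backward table, start at the seed, walk backward to the start of the sentence and forward to the end, and return `None` if nothing fits so the caller can fall back to a fully random utterance. `build_tables`, `walk`, and `seeded_utterance` are hypothetical names, and the character-count fit check is an assumption.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # sentinel tokens marking utterance boundaries

def build_tables(utterances):
    """Build follower (forward) and predecessor (backward) word tables."""
    fwd, back = defaultdict(list), defaultdict(list)
    for line in utterances:
        words = [START] + line.split() + [END]
        for prev, nxt in zip(words, words[1:]):
            fwd[prev].append(nxt)
            back[nxt].append(prev)
    return fwd, back

def walk(table, word, stop, rng, max_steps=30):
    """Follow the table from `word` until `stop` or the step cap is hit."""
    out = []
    for _ in range(max_steps):
        word = rng.choice(table[word])
        if word == stop:
            break
        out.append(word)
    return out

def seeded_utterance(fwd, back, seed, max_chars, rng, tries=20):
    """Try to build an utterance around `seed`; None means 'use the fallback'."""
    if seed not in fwd:  # this character never says the seed word
        return None
    for _ in range(tries):
        before = walk(back, seed, START, rng)[::-1]  # built backward, so reverse
        after = walk(fwd, seed, END, rng)
        text = " ".join(before + [seed] + after)
        if len(text) <= max_chars:
            return text
    return None
```

A caller would use it as `seeded_utterance(...) or <a truly random utterance>`, which matches the shrug-and-fall-back behavior described above.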

One approach to using that technique, which I've dabbled with for other Markov projects with mixed results, is to take the content of e.g. word balloon #1 and try to build the sentence for word balloon #2 by using words from #1 as seeds. So if Calvin says "Ol' buddy ol' Hobbes", I could break that down into words, and ask for a sentence using "Hobbes" or "ol'" or "buddy" to see if I can get the next utterance to relate topically to the previous one.

But there are issues that arise:

1. Hobbes may not have any utterances that contain some or all of those words, in which case he'd just fall back on producing a non-sequitur anyway. Not a problem per se, but it means doing extra computational work to end up doing the same old thing anyway.

2. Hobbes may have very few potential utterances that contain one of those words, in which case he'll end up responding very predictably to that prompt. This can end up looking like the randomness of the process is broken.

3. The keyword may not really be topic-leading in a useful way; if Calvin says "Hobbes" (not an odd thing for him to say) but then Hobbes is induced to come up with something using his own name, the resulting utterance may feel less rather than more conversational.

4. Many keywords aren't interesting. If Calvin says "This is unfair" and Hobbes comes up with something using the word "this" or "is", it may not be in any way noticeably related. One possible approach to avoiding this issue would be to rank the potential seed keywords in an utterance by rarity to try and chain off less frequent and so more topically interesting words and fail down the list toward more common words until one succeeds.
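
The rarity ranking in point 4 could be sketched like this in Python. Everything here is an assumption for illustration: the helper names (`normalize`, `word_counts`, `ranked_seeds`, `pick_reply`), the choice to count words over the responding character's transcript lines, and the lowercasing, which would have to match however the Markov table itself stores words.

```python
from collections import Counter

def normalize(word):
    """Lowercase and trim common punctuation (keeps apostrophes, as in "ol'")."""
    return word.strip('.,!?"').lower()

def word_counts(utterances):
    """Count how often each normalized word appears in a set of utterances."""
    counts = Counter()
    for line in utterances:
        counts.update(normalize(w) for w in line.split())
    return counts

def ranked_seeds(previous, counts):
    """Words from the previous balloon, rarest first, skipping unknown words."""
    words = {normalize(w) for w in previous.split()}
    words.discard("")
    known = [w for w in words if counts[w] > 0]
    # Sort by frequency, then alphabetically so ties come out in a stable order.
    return sorted(known, key=lambda w: (counts[w], w))

def pick_reply(previous, counts, try_seed, fallback):
    """Try each seed from rarest to most common; fall back if all of them fail."""
    for seed in ranked_seeds(previous, counts):
        reply = try_seed(seed)  # expected to return None on failure
        if reply is not None:
            return reply
    return fallback()
```

Here `try_seed` is whatever seeded generator is in use and `fallback` is the plain random one, so point 1 (no matching utterances) still ends in a non-sequitur, just after trying the more interesting words first.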

All of those are hazy issues, though; none are actually bad, per se, and may actually contribute to more interesting output of one sort or another, if still probably pretty weird and incoherent at that. I'll probably futz around with some of those ideas and see if the results are interesting enough to make part of the generation process. But I also look forward to folding in more edited transcripts as they come in to spice up the overall variety of the output.
posted by cortex at 8:05 AM on July 14, 2015 [2 favorites]

My favorite so far.
posted by like_neon at 5:44 AM on August 20, 2015 [1 favorite]
