Mini-Project: Convert exported Metafilter comments to HTML, JSON, or MBOX
October 29, 2021 3:08 PM

I wrote a little utility to convert the massive text file one obtains from the Export Your Comments page into a variety of other formats suitable for various purposes. Currently converts to HTML, JSON, or Unix-style MBOX (mailbox) format.

A bunch of years ago (okay, 12.5 years ago—but who's counting?), I created a Python script to convert my exported Metafilter comments to XML, for some reason. I guess it seemed like a good idea at the time. I think I had some idea about turning lengthy comments into blog posts, or something. Alcohol was involved.

Anyway, the world has moved on and away from XML, which was a pain in the ass especially for encapsulating HTML.

Still desiring a way to narcissistically ponder my old comments, last week I wrote a new converter script. Like any good example of Second System Syndrome, this one runs much more slowly, uses an order of magnitude more memory, and requires Python 3.

Currently it allows you to convert your comments export file into HTML, JSON, or MBOX formats:
  • HTML mode is meant for casually reading through comments, and produces a single large file that's (probably) HTML5-compliant. Each comment entry is wrapped in a <div class="comment">, and the post text itself in <div class="comment-text">, so you can style it with CSS if you want. Out-of-the-box, comments are pretty readable, with the only exception being tables constructed using <pre> tags, which are ugly.
  • JSON mode is for people who want to engage in further nerdery with their contributions, and outputs a single array containing all comments, with each comment represented as an object with date, url, and html keys. Double quotes are escaped with leading backslashes, and Unicode characters are represented as ASCII escape sequences, as the JSON spec permits. I'm not really sure what the use case is for this one, but it was easy to implement, so there you go. Maybe feed it into an AI and shitpost forever from beyond the grave?
  • Unix Mailbox (mbox) mode is sort of my personal favorite, and converts each comment into a valid MIME multipart/mixed message containing the HTML content, and the original URL in a pseudo-signature block. Date/time posted is stored in the message Date header, which corresponds to the "Date Sent" in most email programs. The Message-ID is constructed from the Mefi subdomain (www, ask, metatalk, etc.), the post ID, and the comment ID in a way that's predictable, to allow for duplicate filtering. The References header is populated by a value unique to the post only, allowing multiple comments in the same discussion to be grouped, if your mail program supports it. It should also, in theory, allow comments from multiple people to be combined together and sorted correctly. The resulting mbox file seems to work fine when dumped directly into my Dovecot mailserver's mail boxes directory and viewed with Apple Mail.
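To make the mbox scheme above concrete, here is a minimal sketch using only the standard library. It is not the script's actual code: the function names, the Subject line, the ID_DOMAIN value, and the exact Message-ID layout are all assumptions made for illustration. It only shows the general idea of a predictable per-comment Message-ID, a per-post References value, and an HTML part with the URL in a signature-style footer.

```python
import html
import mailbox
from email.message import EmailMessage
from email.utils import format_datetime

# Hypothetical domain, used only so the IDs are shaped like real Message-IDs.
ID_DOMAIN = "mefi.invalid"


def comment_to_message(subdomain, post_id, comment_id, posted, url, body_html):
    """Build one multipart/mixed message from one comment.

    `posted` is expected to be a datetime; every parameter name here is an
    assumption, not taken from the original script.
    """
    msg = EmailMessage()
    msg["Subject"] = f"{subdomain} post {post_id}, comment {comment_id}"  # assumed
    msg["Date"] = format_datetime(posted)
    # Same comment always yields the same Message-ID, so duplicates can be filtered.
    msg["Message-ID"] = f"<{subdomain}.{post_id}.{comment_id}@{ID_DOMAIN}>"
    # Same post always yields the same References value, so clients can group comments.
    msg["References"] = f"<{subdomain}.{post_id}@{ID_DOMAIN}>"
    # HTML body followed by the original URL in a pseudo-signature block.
    footer = f"<p>-- <br>{html.escape(url)}</p>"
    msg.set_content(body_html + footer, subtype="html")
    msg.make_mixed()  # wrap the HTML part in a multipart/mixed container
    return msg


def write_mbox(path, messages):
    """Append messages to an mbox file, creating the file if needed."""
    box = mailbox.mbox(path)
    try:
        for msg in messages:
            box.add(msg)
        box.flush()
    finally:
        box.close()
```

Two comments on the same post would get different Message-ID values but an identical References value, which is what lets a threading mail client group them.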
Converting my comments (all 10,964 of 'em) to either HTML or JSON takes just under one second. MBOX conversion is significantly slower, taking about 20 seconds to complete. The script was written for and tested on Python 3.9.1, and uses only standard libraries.
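For anyone curious what the HTML and JSON outputs described above might look like in code, here is a rough standard-library sketch. It is an illustration rather than the script itself: the sample data, the date format, the function names, and where the date and URL land in the HTML are all assumptions. Only the date, url, and html keys and the two div classes come from the description above.

```python
import html
import json

# Hypothetical sample data; the date format and URL are made up.
comments = [
    {
        "date": "2021-10-29 15:08",
        "url": "https://example.invalid/post/1#comment-2",
        "html": 'She called it "naïve".',
    }
]


def to_json(items):
    """Serialize all comments as one JSON array of objects."""
    # ensure_ascii=True (the default) writes non-ASCII characters as \uXXXX
    # escapes; double quotes inside strings are escaped with a backslash.
    return json.dumps(items, indent=2, ensure_ascii=True)


def to_html(items):
    """Render all comments into one HTML page, one div per comment."""
    parts = [
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8"><title>Comments</title></head><body>',
    ]
    for item in items:
        link = html.escape(item["url"], quote=True)
        when = html.escape(item["date"])
        parts.append('<div class="comment">')
        # Placement of the date and link is an assumption.
        parts.append(f'<p><a href="{link}">{when}</a></p>')
        # The comment body is already HTML, so it is inserted unescaped.
        parts.append(f'<div class="comment-text">{item["html"]}</div>')
        parts.append("</div>")
    parts.append("</body></html>")
    return "\n".join(parts)
```

Reading the JSON back with json.loads should return the original strings, since the escaping is purely a serialization detail.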

It's meant to be run from the command line, and should work on Mac, Linux, or Windows (as long as you have Python 3 installed). Usage is fairly simple: just specify the input filename and the output filename you want to create. The processing mode is determined by the file extension of the output filename. Valid extensions are "json", "html", or "mbox".

Example: python3 <script>.py my-mefi-comments.txt my-mefi-comments.json will convert the file "my-mefi-comments.txt" to JSON and write to "my-mefi-comments.json".
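Picking the processing mode from the output extension could be done along these lines. This is a sketch only: the function name mode_from_output is hypothetical, and treating the extension case-insensitively is an assumption about behavior the post does not spell out.

```python
from pathlib import Path

# The three extensions listed in the post.
VALID_MODES = ("json", "html", "mbox")


def mode_from_output(output_name):
    """Return the conversion mode implied by the output file's extension."""
    # Case-insensitive matching is an assumption, not documented behavior.
    ext = Path(output_name).suffix.lstrip(".").lower()
    if ext not in VALID_MODES:
        raise ValueError(
            f"unsupported output extension {ext!r}; "
            f"expected one of: {', '.join(VALID_MODES)}"
        )
    return ext
```

With this, an output name of "my-mefi-comments.json" selects the "json" mode, and any other extension fails early with a readable error instead of producing a file in an unexpected format.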

Comments, suggestions, etc. are welcome; feel free to fork and modify on GitHub if that's your thing, too. If anyone wants some other serialization format besides the three I've implemented, I could probably be convinced to add other options. (Protocol Buffers? ASN.1/X.690? Maildir?)
Role: programmer
posted by Kadin2048 (5 comments total) 6 users marked this as a favorite

This is amazing, thank you! I just ran this on Linux and it only took 0.01129 seconds to parse 374 comments into an HTML file. Phenomenal!
posted by mundo at 10:46 AM on October 31, 2021

Thanks for creating this, I'm delighted to have more ways to port/archive our data.

Just a quick thing that I had to modify to get it working for me: the "except NameError:" should probably be "except IndexError:". args will always be defined (so no NameError), but it might be a list with insufficient length, so when you run it without specifying the input file it crashes with an IndexError.
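A small sketch of the difference, with hypothetical function names and a plain list standing in for the script's args (the real code presumably looks different):

```python
def input_path_buggy(args):
    """Catches the wrong exception: a short list still crashes."""
    try:
        return args[1]
    except NameError:  # never triggers here, because args is always defined
        return None


def input_path_fixed(args):
    """Catches the exception that indexing a too-short list actually raises."""
    try:
        return args[1]
    except IndexError:  # triggers when no input file was given
        return None
```

Calling input_path_fixed with a one-element list returns None, while input_path_buggy with the same list lets the IndexError escape.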
posted by theyexpectresults at 10:02 PM on November 6, 2021

theyexpectresults: Nice catch! You are correct, those try/except blocks were supposed to catch IndexErrors.

Now I just have to figure out which of my other scripts I plagiarized that section from and change it there, too...
posted by Kadin2048 at 7:35 PM on November 23, 2021

This is cool! Looking at the previous thread, I'd suggest putting this in a repo, to allow for a README, easier updates, suggestions, etc.
posted by Pronoiac at 6:52 AM on August 28, 2022

Pronoiac: Very true. I kept meaning to make a proper repo for it, but just never did.

Here we go:

If anyone wants to modify and extend it, I'd be happy to take pull requests.

I've recently started to play around with the Obsidian notetaking program, and I was thinking of extending the script to generate Obsidian-flavor YAML+Markdown files, as part of a project that would also eventually involve auto-tagging the text with keywords to generate a knowledge graph. If you're interested in messing about with that sort of stuff, let me know.
posted by Kadin2048 at 7:49 PM on August 28, 2022
