flat filer
June 4, 2011 5:47 PM

I have been working on this for a while but somehow it had never occurred to me to post it here. It's not really a blog, per se. Rather, it is more of a reference or index. I describe it as "tools and techniques for textual data" (the alliteration was accidental). It's not an exhaustive collection but it is fairly comprehensive. Most of the low-hanging fruit has been picked, so posts are now few and far between. I would welcome any suggestions you might have. You can email me through MeFi...
Role: Curator
posted by jim in austin (13 comments total) 1 user marked this as a favorite

Awesome idea! Thanks for sharing.
posted by armoir from antproof case at 8:44 PM on June 5, 2011


Man. Just the other day I was trying to dig through reddit searching for a flat file database I recall seeing. I wish I had seen your site!
posted by pwnguin at 9:34 PM on June 5, 2011


the alliteration was accidental

I believe you. Thousands wouldn't.

Found a *nix text processing tool recently that I'd managed not to notice for years: look.
posted by flabdablet at 3:14 AM on June 6, 2011


Oooooo, look looks cool...
posted by jim in austin at 8:38 AM on June 6, 2011


Wow. Data Wrangler probably would have saved me three hours' work this morning. I am going to give it a try.
posted by oulipian at 4:59 PM on June 7, 2011


GNU recutils looks really promising for a problem that's been vexing me for a while now. Bookmarked, as is your blog.
posted by suetanvil at 1:25 PM on June 16, 2011


Thanks for compiling this. I generally throw together tools as I need them in Perl, but this will definitely save some time. You can start posting obscure, but helpful, language-specific modules or functions if you start running out of tools. For example, htmlspecialchars() can do a better job than multiple preg_replace() to sanitize text in PHP.
posted by breadboard at 10:53 PM on June 23, 2011


I'm far more of a tool user than a tool maker. Simple throwaway scripts are pretty much the limit of my programming. I'm aware that almost all languages have facilities for handling textual data but I have avoided including that sort of thing on the site, with the possible exception of my fascination with AWK. Adding more XML-related tools is something I'm considering, though...
posted by jim in austin at 5:25 AM on July 2, 2011


XML is a very poor fit with most of the traditional text processing tools, but you can make it far less tiresome by using an alternative representation called Pyx as an intermediary. Here for example is a script I wrote for manipulating the VirtualBox media registry, which is spread across a bunch of XML files; it uses XmlStarlet to convert XML to Pyx and vice versa.
posted by flabdablet at 6:04 AM on July 2, 2011


Pyx appears to be quite useful. Love the idea of being able to parse snarled, eye-glazing XML with a trivial one-liner. O'Reilly has a helpful page on it as well. Thanks for the suggestion...
posted by jim in austin at 9:27 AM on July 3, 2011


The one-liner in the page you linked,
xmln fig1.xml | grep "^-" | awk '{print substr($0,2)}' | wc -w
I would personally prefer as
xmln fig1.xml | sed -n '/^-/s///p' | wc -w
just because I like sed's conciseness, and it gives me an excuse to make good use of an obscure sed feature (the s command's empty substitution pattern means "the pattern matched by the address regexp").
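
A quick way to check that the two pipelines agree, without needing xmln installed, is to feed both of them a few hand-written Pyx lines. This is only a sketch: the sample data and the pyx_sample function are made up for illustration, and the awk program is single-quoted on the assumption of a POSIX shell, so the shell leaves $0 alone.

```shell
# Hypothetical sample: roughly what an XML-to-Pyx converter would emit for
# <doc><p>hello flat world</p><p>two words</p></doc>
# "(" opens an element, ")" closes it, "-" carries character data.
pyx_sample() {
	printf '%s\n' '(doc' '(p' '-hello flat world' ')p' '(p' '-two words' ')p' ')doc'
}

# grep + awk: keep the data lines, then strip the leading "-".
count_awk=$(pyx_sample | grep '^-' | awk '{print substr($0,2)}' | wc -w)

# sed: the empty regex in s/// reuses the address regex /^-/,
# so the leading "-" is deleted and the rest of the line printed.
count_sed=$(pyx_sample | sed -n '/^-/s///p' | wc -w)

echo "awk: $count_awk  sed: $count_sed"
```

Both counts should come out the same, since each pipeline hands wc the same character data with the leading "-" removed.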

There are several sed patterns that are useful with Pyx. One I quite like uses range addressing to select material between tags:
sed -n '
/^(tag$/,/^)tag$/{
	/^[()]tag$/!{
		# commands here see only stuff that was
		# between <tag> and </tag> in the original XML
	}
}'
Nested range addresses also map neatly to nested XML tags (can awk do nested range addresses? Can't remember):
sed -n '
/^(outer$/,/^)outer$/{
	/^(inner$/,/^)inner$/{
		/^[()]inner$/!{
			# commands here see only stuff that was
			# between <outer> ... <inner> and
			# </inner> ... </outer> in the original XML
		}
	}
}'
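
As a concrete instance of the nested pattern, here is a small sketch with a real command in the innermost block. The input is hand-written Pyx rather than converter output, and the names pyx_nested and inner_text are made up for this example.

```shell
# Hypothetical Pyx input, roughly what a converter would emit for
# <doc><outer><inner>keep me</inner><other>skip me</other></outer>
#      <inner>stray</inner></doc>
pyx_nested() {
	printf '%s\n' '(doc' '(outer' '(inner' '-keep me' ')inner' \
		'(other' '-skip me' ')other' ')outer' \
		'(inner' '-stray' ')inner' ')doc'
}

# Print character data from <inner> elements that sit inside <outer>.
inner_text() {
	sed -n '
	/^(outer$/,/^)outer$/{
		/^(inner$/,/^)inner$/{
			/^[()]inner$/!{
				s/^-//p
			}
		}
	}'
}

pyx_nested | inner_text
```

The stray <inner> after </outer> is ignored because the inner range is only ever tested while the outer range is active. One caveat: a range ends at the first closing line it sees, so this approach does not cope with an element nested inside another element of the same name.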

posted by flabdablet at 10:56 AM on July 3, 2011


(can awk do nested range addresses? Can't remember)

Found a discussion of the topic originally on comp.lang.awk...
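
Whatever that thread concluded, one general point holds: an awk range is a pattern, not a statement, so it cannot be placed inside an action block the way sed addresses nest inside braces. A common workaround is to track state with flag variables. The sketch below is one way to do that, not the approach from the linked discussion; the function name inner_text_awk and the sample input are made up for illustration.

```shell
# Hypothetical helper: select data inside <inner> within <outer>,
# using flags instead of nested range patterns.
inner_text_awk() {
	awk '
	$0 == "(outer" { in_outer = 1; next }
	$0 == ")outer" { in_outer = 0; in_inner = 0; next }
	in_outer && $0 == "(inner" { in_inner = 1; next }
	in_outer && $0 == ")inner" { in_inner = 0; next }
	in_outer && in_inner && /^-/ { print substr($0, 2) }
	'
}

# Hand-written Pyx lines; only the first data line is inside both elements.
printf '%s\n' '(outer' '(inner' '-keep me' ')inner' '-skip me' ')outer' \
	'(inner' '-stray' ')inner' | inner_text_awk
```

It is wordier than the sed version, but the flags make it easy to add conditions such as counting depth.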
posted by jim in austin at 12:18 PM on July 3, 2011


I love you. Seriously.
posted by cross_impact at 4:50 PM on November 15, 2011



