git-history: a tool for analyzing scraped data collected using Git and SQLite
December 12, 2021 9:58 AM

For a few years I've been exploring a way of running web scrapers that I call Git scraping: the idea is to scrape a source of data (like a "current power outages" map from an electricity company) into a Git repository, so that the history of that repository tells the story of how that information changed over time. git-history is a new command-line utility I've released that converts that collected history into a SQLite database, to support analysis using my Datasette tool.
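
To make the idea concrete, here is a minimal sketch of turning a series of scraped snapshots into a SQLite table of item versions. It is an illustration only, not how git-history itself is implemented: the `load_snapshots` function, the `item_version` table, and the sample outage data are all hypothetical, and the snapshots are passed in as plain Python lists rather than read from real Git commits.

```python
import json
import sqlite3


def load_snapshots(conn, snapshots, id_key="id"):
    """Store one row per item for each snapshot in which that item changed.

    `snapshots` is a list of (label, items) pairs in chronological order,
    standing in for successive commits of a scraped JSON file.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS item_version ("
        "item_id TEXT, snapshot_label TEXT, data TEXT)"
    )
    last_seen = {}  # item id -> JSON text of the most recently stored version
    for label, items in snapshots:
        for item in items:
            item_id = str(item[id_key])
            data = json.dumps(item, sort_keys=True)
            # Only record a new version when the item differs from the last one.
            if last_seen.get(item_id) != data:
                conn.execute(
                    "INSERT INTO item_version VALUES (?, ?, ?)",
                    (item_id, label, data),
                )
                last_seen[item_id] = data
    conn.commit()


# Hypothetical sample data: two scrapes of a power outage list.
snapshots = [
    ("2021-12-01", [{"id": 1, "customers_out": 40}, {"id": 2, "customers_out": 5}]),
    ("2021-12-02", [{"id": 1, "customers_out": 12}, {"id": 2, "customers_out": 5}]),
]
conn = sqlite3.connect(":memory:")
load_snapshots(conn, snapshots)
rows = conn.execute(
    "SELECT item_id, snapshot_label FROM item_version "
    "ORDER BY item_id, snapshot_label"
).fetchall()
```

In this sketch, outage 1 ends up with two versions because its customer count changed between scrapes, while outage 2 is stored only once. Queries such as "which items changed, and when" then become ordinary SQL.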
Role: Creator
posted by simonw (6 comments total) 1 user marked this as a favorite

WOW - I haven't gotten to take a thorough look at this yet but it's a BRILLIANT IDEA and I enthusiastically approve!

VERY cool.
posted by kristi at 7:45 PM on December 12, 2021 [1 favorite]


Fascinating -- nice work.

I wonder: could this be used to detect data manipulation? I'm thinking of something like the Florida COVID data that was supposedly manipulated by DeSantis, though it could be any dataset subject to nefarious oversight.
posted by lazaruslong at 3:38 AM on December 16, 2021


Nice project! This could be used with news sites as well, I guess?
posted by beesbees at 11:12 AM on December 16, 2021


Oh that's a really interesting thought: yes, if you were better at statistics than I am you could definitely use this technique to spot manipulated data!

For news sites: I've seen one example in the wild that's tracking the BBC via their RSS feeds: https://github.com/jasoncartwright/bbcrss
posted by simonw at 11:32 PM on December 18, 2021 [1 favorite]


That sounds really valuable!

I love working with the tiny, human-entered data sets that I help people manage for my living, and have been wanting for so long to learn how to do things like data scraping and cleaning with Python and the weird, wonderful formlessness (to a spreadsheet user) and flexibility of SQL… I’ve only ever dipped a toe in either. I read the Datasette post with excitement, and this one also inspires me to invest some effort in learning more so that I can explore it properly!
posted by rrrrrrrrrt at 8:50 PM on January 8, 2022 [1 favorite]


I love the idea of news edit detection. Local news I read (good ones even!) will change the text of their stories with no notice or diff.
posted by drowsy at 4:40 PM on January 27, 2022

