git-history: a tool for analyzing scraped data collected using Git and SQLite
December 12, 2021 9:58 AM
I've been exploring a way of running web scrapers for a few years that I call Git scraping: the idea is to scrape a source of data (like a "current power outages" map from an electricity company) into a Git repository such that the history of that repository tells the story of changes to that information over time. git-history is a new command-line utility I've released which helps convert that collected history into a SQLite database to support analysis using my Datasette tool.
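As a rough illustration of the idea described above, here is a minimal Python sketch. It is not git-history's actual implementation or schema. The `SNAPSHOTS` list, the `item_version` table, and the `load_history` function are all hypothetical, and the snapshots stand in for successive commits of a scraped JSON file. The sketch records a new row in SQLite only when an item's content changes between snapshots.

```python
import json
import sqlite3

# Hypothetical data: each tuple stands in for one commit of a scraped
# JSON file (commit timestamp, file contents at that commit).
SNAPSHOTS = [
    ("2021-12-01T10:00:00", '[{"id": "a", "customers": 120}, {"id": "b", "customers": 40}]'),
    ("2021-12-01T10:20:00", '[{"id": "a", "customers": 95}, {"id": "b", "customers": 40}]'),
    ("2021-12-01T10:40:00", '[{"id": "b", "customers": 40}]'),
]


def load_history(conn, snapshots, id_key="id"):
    """Store one row per item version, skipping snapshots where the item is unchanged."""
    # Assumed schema for this sketch only; the real tool defines its own tables.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS item_version "
        "(item_id TEXT, seen_at TEXT, data TEXT)"
    )
    latest = {}  # item id -> most recently stored serialized form
    for seen_at, raw in snapshots:
        for item in json.loads(raw):
            # sort_keys gives a stable serialization, so key order alone
            # never counts as a change.
            data = json.dumps(item, sort_keys=True)
            if latest.get(item[id_key]) != data:
                conn.execute(
                    "INSERT INTO item_version VALUES (?, ?, ?)",
                    (item[id_key], seen_at, data),
                )
                latest[item[id_key]] = data
    conn.commit()


conn = sqlite3.connect(":memory:")
load_history(conn, SNAPSHOTS)
rows = conn.execute(
    "SELECT item_id, seen_at, data FROM item_version ORDER BY item_id, seen_at"
).fetchall()
for row in rows:
    print(row)
```

With versions stored this way, a question like "when did this outage's customer count change?" becomes an ordinary SQL query. This sketch does not record the moment an item disappears from the source, which a fuller version would need to handle.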
Role: Creator
Fascinating -- nice work.
I wonder: could this be used to detect data manipulation? I'm thinking of something like the Florida COVID data that was supposedly manipulated by DeSantis, though it could be any dataset subject to nefarious oversight.
posted by lazaruslong at 3:38 AM on December 16, 2021
Nice project! This could be used with news sites as well, I guess?
posted by beesbees at 11:12 AM on December 16, 2021
Oh that's a really interesting thought: yes, if you were better at statistics than I am you could definitely use this technique to spot manipulated data!
For news sites: I've seen one example in the wild that's tracking the BBC via their RSS feeds: https://github.com/jasoncartwright/bbcrss
posted by simonw at 11:32 PM on December 18, 2021 [1 favorite]
That sounds really valuable!
I love working with the tiny, human-entered data sets that I help people manage for a living, and I have wanted for so long to learn how to do things like data scraping and cleaning with Python, and to explore the weird, wonderful formlessness (to a spreadsheet user) and flexibility of SQL… I’ve only ever dipped a toe in either. I read the Datasette post with excitement, and this one also inspires me to invest some effort in learning more so that I can explore it properly!
posted by rrrrrrrrrt at 8:50 PM on January 8, 2022 [1 favorite]
I love the idea of news edit detection. Local news I read (good ones even!) will change the text of their stories with no notice or diff.
posted by drowsy at 4:40 PM on January 27, 2022
VERY cool.
posted by kristi at 7:45 PM on December 12, 2021 [1 favorite]