1,228,178 genetic variants, 2 million years
October 3, 2021 1:13 PM

I have mapped, graphed, and sonified all of the single-nucleotide variants on human chromosome 22 that were found in more than one person. Genomes were collected and sequenced by the Human Genome Diversity Project, the Simons Genome Diversity Project, and the 1000 Genomes Project, and were dated by Wohns et al. in "A unified genealogy of modern and ancient genomes". The end result is a 5-part, 42-hour video series.

The world map is Bertin's 1953 projection, as implemented by Philippe Rivière.

The chromosome map uses Jakub Červený's generalized Hilbert ("gilbert") space-filling curve.
Role: Made video and audio from data
posted by clawsoon (12 comments total) 3 users marked this as a favorite

Well, I haven't watched the whole thing yet, so don't spoil the ending for me. But this is great. I really love the use of the Hilbert curve as a border, it's very visually appealing. It reminds me of a meandros curve.
posted by biogeo at 5:59 PM on October 3

Thanks! One special feature of Hilbert/gilbert curves (which you might already be familiar with) that makes them mathematically appealing is that you can multiply the resolution of the curve by any amount you'd like and any given point along it will stay in basically the same place in 2D space. Not many other curves have that feature. As usual, 3Blue1Brown explains it well.
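
A minimal sketch of that stability property, assuming the classic power-of-two Hilbert curve rather than the generalized "gilbert" curve used in the videos (which also handles arbitrary rectangles). The function names `hilbert_d2xy` and `position` are illustrative and not taken from the project. The sketch uses the standard index-to-coordinate conversion and prints where a fixed fraction of the way along the curve lands as the resolution grows:

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2**order by 2**order grid."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the sub-square built so far so that it joins up correctly.
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def position(order, fraction):
    """Centre of the cell reached after `fraction` of the curve, in unit-square coordinates."""
    n = 1 << order
    d = min(int(fraction * n * n), n * n - 1)
    x, y = hilbert_d2xy(order, d)
    return (x + 0.5) / n, (y + 0.5) / n


# The same fraction of the curve, at increasing resolutions.
for order in (2, 4, 6, 8):
    print(order, position(order, 0.3))
```

Each doubling of the resolution splits a cell into four, and the finer curve visits those four sub-cells while the coarser curve is in the parent cell, so the printed positions can only move within an ever-smaller square.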

It turns out that the one insight I got by staring at these videos for hours on end was admirably summed up (and visualized) in a paper from the end of last year, "A variant-centric perspective on geographic patterns of human allele frequency variation". I won't spoil it for you by quoting the abstract. :-)
posted by clawsoon at 6:30 PM on October 3 [1 favorite]

I think I might've screwed up how I read the data.

I may have to rebuild it all to make it correct.


Investigation ongoing...
posted by clawsoon at 1:10 PM on October 5

That looks tidy, intriguing and HUGE. I've been in the binfo business for more than 30 years and part of that has been trying to abstract patterns from the noise. I lacked the coding chops - and the finish - to make anything that I wasn't slightly ashamed to make public. I'm glad, for example, we never got funded for that Science Gallery proposal to make a hepatitis C virus genome [a paltry 9,000 bp] out of four colour flashing LED light-strings. I forget WTF was the point of that. Hats off to you, I am impressed.
posted by BobTheScientist at 1:27 PM on October 5

A flashing LED hep C genome sounds great, BobTheScientist. You might be glad you never got funded, but I'm mildly disappointed, lol. :-D

I will say that discovering that I've just put out a giant chunk of visualized data that's partly wrong is tweaking my "slightly ashamed" and "do I have the finish?" buttons right at this very moment. It's not completely wrong, it's just not completely right, either. It'll be interesting to see how it actually looks if I can maintain the ambition to correct it.
posted by clawsoon at 3:41 PM on October 5

Onwards ad astra! I've had three good / original ideas in a life-time of science and the second, which visualised something nifty about the very first eukaryotic chromosome ever sequenced, turns out to be probably wrong through over-fitting; or at least not a general rule. Just as well [the mortification!] Nature rejected that ms.
posted by BobTheScientist at 1:22 AM on October 6 [1 favorite]

I believe I have come up with a reasonable solution, and rendering of the new movies will hopefully be done in a week or two.

If anyone is interested, I was double counting (and triple counting and occasionally 1000-counting) because the mutations in the data source I'm using are arranged in tree structures. (Building trees is part of how they estimate the dates.) They have a handy function for finding leaf nodes, but the leaf nodes returned are all the leaf nodes underneath that mutation, including on sub-branches where there have been additional mutations at that site.

So if a C changed to a G for the ancestor of 1000 people, then it changed back to a C for the ancestor of 400 of those people, then it changed back to a G for the ancestor of 100 of those people, I was counting 1000 people as having the first C>G mutation, 400 people as having the G>C mutation, and 100 people as having the second C>G mutation. I was counting 1000 people as 1500 people.
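
The overcount in that example can be reproduced with a toy tree. This is only a sketch in plain Python, not the data source's actual API: the node names, the `children` and `clade_size` dictionaries, and `leaves_below` are all hypothetical, and large groups of people are collapsed into single nodes with a size.

```python
# Hypothetical toy tree: each internal node maps to its children.
children = {
    "root": ["anc1000"],
    "anc1000": ["rest600", "anc400"],
    "anc400": ["rest300", "anc100"],
}

# Collapsed clades: how many sampled people sit directly under each of these nodes.
clade_size = {"rest600": 600, "rest300": 300, "anc100": 100}

# Mutation events at one site, keyed by the node above which they occurred.
mutations = {
    "anc1000": "C>G",
    "anc400": "G>C",
    "anc100": "C>G",
}


def leaves_below(node):
    """Count every leaf under a node, ignoring any later mutations on sub-branches."""
    total = clade_size.get(node, 0)
    for child in children.get(node, []):
        total += leaves_below(child)
    return total


# Naive approach: credit each mutation event with all leaves underneath it.
naive_counts = {node: leaves_below(node) for node in mutations}
print(naive_counts)                # per-event counts
print(sum(naive_counts.values()))  # people counted in total
```

The per-event counts come out as 1000, 400 and 100, which add up to 1500 even though the tree only contains 1000 people.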

I have now switched to counting alleles instead of mutation events. So now 700 people will be counted as C>G, 300 people won't be counted because they're back at the ancestral C state, and the date given will be that of the earliest C>G event at that site. This also means that mutations which were calculated to have occurred on separate trees will be lumped together, so a C>G that happened 50 years ago in Papua New Guinea will be grouped with a C>G that happened 100 years ago in France. I'm replacing genealogical relationships with genetic relationships.
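
A sketch of the allele-counting approach on the same example, again in plain Python with hypothetical names (`children`, `clade_size`, `count_alleles`) rather than the data source's own functions. The ages attached to the mutation events are made-up numbers, included only to show how the oldest event for an allele would be picked.

```python
from collections import Counter

ANCESTRAL = "C"

# Hypothetical toy tree: each internal node maps to its children.
children = {
    "root": ["anc1000"],
    "anc1000": ["rest600", "anc400"],
    "anc400": ["rest300", "anc100"],
}

# Collapsed clades: how many sampled people sit directly under each of these nodes.
clade_size = {"rest600": 600, "rest300": 300, "anc100": 100}

# Mutation events at one site: node -> (resulting allele, made-up age in years).
mutations = {
    "anc1000": ("G", 50000),
    "anc400": ("C", 30000),
    "anc100": ("G", 10000),
}


def count_alleles(node="root", state=ANCESTRAL, tally=None):
    """Walk down the tree carrying the current allele, and tally people by final allele."""
    tally = Counter() if tally is None else tally
    if node in mutations:
        state = mutations[node][0]  # a mutation on this branch changes the state
    tally[state] += clade_size.get(node, 0)
    for child in children.get(node, []):
        count_alleles(child, state, tally)
    return tally


tally = count_alleles()

# Keep only non-ancestral alleles, dated by their oldest mutation event.
derived = {allele: n for allele, n in tally.items() if allele != ANCESTRAL}
oldest_age = {
    allele: max(age for a, age in mutations.values() if a == allele)
    for allele in derived
}
print(derived, oldest_age)
```

Here every person is counted exactly once: 700 carry G, 300 are back at the ancestral C and drop out, and the G allele takes the age of the oldest C>G event.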
posted by clawsoon at 12:00 PM on October 7 [1 favorite]

Corrected version is up:

24 hours, 703,579 genetic variants, 2 million years

As mentioned in my previous comment, I've switched from mapping mutation events to mapping alleles, with the date given being that of the earliest estimated mutation event for that allele.

One interesting consequence of switching from "only include a mutation if it's found in more than one person" to "only include an allele if it's found in more than one person" is that I'm now showing a whole bunch of time=0 mutations. More than 10% of my total are alleles found in more than one person which were inferred to be the result of independent mutations for each person.

You can see them in the last 3 hours of the 5th video, labelled "Less than 25 years". Many of them are geographically close (multiple people in Japan, say, or Gambian Mandinka and Yoruba, or San and Ju|'hoan) in a way which suggests that some (many?) of the inferred separate mutation events may actually have been a single shared mutation event.
posted by clawsoon at 6:05 AM on October 12

really dig the sonification
posted by 20 year lurk at 11:57 AM on October 14 [1 favorite]

Thanks! I tried it on a bunch of different MIDI instruments, and it was fascinating what a different character each one gave it. Piano made it sound aggressively avant garde. Organ was just a mushy blob. And on the harp, I concluded, you can play pretty much whatever random thing you'd like and it'll sound nice.
posted by clawsoon at 1:00 PM on October 14 [1 favorite]

i note it as harp in the higher register and upon the occasional arpeggio, but listening, it strikes me as exceptionally regular up-tempo kalimba.
posted by 20 year lurk at 1:26 PM on October 14 [1 favorite]

Interesting. I'm not sure what the ultimate source of the harp samples is, other than that they're the default soundfont that comes with Timidity++ on Debian. It'd be interesting to hear it with really high-quality instrument samples.

Or have an actual harp (or actual kalimba) play really fast for 24 hours. :-)
posted by clawsoon at 2:30 PM on October 14 [1 favorite]
