In honor of International Women's Day, I decided to do an experiment to see what GPT-3 might reveal about human gender bias. And boy did it reveal a lot!
AI Jukebox is a fascinating project from OpenAI that uses cutting-edge neural networks to perform all sorts of musical magic -- it can take a clip of a song and continue it in a new way, sing text lyrics in any artist's voice, or make a song sound like it's being sung by someone else. My favorite? Tell it to generate music by an artist without any other info, and it will produce a gibberish song with nonsense lyrics... that still sounds 100% real and just like the actual singer or band with their unique style. You can hear instruments, melodies, sometimes an audience, the breathing of the lead singer -- but the whole thing is generated completely from scratch by the AI, not with samples or digital sounds. It's not flawless -- some of the songs ramble, with glitchy effects or a mutating voice. But these just add to the vibe, like it's from a dream or a parallel universe. I went through their database to find the best examples of these tracks from the most famous artists, then turned them into an audio quiz on Sporcle -- complete with AI-generated art of the artists that I made to serve as hints in the second round. How many of the artists can you name?
Many fields use the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve as standard evaluations of binary classification methods. Analysis of ROC and PR, however, often gives misleading and inflated performance evaluations, especially with an imbalanced ground truth. In our preprint, "The MCC-F1 curve: a performance evaluation technique for binary classification", we propose the MCC-F1 curve to address these drawbacks. The MCC-F1 curve combines two informative single-threshold metrics, Matthews correlation coefficient (MCC) and the F1 score. The MCC-F1 curve more clearly differentiates good and bad classifiers, even with imbalanced ground truths. We also introduce the MCC-F1 metric, which provides a single value that integrates many aspects of classifier performance across the whole range of classification thresholds. This project is an R package that plots MCC-F1 curves and calculates related metrics.
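To make the idea concrete, here is a minimal Python sketch of how points on an MCC-F1 curve could be computed. It is not the R package's code. The function names are hypothetical, MCC is rescaled from [-1, 1] to [0, 1] as (MCC + 1) / 2, and treating MCC as 0 when its denominator is zero is an assumption made for this sketch.

```python
import math

def confusion(labels, scores, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are called positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y:
            tp += 1
        elif predicted_positive and not y:
            fp += 1
        elif not predicted_positive and y:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient, in [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumption: undefined MCC is treated as 0
    return (tp * tn - fp * fn) / denom

def f1(tp, fp, fn):
    """F1 score, the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc_f1_points(labels, scores):
    """One (F1, rescaled MCC) point per distinct score used as a threshold."""
    points = []
    for t in sorted(set(scores)):
        tp, fp, tn, fn = confusion(labels, scores, t)
        points.append((f1(tp, fp, fn), (mcc(tp, fp, tn, fn) + 1) / 2))
    return points

if __name__ == "__main__":
    labels = [1, 1, 0, 0]
    scores = [0.9, 0.8, 0.2, 0.1]
    for point in mcc_f1_points(labels, scores):
        print(point)
```

Plotting these points with F1 on one axis and rescaled MCC on the other gives the curve. A perfect classifier reaches the corner where both values equal 1.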
Our new method predicts transcription factor and chromatin factor locations in a cell type using new kinds of data, such as chromatin factor binding in other cell types, and by learning the association between gene expression patterns and chromatin factor binding patterns. We've made free software available, along with a track hub that can load our predictions for 36 chromatin factors in 33 human tissue types into the UCSC Genome Browser.
What does critical data science add to our understanding of sexual harassment in academia?
The second major version of semi-automated genome annotation software Segway is now available! It now runs on any Linux system, no longer needing a fancy compute cluster with Grid Engine or LSF. Now you can also install it and all its dependencies with a single Bioconda command:
conda install -c bioconda segway
It includes fancy new modeling methods like mixtures of Gaussians. Turns out a single Gaussian isn't the best distribution for genomic signal data. Who knew?
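For anyone wondering why a mixture helps, here is a toy Python sketch, not Segway code. The function names and the numbers are made up for illustration. A mixture density is a weighted sum of Gaussian densities, so it can have two bumps where a single Gaussian can only have one.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def mixture_pdf(x, weights, means, sds):
    """Density of a Gaussian mixture; weights are assumed to sum to 1."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, sds))

if __name__ == "__main__":
    # Hypothetical two-component mixture with bumps at -2 and 2.
    weights, means, sds = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]
    for x in (-2.0, 0.0, 2.0):
        print(x, mixture_pdf(x, weights, means, sds))
```

With these made-up parameters the density is higher at each bump than in the valley between them, a shape no single Gaussian can reproduce.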
A Twitter bot that uses machine learning to define invented words, posting truncated definitions on Twitter and complete ones on Tumblr. Tweet @lexiconjure a made-up word, and it'll define it for you.
Neuralsnap generates an image caption using a model I trained (convolutional and recurrent neural networks), then uses another character-level recurrent neural net that I trained on ~40 MB of poetry to expand the caption into a poem. (In this example, generated from a Rothko painting, the red text is the direct image caption, and the rest is the poetic expansion.)
The free Segway software package contains a novel method for analyzing multiple tracks of functional genomics data. Our method uses a dynamic Bayesian network (DBN) model, which enables it to analyze the entire genome at 1-bp resolution even in the face of heterogeneous patterns of missing data. This method is the first application of DBN techniques to genome-scale data and the first genomic segmentation method designed for use with the maximum resolution data available from ChIP-seq experiments without downsampling. Our software has extensive documentation and was designed from the outset with external users in mind. Researchers at other universities and institutes have already installed and used Segway for their own projects.
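As a rough illustration of how a probabilistic model can tolerate missing data, here is a toy Python sketch. It is not Segway's implementation. It uses a two-state hidden Markov model, the simplest kind of DBN, with Gaussian emissions, and it assumes a missing observation (written as None) simply contributes no emission term. All names and parameter values are hypothetical.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def forward_likelihood(obs, start, trans, means, sds):
    """Likelihood of a sequence under an HMM, via the forward algorithm.

    obs may contain None for positions with no data; those positions are
    marginalized out by using an emission factor of 1 (an assumption here).
    """
    n = len(start)

    def emit(state, x):
        return 1.0 if x is None else gaussian_pdf(x, means[state], sds[state])

    # alpha[s] = probability of the data so far, ending in state s
    alpha = [start[s] * emit(s, obs[0]) for s in range(n)]
    for x in obs[1:]:
        alpha = [
            emit(j, x) * sum(alpha[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]
    return sum(alpha)

if __name__ == "__main__":
    start = [0.5, 0.5]
    trans = [[0.9, 0.1], [0.2, 0.8]]
    means, sds = [0.0, 3.0], [1.0, 1.0]
    # The gap in the middle is skipped rather than imputed or discarded.
    print(forward_likelihood([0.1, None, 2.9], start, trans, means, sds))
```

Because the missing position still takes part in the state transitions, the positions on either side of a gap stay linked, so nothing needs to be imputed or thrown away. This sketch omits the rescaling or log-space arithmetic that a long sequence would need to avoid numerical underflow.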