Many fields use the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve as standard evaluations of binary classification methods. Analysis of ROC and PR, however, often gives misleading and inflated performance evaluations, especially with an imbalanced ground truth. In our preprint, "The MCC-F1 curve: a performance evaluation technique for binary classification", we propose the MCC-F1 curve to address these drawbacks. The MCC-F1 curve combines two informative single-threshold metrics, Matthews correlation coefficient (MCC) and the F1 score. The MCC-F1 curve more clearly differentiates good and bad classifiers, even with imbalanced ground truths. We also introduce the MCC-F1 metric, which provides a single value that integrates many aspects of classifier performance across the whole range of classification thresholds. This project is an R package that plots MCC-F1 curves and calculates related metrics.
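To make the idea concrete, here is a minimal Python sketch of how the points of an MCC-F1 curve could be computed. It is not the R package's API: the function names are hypothetical, and it assumes MCC is rescaled to the unit interval as (MCC + 1) / 2 so that both axes run from 0 to 1.

```python
import math

def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; defined as 0 when the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def mcc_f1_points(scores, labels):
    """One (F1, unit-normalized MCC) point per distinct score used as a threshold."""
    points = []
    for threshold in sorted(set(scores)):
        tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
        unit_mcc = (mcc(tp, fp, tn, fn) + 1) / 2  # assumption: rescale [-1, 1] to [0, 1]
        points.append((f1_score(tp, fp, fn), unit_mcc))
    return points

# Toy data: a classifier that separates the two classes perfectly.
scores = [0.1, 0.2, 0.8, 0.9]
labels = [0, 0, 1, 1]
points = mcc_f1_points(scores, labels)
```

Plotting these points with F1 on one axis and the rescaled MCC on the other gives the curve. A classifier that separates the classes perfectly reaches the corner (1, 1) at some threshold.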
Our new method predicts where transcription factors and chromatin factors bind in a given cell type. It uses new kinds of data, such as chromatin factor binding in other cell types, and learns the association between gene expression patterns and chromatin factor binding patterns. We've released free software and a track hub that loads our predictions for 36 chromatin factors in 33 human tissue types into the UCSC Genome Browser.
What does critical data science add to our understanding of sexual harassment in academia?
The second major version of the semi-automated genome annotation software Segway is now available! It now runs on any Linux system and no longer needs a fancy compute cluster with Grid Engine or LSF. You can also install it and all its dependencies with a single Bioconda command:
conda install -c bioconda segway
It includes fancy new modeling methods like mixtures of Gaussians. Turns out a single Gaussian isn't the best distribution for genomic signal data. Who knew?
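As a rough illustration of why a mixture can fit signal data better than a single Gaussian, the Python sketch below fits both to synthetic bimodal data and compares log-likelihoods. This is not Segway's implementation: the data, the function names, and the simple two-component EM procedure are all assumptions made for the example.

```python
import math
import random

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def mean_and_sd(data):
    """Maximum-likelihood mean and standard deviation."""
    mean = sum(data) / len(data)
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))
    return mean, sd

def single_gaussian_loglik(data):
    """Log-likelihood of the data under one fitted Gaussian."""
    mean, sd = mean_and_sd(data)
    return sum(math.log(normal_pdf(x, mean, sd)) for x in data)

def mixture_loglik(data, weights, means, sds):
    """Log-likelihood of the data under a Gaussian mixture."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)))
        for x in data
    )

def fit_two_gaussians(data, iterations=50):
    """Fit a two-component mixture with a basic EM loop."""
    _, sd0 = mean_and_sd(data)
    weights, means, sds = [0.5, 0.5], [min(data), max(data)], [sd0, sd0]
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean, and sd of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsed component
    return weights, means, sds

# Synthetic "signal": mostly background near 0, with a smaller group near 8.
random.seed(0)
data = [random.gauss(0, 1) for _ in range(300)] + [random.gauss(8, 1) for _ in range(100)]

single_ll = single_gaussian_loglik(data)
weights, means, sds = fit_two_gaussians(data)
mixture_ll = mixture_loglik(data, weights, means, sds)
```

On data like this, one broad Gaussian has to cover both groups at once, so it assigns low density everywhere, while the mixture can place one narrow component on each group.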
A Twitter bot that uses machine learning to define invented words, posting truncated definitions on Twitter and complete ones on Tumblr. Tweet @lexiconjure a made-up word, and it'll define it for you.
Neuralsnap generates an image caption using a model I trained (convolutional and recurrent neural networks), then uses another character-level recurrent neural net that I trained on ~40 MB of poetry to expand the caption into a poem. (In this example, generated from a Rothko painting, the red text is the direct image caption, and the rest is the poetic expansion.)
The free Segway software package contains a novel method for analyzing multiple tracks of functional genomics data. Our method uses a dynamic Bayesian network (DBN) model, which enables it to analyze the entire genome at 1-bp resolution even in the face of heterogeneous patterns of missing data. This method is the first application of DBN techniques to genome-scale data and the first genomic segmentation method designed for use with the maximum-resolution data available from ChIP-seq experiments without downsampling. Our software has extensive documentation and was designed from the outset with external users in mind. Researchers at other universities and institutes have already installed and used Segway for their own projects.
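One way to see how a probabilistic sequence model can cope with missing data is the forward algorithm for a hidden Markov model, the simplest kind of DBN. The Python sketch below is not Segway's model: it assumes a single track, Gaussian emissions, and hypothetical parameter values. Missing observations (written as None) are marginalized out by giving them an emission likelihood of 1, so they neither favor nor penalize any state.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def forward_loglik(observations, start, transition, means, sds):
    """Log-likelihood of a sequence under a Gaussian-emission HMM.

    An observation of None is treated as missing and contributes a factor of 1.
    """
    n_states = len(start)

    def emission(state, obs):
        if obs is None:
            return 1.0  # missing value: marginalize out the emission
        return normal_pdf(obs, means[state], sds[state])

    alpha = [start[s] * emission(s, observations[0]) for s in range(n_states)]
    loglik = 0.0
    for obs in observations[1:]:
        # Rescale at each position so long sequences do not underflow.
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
        alpha = [
            sum(alpha[p] * transition[p][s] for p in range(n_states)) * emission(s, obs)
            for s in range(n_states)
        ]
    return loglik + math.log(sum(alpha))

# Hypothetical two-state model: state 0 is low signal, state 1 is high signal.
start = [0.5, 0.5]
transition = [[0.9, 0.1], [0.2, 0.8]]
means = [0.0, 5.0]
sds = [1.0, 1.0]

with_gap = forward_loglik([0.1, None, 5.2, 4.8], start, transition, means, sds)
all_missing = forward_loglik([None, None, None], start, transition, means, sds)
```

A sequence with no observed values has probability 1 under this scheme, so its log-likelihood is 0. The transitions still carry state information across a gap between observed positions.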