Math-Bio seminar: "Mutations, genetic identity, and data granularity"
I will talk about two studies in which new insights emerged once we worked at a different level of data granularity. First, in collaboration with Sebastian Zoellner, we analyzed ~36 million extremely rare variants (defined as singletons in ~4,000 individuals) uniformly ascertained in an as-yet-unpublished whole-genome sequencing dataset. Our goal was to estimate mutation rate variation across the genome and to identify genomic and sequence-based predictors of such variation. We found that some genomic features, such as H3K36me3 peaks and CpG islands, can either increase or decrease mutation rates depending on the adjacent sequence context. This shows that their impact on mutation rates cannot be understood by studying all mutation subtypes in aggregate. In the second study, in collaboration with Noah Rosenberg, we assessed the possibility of using an individual's microsatellite genotype data to find matching records in a database of SNP genotypes, even when the two datasets share no markers. Using ~1,000 samples genotyped on both the 13 tandem repeat markers in the FBI standard forensic panel and 650K common variants routinely typed in GWAS, we demonstrate the feasibility of cross-identifying individuals between the criminal justice system on one hand and genetic or ancestry research on the other. These results add to the list of examples where group-level patterns cannot always be transferred to the individual level, or vice versa. Choosing the right level of granularity for an inquiry thus continues to be one of the biggest challenges in data science.