May 6, 2022: Thesis Defense: Statistical Methods for Improving Data Quality in Modern RNA Sequencing Experiments, Zijian Ni, PhD Candidate, Kendziorski Lab
- Abstract: Single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) are two groundbreaking and widely used approaches of the last decade, allowing researchers to quantify gene expression at single-cell resolution or to profile gene activity patterns in 2-dimensional space across tissue. While useful, data collected from these techniques always come with noise, and appropriate filtering and cleaning are required for reliable downstream analyses. In this dissertation, I investigate multiple quality-related issues in scRNA-seq and ST experiments, and I develop, implement, evaluate and apply statistical methods to adjust for them. A unifying theme of this work is that all these methods aim at improving data quality and allowing for better power and precision in downstream analyses.
An important challenge in pre-processing scRNA-seq data is distinguishing barcodes associated with real cells from those binding background noise. Existing methods test barcodes individually and consequently do not leverage the strong cell-to-cell correlation present in most datasets. To improve cell detection, we introduce CB2, a cluster-based approach for distinguishing real cells from background barcodes. As demonstrated in simulated and case study datasets, CB2 has increased power for identifying real cells, which allows for the identification of novel subpopulations and improves the precision of downstream analyses.
Recent spatial transcriptomics experiments utilize slides containing thousands of spots with spot-specific barcodes that bind mRNA. Ideally, unique molecular identifiers at a spot measure spot-specific expression, but this is often not the case in practice owing to bleeding from nearby spots, an artifact we refer to as spot swapping. We propose SpotClean, a probabilistic model to adjust for spot swapping. SpotClean increases the power and precision of downstream analyses, with highlights in improved cancer marker signals, tumor/normal delineation, and tissue annotation in cancer studies.
May 6, 2022: BDS PhD Student Rotation Presentations, Jiren Sun, Parth Khatri, Nathan Kolbow, Wallance Wei, Colin Longhurst
- Seminar Poster
- Presentation Titles:
- Jiren Sun, BDS PhD Student: “Statistical Methods for Analyzing Stepped Wedge Cluster Randomized Trials: A Selective Review”
Mentor: Jiwei Zhao
- Parth Khatri, BDS PhD Student: “Comparing Normalization Methods for Spatial Transcriptomics”
Mentor: Christina Kendziorski
- Nathan Kolbow, Statistics Biostatistics PhD Student: “Random Sample Network Consensus”
Mentor: Claudia Solis-Lemus
- Wallance Wei, BDS PhD Student: “Prediction performance comparison between absolute abundant data and relative abundant data in microbial community”
Mentor: Guanhua Chen
- Colin Longhurst, BDS PhD Student: “Hierarchical models for polychotomous data: Examining medical study participation in African Americans”
Mentor: Rick Chappell
May 2, 2022: Spring MS Graduation Presentations
April 29, 2022: Network Functional Varying Coefficient Model; Dr. Yanyuan Ma, Department of Statistics, Penn State University
- Seminar Poster
- Abstract: We consider functional responses with network dependence observed for each individual at irregular time points. To model both the interindividual dependence and within-individual dynamic correlation, we propose a network functional varying coefficient (NFVC) model. The response of each individual is characterized by a linear combination of responses from its connected nodes and its exogenous covariates. All the model coefficients are allowed to be time dependent. The NFVC model adds to the richness of both the classical network autoregression model and the functional regression models. To overcome the complexity caused by the network interdependence, we devise a special nonparametric least-squares-type estimator, which is feasible when the responses are observed at irregular time points for different individuals. The estimator takes advantage of the sparsity of the network structure to reduce the computational burden. To further conduct the functional principal component analysis, a novel within-individual covariance function estimation method is proposed and studied. Theoretical properties of our estimators, which involve techniques related to empirical processes, nonparametrics, functional data analysis and various concentration inequalities, are analyzed. We analyze a social network dataset to illustrate the power of the proposed procedure.
April 22, 2022: Deep learning oracles for genomic discovery; Dr. Anshul Kundaje, Assistant Professor, Departments of Genetics and Computer Science, Stanford University
- Seminar Poster
- Abstract: The human genome sequence contains the fundamental code that defines the identity and function of all the cell types and tissues in the human body. Genes are functional sequence units that encode proteins. But they account for just about 2% of the 3-billion-base-pair human genome sequence. What does the rest of the genome encode? How is gene activity controlled in each cell type? Where do the regulatory control elements lie and what is their sequence composition? How do variants and mutations in the genome sequence affect cellular function and disease? These are fundamental questions that remain largely unanswered. The regulatory code that controls gene activity is encoded in the DNA sequence of millions of cell-type-specific regulatory DNA elements in the form of functional sequence syntax. This regulatory code has remained largely elusive despite exciting developments in experimental techniques to profile molecular properties of regulatory DNA. To address this challenge, we have developed high-performance neural networks that can learn de novo representations of regulatory DNA sequence to map genome-wide molecular profiles of protein-DNA interactions and chromatin state at single-base resolution across diverse cellular contexts. We have developed methods to interpret DNA sequences through the lens of the models and extract local and global predictive syntactic patterns, revealing many insights into the regulatory code. Our models also serve as in silico oracles to predict the effects of natural and disease-associated genetic variation, i.e., how differences in DNA sequence across healthy and diseased individuals are likely to affect molecular mechanisms associated with common and rare diseases. These models enable optimized design of genome perturbation approaches to decipher functional properties of DNA and variants and serve as a powerful lens for genomic discovery.
April 1, 2022: Genetic regulation: lessons from the human transcriptome; Associate Professor Barbara Stranger, Department of Pharmacology, Northwestern University
- Seminar Poster
- Abstract: During the past decade, studies have demonstrated genetic effects on the transcriptome in humans and linked these regulatory mechanisms to complex traits and diseases. In this talk, I will present analyses demonstrating the context specificity of these genetic effects, with special focus on the Genotype-Tissue Expression (GTEx) data, comprised of 15,201 RNA-sequencing samples from 49 tissues of 838 postmortem donors. Genetic associations for gene expression and splicing in cis and trans demonstrate that regulatory associations are found for almost all genes, and contribute to allelic heterogeneity and pleiotropy of complex traits. Importantly, we have identified associations that vary by tissue, cell-type composition, sex, and donor ancestry. We have focused efforts on understanding the impact of biological sex on the transcriptome to investigate the possible basis of the observation that many complex human phenotypes exhibit sex-differentiated characteristics. Here, I will present an extensive catalog comprising sex differences in gene expression and its genetic regulation in the GTEx data. This work demonstrates that sex strongly influences gene expression levels and cellular composition of tissue samples across the human body. The effect of sex on gene expression is widespread, suggesting that many, if not most, biological processes are impacted by sex effects on the transcriptome. We expand the identification of cis-eQTLs with sex-differentiated effects by performing a genotype-by-sex interaction eQTL analysis and identify 369 sex-biased eQTLs (sb-eQTLs). By integrating sb-eQTLs with genome-wide association study data, we identify dozens of gene-trait associations that are driven by genetic regulation in a single sex, including novel associations not detected with sex-agnostic approaches.
Using the GTEx data as a reference, we have begun an analogous analysis of sex differences in the transcriptome across 25 tumor types from The Cancer Genome Atlas (TCGA). I will discuss challenges encountered when applying the same statistical approaches to the heterogeneous cancer transcriptomes, and describe some possible solutions. Collectively, our integrative analyses provide the most comprehensive characterization of the human transcriptome across tissues and by sex to date, with important implications for complex traits. Our newest efforts with TCGA data are revealing important benchmarking of statistical approaches for removing unwanted sources of variation in heterogeneous data, using tumor transcriptomics data as an example.
April 8, 2022: Model misspecification in microbiome studies; Assistant Professor Amy Willis, Department of Biostatistics, University of Washington
- Seminar Poster
- Abstract: The composition of bacterial taxa in a microbiome is an important parameter to estimate given the critical role that microbiomes play in human and environmental health. By analyzing data from artificially constructed microbiomes of known composition, we show that high-throughput sequencing distorts the true composition of microbial communities. We propose a statistical model for microbiome data that reflects this observation, and algorithms to estimate model parameters. We conclude with examples of the utility of the method, and recommendations for the design and analysis of microbiome studies.
April 1, 2022: Integrating Multimodal Data to Identify Differences in Farm and Nonfarm Cohorts in Early Childhood; Assistant Professor Irene Ong, Department of Biostatistics & Medical Informatics and Department of Obstetrics & Gynecology, University of Wisconsin-Madison
- Seminar Poster
- Abstract: The inception of immune-mediated disorders, which have increased worldwide, typically occurs during early childhood and leads to chronic and lifelong diseases. Recent reports show that children exposed to microbes from pets or farm animals, or from traditional communities such as the Amish, have low and lower rates, respectively, of allergic and immune-related diseases. The infant gut microbiome, particularly during the first 100 days of life, influences development of neonatal immunity; however, the precise microbes and composition that differentiate remain unknown. We compared stool microbiomes from Wisconsin infants from three levels of farm-related lifestyles: traditionally farming Amish, rural farming, and rural non-farming. We hypothesized that the gut microbiota communities of the groups would vary with the level of farming exposures, and that the Amish cohort would harbor unique microbes compared to the non-Amish infants.
March 25, 2022: A Journey of Understanding Nonignorable Missingness and Some Reflections, Assistant Professor Jiwei Zhao, Department of Biostatistics & Medical Informatics
- Seminar Poster
- Abstract: Nonignorable missing data are commonly seen in biomedical studies and social sciences research; however, developing solid statistical methodology has been notoriously difficult due to some intrinsic challenges of nonignorability. In this talk, I will first introduce some fundamental differences between ignorability and nonignorability. Then, I will present two methods which can produce asymptotically valid estimates under nonignorability: the first is based on the conditional independence assumption while the second relies on the geometric structure of the semiparametric model. I will also point out some connections between the two methods. Finally, I will discuss some relations of this line of research with sensitivity analysis as well as some statistical problems under distribution shifts.
March 18, 2022: no seminar – spring break
March 11, 2022: Multi-modal data integration, interpretation, and prediction for understanding brain functional genomics, Assistant Professor Daifeng Wang, Department of Biostatistics & Medical Informatics
- Abstract: Robust phenotype-genotype associations have been established for brains and brain disorders. However, understanding the cellular and molecular causes from genotype to phenotype remains elusive. To this end, recent scientific projects have generated large multi-modal datasets such as various omics data at the single-cell and bulk-tissue levels. However, integrating these large-scale multi-modal data and discovering underlying functional mechanisms are still challenging. To address these challenges, machine learning has been broadly applied to analyze and interpret multi-modal data. In this talk, I will first introduce multiview learning—an emerging machine learning field—and envision its potentially powerful applications for integrating and interpreting multi-modalities. In particular, we have proposed a framework called multiview empirical risk minimization (MV-ERM) for learning multi-modal data heterogeneity and revealing cross-modal patterns. Second, I will introduce our recent multiview learning applications to emerging single-cell multi-modal data in brains.
For instance, recent Patch-seq data reveal multi-scale characteristics of neuronal cells, such as transcriptomics, morphology, and electrophysiology. We benchmarked multiple machine learning methods for data integration to align gene expression and electrophysiological data of single neuronal cells in the mouse brain from the BRAIN Initiative. We found that nonlinear manifold learning outperforms other methods. After manifold alignment, the cells form clusters corresponding to transcriptomic and morphological cell types, suggesting a strong nonlinear relationship between gene expression and electrophysiology at the cell-type level. To further understand how genes and electrophysiology work together in different cellular phenotypes, we developed an interpretable regularized learning model, deepManReg, to predict cellular phenotypes from single-cell multi-modal data. deepManReg employs deep neural networks to learn cross-modal manifolds and then align multi-modal features onto a common latent space. Also, deepManReg uses cross-modal manifolds as a feature graph to regularize the classifiers to improve phenotype predictions and prioritize the multi-modal features and cross-modal interactions for the phenotypes. We applied deepManReg to the transcriptomics and electrophysiological data for neuronal cells in the mouse visual cortex. We show that deepManReg improves the prediction of cellular phenotypes such as cortical layers and prioritizes genes and electrophysiological features for the phenotypes. If time permits, I will briefly introduce other works on genotype-phenotype prediction via multiomics for brain disorders.
February 25, 2022: Reimagining gene-environment interaction in the omnigenic era, with BMI Assistant Professor Qiongshi Lu
- Seminar Poster
- Abstract: Environments are often ignored or treated as nuisance parameters in human complex trait genetics research. However, in epidemiology, social sciences, and clinical research, there is great interest in quantifying the heterogeneity of the effect of an exposure (e.g., a treatment, a major policy change, a natural experiment), and more specifically, how it interacts with genetics. The typical statistical methodology used in gene-environment (GxE) interaction analysis (i.e., linear models with main effects of G and E and the interaction GxE) has a number of limitations, especially in the ‘omnigenic’ era (we have now realized that most human traits have a large number of non-zero but weak genetic effects). In this talk, I will introduce several recent statistical advances that reimagine the GxE analysis for ‘omnigenic’ human traits. First, I will introduce QUAIL, a novel, quantile-regression-based framework to identify genetic variants associated with the variability (rather than the mean) of human traits. I will demonstrate that robust findings of variance quantitative trait loci (vQTL) can effectively prioritize candidate genetic variants in GxE studies, and polygenic scores produced from vQTL effects (vPGS) can aggregate information across numerous genetic loci and improve both statistical power and biological interpretability of GxE studies. Next, I will discuss very recent work that links two seemingly unrelated topics: GxE interaction and genetic correlation estimation. I will illustrate that current tools used for genetic correlation estimation provide an ideal alternative strategy for quantifying GxE interactions and will have a number of advantages compared to a traditional linear model with interaction effects. I will show plenty of empirical examples that involve body mass index, education reform in the UK, and sex differences of the genetic basis of many complex traits to showcase the performance of these new statistical advances.
Overall, these new tools address critical limitations in existing methodologies and may have broad applications in future GxE studies.
February 23, 2022 – Special Seminar: How Statisticians can get Involved with Multi-Center Clinical Trials – Q&A after the talk – with BMI founding chair, Dr. Dave DeMets
February 18, 2022: Decorrelated Local Linear Estimator: Inference for Non-linear Effects in High-dimensional Additive Models, Zijian Guo, Assistant Professor, Rutgers University
- Additive models play an essential role in studying non-linear relationships. Despite many recent advances in estimation, there is a lack of methods and theories for inference in high-dimensional additive models, including confidence interval construction and hypothesis testing. Motivated by inference for non-linear treatment effects, we consider the high-dimensional additive model and make inference for the derivative of the function of interest. We propose a novel decorrelated local linear estimator and establish its asymptotic normality. The asymptotic variance of our proposed estimator matches the optimal rate in the univariate setting. The main novelty is the construction of the decorrelation weights, which is instrumental in reducing the error inherited from estimating the high-dimensional additive model. We construct the confidence interval for the function derivative and conduct the related hypothesis testing. We demonstrate our proposed method through large-scale simulation studies and apply it to motif regression. This is based on joint work with Wei Yuan and Cun-Hui Zhang.
February 11, 2022: An Introduction to the Life of Biostatisticians at Vertex with Josh Chen, PhD; Tu Xu, PhD; and Yaohua Zhang, PhD of Vertex (virtual)
- Vertex is among the most innovative companies in the pharmaceutical industry, having discovered and developed the only approved therapies for cystic fibrosis, treating the underlying cause of the disease, and is now expanding to multiple therapeutic areas such as cutting-edge cell and genetic therapies.
- Biostatistics at Vertex is a scientific core development team excelling in creative development strategies, innovative clinical trial designs, scientifically appropriate results interpretation, quantitative regulatory interaction, and statistical methodology advancement.
January 28, 2022: Learning Individualized Treatment Rule for a Target Population with Guanhua Chen, BMI (virtual)
- Current literature focuses on deriving individualized treatment rules (ITRs) from a single source population. We consider the setting when the source population may differ from the target population of interest. This problem is also related to transfer learning in the machine learning community. We assume subject covariates are available from both populations, but treatment and outcome data are only available from the source population. Although adjusting for differences between source and target populations can potentially lead to an improved ITR for the target population, it can substantially increase the variability in ITR estimation. To address this dilemma, we develop a weighting framework that aims to tailor an ITR for a given target population and protect against high variability due to superfluous covariate shift adjustments. Our method seeks covariate balance over a nonparametric function class characterized by a reproducing kernel Hilbert space. We show that the proposed method encompasses the so-called importance weights and overlap weights as two extreme cases, allowing for a better bias-variance trade-off. Numerical examples demonstrate that using our weighting methods greatly improves ITR estimation for the target population compared with other weighting methods.