See the Campus Event Calendar for details about upcoming seminars
Day and Time: Fridays from noon to 1 pm central time unless otherwise noted.
Location: UW Biotechnology Center Auditorium
Zoom: https://uwmadison.zoom.us/j/99879638765
Passcode: 343271
Further details should be available about a week before the seminar.
To subscribe to the BMI Seminar mailing list, email join-biostat-seminar@lists.wisc.edu.
Upcoming Spring 2025 Seminars
May 2, 2025
Title: Spring 2025 Rotation Presentations, Biomedical Data Science PhD Program
Zoom (NOTE DIFFERENT ZOOM LINK): https://uwmadison.zoom.us/j/95932720274
Speaker: Olivia Johnson, BDS PhD Student
Rotation Mentor: Juan Caicedo
Title: Nuclei Segmentation of 3D Organoid Microscopy Images
Speaker: Jihua Liu, BDS PhD Student
Rotation Mentor: Sunduz Keles
Title: Cell Clustering and Separation of Leukemia Multi-omics Single-cell Dataset
Speaker: Jin Mu, BDS PhD Student
Rotation Mentor: Guanhua Chen
Title: Benchmarking Machine Learning and Foundation Models for Microbiome-Based Disease Prediction
Speaker: Swathisri Venkatesh, BDS PhD Student
Rotation Mentor: Sunduz Keles
Title: Examining Multi-way Chromatin Interactions from Droplet Hi-C Data
Speaker: Brendan Joyce, BDS PhD Student
Rotation Mentor: Sushmita Roy
Title: Benchmarking Effectiveness of Single Cell Multimodal Integration Methods
Speaker: Peng Wu, BDS PhD Student
Rotation Mentor: Tom Cook
Title: Non-Fatal Event Analysis with Competing Risk of Death
Completed 2024/25 Seminars
September 6, 2024:
Speaker: Moo Chung, Dept of Biostatistics & Medical Informatics, UW-Madison
Title: Aligning Asynchronous Network Data through Persistent Homology
Abstract: We introduce a novel topological data analysis (TDA) approach for aligning asynchronous dynamic networks over time. Our method leverages persistent homology, which decomposes 0D topological features (connected components) and 1D topological features (loops) orthogonally. This decomposition enables the exact computation of the Wasserstein distance, a probabilistic version of optimal transport, into a squared Euclidean distance form with O(n log n) run time. Our scalable approach allows for localized matching of networks at the edge level, facilitating precise inference and learning. This method can reduce statistical variability by up to 500 times, enabling the detection of signals previously undetectable. We demonstrate the application of this method in aligning asynchronous human functional brain networks obtained from resting-state functional magnetic resonance imaging (rs-fMRI). Human brain activity at rest does not synchronize across subjects, making direct comparisons nearly impossible and posing a significant challenge to the clinical relevance of rs-fMRI. Our approach addresses this challenge by providing a workable solution that performs topological registration of time-varying networks. This talk is partially based on arXiv:2012.00675 (Annals of Applied Statistics) and arXiv:2201.00087 (PLOS Computational Biology).
September 13, 2024: Biomedical Data Science Student Summer 2024 Rotation Presentations
John Peters, BDS PhD student
- Title: Forging METL Stronger: Model User Experience
- Rotation Mentor: Professor Tony Gitter
Aurod Ounsinegad, BDS PhD student
- Title: Nonnegative Matrix Factorization Through Cone Collapsing
- Rotation Mentor: Professor Daniel Pimentel-Alarcon
Zhongxuan Sun, BDS PhD student
- Title: Causal regulatory network inference from Perturb-seq data
- Rotation Mentor: Professor Hyunseung Kang
Livvy Johnson, BDS PhD student
- Title: Predicting Gene Regulatory Networks for Leukemia Cell Clusters
- Rotation Mentor: Professor Sushmita Roy
Leo Jin, BDS MS student
- Title: A graph-based learning approach to predict the effects of gene perturbations on molecular phenotypes
- Rotation Mentor: Professor Mark Craven
September 20, 2024
Speaker: Will Rosenberger, George Mason University
Title: Casual Inference for Clinical Trails*: A Spellchecker’s Guide to Randomization Tests in Complex Settings
Abstract:
Sir Austin Bradford Hill, the developer of the first randomized clinical trial, was a proponent of simplicity in statistical analysis, and strongly emphasized careful study design as the critical component of all medical studies. While he didn’t mention randomization tests in his 1937 book, I believe he would have liked their simplicity and interpretability. Any inference procedure which assumes random sampling from a population ignores Fisherian principles regarding the analysis of designed experiments. And clinical trials are the quintessential designed experiment. While we hear quite often about preservation of type I error rates and, more recently, about causal inference, these are natural elements of a randomization test. We discuss these issues and demonstrate that randomization tests can be used for more complex settings, such as multiple (>2) treatment comparisons, analyses with missing outcome data, and subgroup analyses. It is interesting to note that the only cohort of statisticians NOT excited about randomization tests in this age of causal inference are the designers and conductors of randomized clinical trials! I will conclude with a few historical notes about Fisher and de Finetti.
*The two most often misspelled words during my term as Biometrics co-editor.
September 27, 2024 – CANCELLED
Speaker: Anuj Srivastava, Florida State University
Title: Statistical Shape Analysis of Complex Natural Structures
Zoom: https://uwmadison.zoom.us/j/9761550901
Abstract: Statistical modeling and analysis of structured data is a fast-growing field in Statistics and Data Science. Rapid advances in imaging techniques have led to tremendous amounts of data for analyzing imaged objects across several scientific disciplines. Examples include shapes of cancer cells, botanical trees, human biometrics, 3D genome, brain anatomical structures, crowd videos, nano-manufacturing, and so on. Shapes are relevant even in non-imaging data contexts, e.g., the shapes of COVID rate curves or the shapes of activity cycles in lifestyle data. Imposing statistical models and inferences on shapes seems daunting because the shape is an abstract notion and one requires precise mathematical representations to quantify shapes.
This talk has two parts. In the first part, I will present some recent developments in “elastic representations” of structures such as functions, curves, surfaces, and graphs. In the second part, I will focus on statistical analyses: computing shape summaries, estimation under shape constraints, hypothesis testing, time-series models, and regression models involving shapes.
October 4, 2024:
NOTE – LOCATION CHANGE – 1248 HSLC
Speaker: Daniel Pimentel-Alarcon, Dept of Biostatistics & Medical Informatics, UW-Madison
Title: Unsupervised Learning from Messy Data
Zoom: https://uwmadison.zoom.us/j/97615509019
Abstract: In this talk I will discuss some of the main challenges posed by small sample sizes, missing values, outliers, skewed groups, and sparsity patterns in Machine Learning, specifically in the unsupervised and semi-supervised setting. Most of these issues arise in modern applications of science, ranging from metagenomics to astronomy. I will also share some recent discoveries and strategies to mitigate these issues and a glimpse of some theoretical developments that might pave the road to the much-needed and coveted understanding of deep learning.
October 11, 2024
Speaker: Ava Amini, Microsoft
Title: Bridging Biophysics and AI to Optimize Protein Design
Virtual Only
Zoom: https://uwmadison.zoom.us/j/97615509019
Abstract:
Engineered proteins play increasingly essential roles in applications spanning pharmaceuticals, molecular tools, synthetic biology, and more. Deep generative models offer the ability to accelerate protein engineering for therapeutic and biological applications. Recently, a family of generative models called diffusion models has demonstrated the potential for unprecedented capability and control in de novo design. In this talk, we introduce biologically-grounded diffusion models for generation of protein structures and sequences.
We first share work in creating a new diffusion-based generative model that designs protein structures by mirroring the biophysics of the native protein folding process. To expand beyond the subset of protein biology captured in structural data, we reasoned that sequence – not structure – could serve as a universal design space for protein generation. We thus developed a general-purpose diffusion framework, EvoDiff, that combines evolutionary-scale data with the distinct conditioning capabilities of diffusion models for controllable protein design in sequence space alone. We envision that these modeling frameworks will enable new capabilities in protein engineering towards programmable, functional design.
October 18, 2024:
Speaker: Yuan Ji, University of Chicago
Title: A New Nonparametric Bayesian Model for Grouped Data with Applications to Clinical Trials Borrowing External Data
Zoom: https://uwmadison.zoom.us/j/97615509019
Abstract: We consider a class of nonparametric Bayesian models called the Shared Atoms Model (SAM) based on the dependent Dirichlet processes. SAM uses a simple idea of atom skipping to generate group-specific patterns of clustering, allowing atoms to be common, unique, and shared across multiple groups. In finite data sets, SAM clusters experimental units in a similar fashion, facilitating interpretable and flexible inference. We consider an application of SAM to a real-world data set with measurements of patients with atopic dermatitis.
October 25, 2024
Speaker: Ting Ye, University of Washington
Title: From Estimands to Robust Inference of Treatment Effect in Platform Trials
Zoom: https://uwmadison.zoom.us/j/97615509019
Abstract: A platform trial is an innovative clinical trial design that uses a master protocol (i.e., one overarching protocol) to evaluate multiple treatments in an ongoing manner and can accelerate the evaluation of new treatments. However, the flexibility that marks the potential of platform trials also creates inferential challenges. Two fundamental challenges are the precise definition of treatment effects and the robust and efficient inference on these effects. In this work, we make a key contribution by, for the first time, clearly stating how to construct a clinically meaningful estimand. This estimand characterizes the treatment effect as a contrast of the expected outcomes between two treatments in a population of concurrently eligible participants—the largest population that preserves the integrity of randomization. Then, we develop weighting and post-stratification methods for estimation of treatment effects with minimal assumptions. To fully leverage the efficiency potential of data from concurrently eligible participants, we also consider a model-assisted approach for baseline covariate adjustment to gain efficiency while maintaining robustness against model misspecification. We derive and compare asymptotic distributions of proposed estimators in theory and propose robust variance estimators. The proposed estimators are empirically evaluated in a simulation study and illustrated using the SIMPLIFY trial.
November 1, 2024
Speaker: Hannah Wayment-Steele, Visiting Assistant Professor, Dept of Biochemistry
Title: Predicting and Discovering Protein Dynamics
Zoom: https://uwmadison.zoom.us/j/97615509019
Abstract: The functions of biomolecules are often based on their ability to convert between multiple conformations. Recent advances in deep learning for predicting and designing single structures of proteins mean that the next frontier lies in how well we can characterize, model, and predict protein dynamics. In the first part of my talk, I will describe a simple adaptation of AlphaFold to predict multiple conformations. Combining the resulting “AFCluster” method and NMR dynamics experiments allowed us to learn more about the complete conformational landscape of KaiB, and how the slow interconversion that biology necessitates for circadian rhythms is encoded in its sequence. However, a major bottleneck for the field of predicting dynamics has been a lack of standardized datasets of experimental measurements of the timescales of protein motions, especially those on a micro-millisecond timescale where many biologically-relevant processes occur. In the second part of my talk, I will describe the development of large-scale benchmarks of dynamics from across multiple types of NMR experiments, and initial insights into whether it might already be possible to predict the presence of biologically-relevant motions.
November 7, 2024: DeMets Lectures: Janet Wittes, Wittes LLC
November 8, 2024: DeMets Lectures: Janet Wittes, Wittes LLC
November 15, 2024:
Speaker: Yaping Liu, Northwestern University
Title: Decoding the human genome by multi-omics in cell-free DNA and single-cells
Zoom: https://uwmadison.zoom.us/j/97615509019
Abstract: Epigenetic modifications, including DNA methylation, histone modifications, and three-dimensional (3D) genome topology, combine with genetic content to determine mammalian transcription factor (TF) binding and, thus, gene regulation. At present, we are limited by the number of simultaneous measurements that we can perform in the same DNA molecules and single cells. We developed single-cell Methyl-HiC to reveal the heterogeneity of DNA methylation, long-range DNA methylation concordance, and 3D genome in the same cells.
Recently, we improved this technology to jointly profile genetic variants, DNA methylation, chromatin accessibility, and 3D genome in the same DNA molecules and in single cells, in both cell lines and flash-frozen tissues.
To non-invasively monitor the dynamics of regulatory elements in vivo, we developed a set of computational methods to study the cellular epigenomes from cell-free DNA (cfDNA) fragmentation patterns. Specifically, we developed a computational method to de novo characterize the genome-wide cfDNA fragmentation hotspots, infer the open chromatin regions within cells, and boost the power of early cancer detection. We also developed a computational model to accurately predict DNA methylation and identify the tissues-of-origin in cfDNA from both high-coverage and low-coverage cfDNA whole-genome sequencing.
The experimental approaches in single-cell multi-omics and computational methods in cell-free DNA epigenomics developed in our lab will eventually pave the road for our understanding of the variation of cis-regulatory elements non-invasively across different physiological and pathological conditions.
November 22, 2024:
Speaker: Jennifer Clark Nelson, Kaiser Permanente Washington Health Research Institute
Title: Statistical methods for improving post-licensure vaccine safety surveillance.
Zoom: https://uwmadison.zoom.us/j/97615509019
Abstract: Improving statistical methods for post-licensure vaccine safety surveillance is critical for safeguarding public health and maintaining public trust in vaccination programs. This is especially important during pandemics like COVID-19 when vaccines are administered on a global scale at unprecedented speed. Many national vaccine safety surveillance efforts use electronic health records and insurance claims data from large multi-site health care data networks. I will summarize challenges that can arise when using these secondary data sources to conduct safety studies. I will also discuss statistical approaches designed to better detect rare adverse events in these settings. These include 1) adapting sequential methods from clinical trials to this observational database setting in order to ensure more rapid detection and 2) using natural language processing of clinical notes in combination with machine learning methods to improve the accuracy with which vaccine safety outcomes are identified. I will illustrate methods using example safety questions that have arisen within FDA’s Sentinel Initiative and the CDC’s Vaccine Safety Datalink monitoring systems.
November 29, 2024: No seminar – Thanksgiving holiday
December 5, 2024: Biomedical Data Science Student Fall 2024 Rotation Presentations
Zoom: https://uwmadison.zoom.us/j/94802702215
Location Change: 408 Service Memorial Institute (SMI)
Poster: Fall 2024 Rotation Presentations
Speakers and Titles:
- James Haddad – Utilizing clinical data to predict hepatic deterioration over time
- Rotation Mentor: Matthew Churpek
- Shan Leng – Epigenetic Aging: Literature review and unsupervised solution
- Rotation Mentor: Qiongshi Lu
- Mike Fromandi – Exploring shared information in biomedical data
- Rotation Mentors: JP Yu and Daniel Pimentel-Alarcon
- JJ Liu – CASSIA
- Rotation Mentor: Christina Kendziorski
- Xiaoxu Rong – Diagnosis of neonatal encephalopathy (NE) based on CTG data
- Rotation Mentor: Daniel Pimentel-Alarcon
December 6, 2024: Biomedical Data Science Student Fall 2024 Rotation Presentations
Zoom: https://uwmadison.zoom.us/j/94802702215
Location: UW Biotechnology Center Building
Poster: Fall 2024 Rotation Presentations
Speakers and Titles:
- Leo Jin – Some investigations on the sequence-to-function model Enformer
- Rotation Mentor: Qiongshi Lu
- Swathisri Venkatesh – Examining the gene regulatory programs of compensatory renal growth
- Rotation Mentor: Sushmita Roy
- Brendan Joyce – Unpacking the impact of pediatric microbiome on immune system development
- Rotation Mentor: Irene Ong
- Jin Mu – Transfer learning and genetics-powered estimation of heterogeneous treatment effect
- Rotation Mentor: Qiongshi Lu
- Peng Wu – A clinically intuitive approach to evaluating the performance of early warning scores
- Rotation Mentor: Anoop Mayampurath
- Livvy Johnson – Survival Analysis of VEGF/VEGFR inhibitor treatment on TP53-mutant cancer patients.
- Rotation Mentor: Irene Ong
Jan 24, 2025
Speaker: Irena (Irene) B. Helenowski, Loyola University Chicago
Title: Recent Applications and Validation of Nomograms Including an Example for a Repeated Measures Cause-Specific Cox Regression Model
Abstract: Nomograms are a tool that allows us to predict absolute risk of an event based on demographic and clinical factors fitted to a logistic regression or time-dependent Cox regression model. This tool can be further extended to longitudinal settings, as a generalized estimating equations (GEE) model or Cox regression model accounting for repeated measures. In this presentation, we give an example of a nomogram applied to liver transplant data associating the Model for End-Stage Liver Disease (MELD) score to time to death on waitlist based on a cause-specific Cox regression model, adapted from Birock et al. (2019) involving repeated measures. We also discuss validation approaches for the nomogram, including the Dxy, U, Q, and g statistics, ROC analysis, and calibration plots. We hope to show that review and application of these approaches associated with several types of models continue to be of paramount use for clinicians and that future implementation of this tool warrants further research.
Jan 31, 2025
Speaker: Jiyang Yu, St. Jude Children’s Research Hospital
Title: Spotiphy enables single-cell spatial whole transcriptomics via generative modeling
Abstract: Spatial transcriptomics (ST) has advanced our understanding of tissue regionalization by enabling the visualization of gene expression within whole tissue sections, but the approach remains dogged by the challenge of achieving single-cell resolution without sacrificing whole genome coverage. Here we present Spotiphy (Spot imager with pseudo single-cell resolution histology), a novel computational toolkit that transforms sequencing-based ST data into single-cell-resolved whole-transcriptome images via generative modeling. In evaluations with Alzheimer’s disease (AD) and normal mouse brains, Spotiphy delivers the most precise cellular compositions. For the first time, Spotiphy reveals novel astrocyte regional specification in mouse brains. It distinguishes sub-populations of DAM (Disease-Associated Microglia) located in different AD mouse brain regions. Spotiphy also identifies multiple spatial domains as well as changes in the patterns of tumor-tumor microenvironment interactions using human breast ST data. Spotiphy enables visualization of cell localization and gene expression in tissue sections, offering key insights into the function of complex biological systems.
Feb 7, 2025
Speaker: Ming Jiang, Indiana University-Indianapolis
Title: Beyond Accuracy: Building Human-centered Trustworthy Language Technologies
Abstract: Recent advances in natural language processing (NLP), particularly the rise of large language models (LLMs), have established them as vital AI tools across diverse domains such as communication, knowledge discovery, and complex decision-making. Despite their potential, NLP systems still face critical trustworthiness challenges particularly in high-stake, human-centered applications (e.g., personalized medicine, drug discovery, AI-assisted diagnostics), mostly due to oversimplified assumptions about human factors during model development, assessment, and deployment. This talk will showcase our research efforts to overcome these obstacles, focusing on three core themes: (1) exploring the potential of LLMs to bridge critical knowledge gaps across diverse communities; (2) designing human-aligned assessment methodologies to rigorously quantify the capabilities of vision-language generative models; and (3) developing a highly adaptable and interpretable LLM evaluation framework for diverse generation tasks. Looking ahead, I will share our vision for advancing trustworthy language technologies that can drive real-world impact in critical, human-centric applications such as public health and precision medicine.
Feb 14, 2025
Speaker: Anuj Srivastava, Florida State University
Title: Statistical Shape Analysis of Complex Natural Structures
Abstract: Statistical modeling and analysis of structured data is a fast-growing field in Statistics and Data Science. Rapid advances in imaging techniques have led to tremendous amounts of data for analyzing imaged subjects across several scientific disciplines. Examples include shapes of cancer cells, botanical trees, human biometrics, 3D genome, brain anatomical structures, crowd videos, nano-manufacturing, and so on. Shapes are relevant even in non-imaging data contexts, e.g., the shapes of COVID rate curves or the shapes of activity cycles in lifestyle data. Imposing statistical models and inferences on shapes seems daunting because the shape is an abstract notion and one requires precise mathematical representations to quantify shapes. This talk has two parts. In the first part, Dr. Srivastava will present recent developments in “elastic representations” of structures such as functions, curves, surfaces, and graphs. In the second part, he will focus on statistical analyses: computing shape summaries, estimation under shape constraints, hypothesis testing, time-series models, and regression models involving shapes.
Feb 21, 2025
Yu Shen, PhD, University of Texas MD Anderson Cancer Center
Title: Integrating Multiple Data Sources: Enhancing Precision Medicine Risk Prediction
Date: Friday, February 21, 2025
Time: Noon to 1 pm
Location: Biotech Center Auditorium or Zoom
Zoom: https://uwmadison.zoom.us/j/99879638765
Passcode: 343271
Abstract: Large cancer registries have become widely available in clinical research as a complement to improving the estimation of the precision of individual death risks for cancer patients. In particular, for rare types of cancer, it is desirable to combine multiple sources of data, such as primary cohort data and aggregate information derived from cancer registry databases. This integration of data can enhance statistical efficiency and accuracy in risk prediction, but it also presents statistical challenges due to the incomparability between different data sources. We develop adaptive estimation procedures that use the combined information to determine the degree of information borrowing from the aggregate data of the external resource. We apply the proposed methods to evaluate the long-term effects of several commonly used treatments for inflammatory breast cancer by tumor subtype, combining the MD Anderson inflammatory breast cancer patient cohort with external data.
Feb 28, 2025
Speaker: Chi Zhang, Oregon Health and Science University (OHSU)
Title: Advancing data-driven systems biology approaches to study metabolic variations in cancer
New Location: 1345 HSLC
Abstract: The functional activities of biological systems include both intracellular functions such as transcriptional regulation, metabolism, and signaling transduction, and intercellular activities such as cell-microenvironment and cell-cell interactions. In recent years, we focused on developing new systems biology approaches to quantify metabolic activities and metabolic cross-talks in disease tissue microenvironment using omics data. We developed a research framework named “data-driven and AI empowered systems biology”, which aims to quantify biological processes and approximate their dynamic properties using non-time course omics data. We have established mathematical foundations, including computational principles, new learning functions, optimizers, and relevant theories, and generalized its application from metabolic systems to biosynthesis and processing of large molecules, transcriptional regulation, signaling transduction, and cell-tissue interactions, as well as enabled the usage of multi-omics data. By applying this approach to different disease systems, we identified new drug targets to improve the efficacy of immunotherapy for cancer treatment and predicted the trend of metabolic shifts throughout cancer progression.
March 7, 2025
Speaker: Sameer Deshpande, UW-Madison Department of Statistics
Title: Scalable piecewise smoothing in high dimensions with BART
Abstract: Bayesian Additive Regression Trees (BART) is an easy-to-use and highly effective nonparametric regression model that approximates unknown functions with a sum of binary regression trees (i.e., piecewise-constant step functions) that one-hot encode categorical predictors. Consequently, BART is fundamentally limited in its ability to “borrow strength” across multiple categories and to estimate smooth functions. Initial attempts to overcome this second limitation replaced the constant output in each leaf of a tree with a realization of a Gaussian Process (GP). While these elaborations are conceptually elegant, most implementations are computationally prohibitive, displaying a cubic per-iteration complexity.
We propose ridgeBART, an extension of BART built with trees that (i) can assign multiple categorical levels to both branches of a decision tree node and (ii) output linear combinations of ridge functions (i.e., a composition of an affine transformation of the inputs and a non-linearity). We develop a new MCMC sampler that updates trees in linear time and derive near-minimax-optimal posterior contraction rates for estimating smooth and piecewise smooth functions. We demonstrate ridgeBART’s effectiveness on synthetic data and use it to estimate the probability that a professional basketball player makes a shot from any location on the court in a spatially smooth fashion.
The talk is based on the following papers:
https://arxiv.org/abs/2411.07984 and https://arxiv.org/abs/2211.04459
March 14, 2025
Speaker: Hua Zhou, University of California Los Angeles
Title: MM Optimization Algorithms for Analyzing Big Biomedical Data
Abstract: The majorization-minimization (MM) principle is an extremely general framework for deriving optimization algorithms. It includes the expectation maximization (EM) algorithm, proximal gradient algorithm, concave-convex procedure, quadratic lower bound algorithm, and proximal distance algorithm as special cases. Besides numerous applications in statistics, optimization, and imaging, the MM principle finds wide applications in large-scale machine learning problems such as matrix completion, discriminant analysis, and nonnegative matrix factorizations. This talk presents some novel applications of the MM principle in the big data setting. We derive a parallel block least squares algorithm that allows parallel update of regression coefficients with a large feature matrix partitioned by columns. We introduce a deweighting technique for weighted least squares that dramatically accelerates the fitting of generalized linear models and quantile regression. We also present an MM algorithm for fitting large-scale variance component models that provably converges faster than the classical EM algorithm.
March 21, 2025
Speaker: Leng Han, Indiana University
Title: Harnessing big data for precision medicine
Abstract: Despite advancements in treatment options for cancer, a majority of cancer types continue to lack fully characterized and effective targeted therapies to improve disease diagnostics, prognoses, and patient survival outcomes. Therefore, there is an urgent need to gain a more comprehensive understanding of the molecular basis of diseases and develop novel prognostic and therapeutic strategies. Our lab utilizes cutting-edge techniques in systems biology to understand the molecular mechanisms of complex diseases. We conducted a series of pan-cancer analyses to provide clinical insights into cancer therapy, including RNA targeted therapy (Journal of the National Cancer Institute, 2018; Genome Medicine, 2019a; Genome Medicine, 2019b; Nature Communications, 2019; Cancer Research, 2022), chronotherapy (Cell Systems, 2018), hypoxia-targeted therapy (Nature Metabolism, 2019), target therapy (Genome Medicine, 2020a), autophagy-targeted therapy (Nature Communications, 2022), and immunotherapy (Nature Immunology, 2019; Nature Communications, 2020a; Nature Communications, 2020b; Genome Medicine, 2020b; Advanced Science, 2020; Journal of the National Cancer Institute, 2021; Cancer Cell, 2021; The Innovation, 2021; Journal for Immunotherapy of Cancer, 2022; Nature Reviews Clinical Oncology, 2022; Cell Metabolism, 2023; The Innovation, 2023; Nature Reviews Clinical Oncology, 2023). These studies shed light on future clinical considerations for the development of innovative therapies for cancer types currently lacking effective treatment options. We will further develop highly innovative prognostic and therapeutic strategies with the potential to produce a major impact on biomedical research.
March 28, 2025 – spring break
April 4, 2025
Speaker: Yiqiao Zhong, University of Wisconsin-Madison
Title: Do you interpret your t-SNE embeddings correctly? A perspective from map-continuity and leave-one-out.
Abstract: Neighbor embedding methods such as t-SNE, UMAP, and LargeVis are widely used for visualizing high-dimensional data. A common belief is that these methods serve as nonlinear dimension reduction tools which, similar to PCA, learn low-dimensional manifold structures from the data.
In this talk, I will present evidence to show that this view is inaccurate: the embedding maps of t-SNE, UMAP, and LargeVis can exhibit discontinuity points, leading to unintended topological distortions. A key challenge in analyzing these visualization methods is that the embedding points are obtained by solving highly complicated optimization problems. To address this, I’ll introduce the leave-one-out (LOO) surrogate, or LOO-map, which captures the properties of embedding maps. Our analysis identifies two types of discontinuity patterns: (1) global discontinuities, which promote artificial cluster structures, and (2) local discontinuities, which promote subclusters. To mitigate these issues, I’ll propose two diagnostic pointwise scores that help detect out-of-distribution samples in deep learning and assist hyperparameter tuning in single-cell data analysis.
This talk is based on a joint work with Zhexuan Liu (3rd-year Stats PhD student) and Rong Ma (Harvard Biostatistics): arXiv:2410.16608.
April 11, 2025
Speaker: Wenjin Jim Zheng, University of Texas Houston
Title: Leveraging LLMs and AI Agent Networks for Community-based Gene Set and Cell Type Annotation
Abstract: Single-cell RNA sequencing has transformed our ability to identify diverse cell types and their transcriptomic signatures. However, annotating these signatures – especially those involving poorly characterized genes – remains a major challenge. Traditional gene set analysis methods, such as Gene Set Enrichment Analysis (GSEA), rely heavily on well-curated annotations and often underperform in such contexts. Large Language Models (LLMs) offer a promising alternative but struggle to represent complex biological knowledge within structured ontologies. To address this, we present a novel approach that integrates free-text descriptions with ontology labels for more accurate and robust gene set annotation. Our method outperforms state-of-the-art tools, correctly annotating over 68% of gene sets within the top five predictions. By incorporating retrieval-augmented generation (RAG), we developed a robust agentic workflow that refines predictions using relevant PubMed literature to reduce hallucinations and enhance interpretability. Using this workflow, we annotated 5,322 brain cell clusters from the complete mouse brain cell atlas generated by the BRAIN Initiative Cell Census Network, creating a valuable resource to support community-driven cell type annotation efforts.
April 18, 2025 – Zhengling Qi, George Washington University
April 25, 2025
Speaker: Zhiyong Lu, National Institutes of Health
Title: Large Language Models in Biomedicine: from TrialGPT to GeneAgent
Abstract: The explosion of biomedical big data and information in the past decade or so has created new opportunities for discoveries to improve the treatment and prevention of human diseases. As such, the field of medicine is undergoing a paradigm shift driven by AI-powered analytical solutions. This talk explores the role of AI and Large Language Models (LLMs) in transforming biomedical discovery and healthcare, through the demonstration of some real-world use cases such as improving PubMed searches (Best Match, Nature Biotechnology, 2018), assisting patient-to-trial matching (TrialGPT, Nature Communications, 2024), and improving gene set analysis (GeneAgent, Nature Methods, 2025). This talk will also address the challenges and limitations of using AI/LLMs in this field.