STAT 548 PhD Qualifying Papers
I am interested in almost all problems in computational biology and genomics. I expect a student to propose novel statistical approaches that can address challenges in data analysis and modelling of high-dimensional, large-volume biological problems. In many cases, I can provide relevant data sets and help students find collaborators to facilitate publications in a biological venue. Feel free to contact me (
An expected format of the report (a rough guide)
You may organize your report into the following sections.
Problem definition (1 page): Extract mathematical/statistical problems from the paper and organize them. What are the input data? What is the expected output?
Significance (1-2 paragraphs): Why is this an interesting problem? What can be learned by studying this problem? Why is it exciting for you?
Author contribution (1-2 paragraphs): How did the author(s) find the solution? What was a novel contribution beyond traditional approaches?
Limitations/challenges (1-2 paragraphs): What are the assumptions? Are they realistic? What technical limitations do the authors acknowledge, and which do they leave unstated?
Novel idea/methods (2 pages): Propose your idea and statistical methods.
You could design a new model and implement the inference algorithm as pseudo-code. Of course, your innovation can be totally original. But, in many cases, you can borrow and extend existing ideas and apply them to your own problem.
Your innovation can be theoretical. You could show that the existing method/algorithm indeed works with certain theoretical guarantees.
You could also interpret the underlying problem from a different perspective. E.g., which potentially related problems or frameworks did the authors not adopt?
Results (1-2 pages): First, include one figure that sketches your idea/approaches. Then, show figures and tables that demonstrate your methods.
Discussion (1 page): Briefly discuss what you have learned and what you would achieve if you were to develop this into a full paper. How would you validate your findings in independent studies, including wet-lab experiments?
Abadie and Imbens, Large sample properties of matching estimators for average treatment effects, Econometrica (2005)
Category: causal inference, single-cell genomics
Idea: Extending the basic idea of this paper, implement a matching-based counterfactual inference method that adjusts for unknown covariates when estimating causal effects. Test your methods for (causally-) differential expression analysis on simulated/real-world single-cell RNA-seq data. You may also propose novel data normalization or confounder-correction methods that clearly hinge on the notion of causality.
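As a starting point, the basic matching idea can be sketched as follows. This is a minimal illustration, not the estimator analysed in the paper: it matches with replacement on observed covariates using an inverse-standard-deviation scaling and omits any bias correction or variance estimation. The function name matching_ate, the simulated data, and all parameter values are assumptions made for this example.

```python
import numpy as np

def matching_ate(X, w, y, n_neighbors=1):
    """Nearest-neighbour matching estimate of the average treatment effect.

    Each unit's missing potential outcome is imputed by the mean outcome of
    its n_neighbors closest units in the opposite treatment arm (matching
    with replacement). No bias correction is applied.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=int)
    y = np.asarray(y, dtype=float)

    # Scale each covariate by its standard deviation so no coordinate dominates.
    scale = X.std(axis=0)
    scale[scale == 0] = 1.0
    Z = X / scale

    y0_hat = np.empty_like(y)
    y1_hat = np.empty_like(y)
    for arm, other in ((0, 1), (1, 0)):
        idx_arm = np.flatnonzero(w == arm)
        idx_other = np.flatnonzero(w == other)
        # Squared distances from units in this arm to units in the other arm.
        d = ((Z[idx_arm, None, :] - Z[None, idx_other, :]) ** 2).sum(axis=2)
        nearest = np.argsort(d, axis=1)[:, :n_neighbors]
        imputed = y[idx_other][nearest].mean(axis=1)
        if arm == 1:
            y1_hat[idx_arm] = y[idx_arm]
            y0_hat[idx_arm] = imputed
        else:
            y0_hat[idx_arm] = y[idx_arm]
            y1_hat[idx_arm] = imputed
    return float(np.mean(y1_hat - y0_hat))

# Toy data (hypothetical): the first covariate confounds treatment and outcome,
# and the true treatment effect is 2.0.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
w = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))
y = 2.0 * w + 3.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

ate_naive = y[w == 1].mean() - y[w == 0].mean()
ate_matched = matching_ate(X, w, y, n_neighbors=3)
print(f"naive difference: {ate_naive:.2f}, matched estimate: {ate_matched:.2f}")
```

In this toy setting the confounder is observed. The project asks for the harder case where it is not, so the covariates to match on would themselves have to be estimated from the expression data.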
Hill, Bayesian Nonparametric Modeling for Causal Inference (2012); Louizos et al. Causal Effect Inference with Deep Latent-Variable Models (2017)
Category: causal inference, Bayesian inference
Idea: Using counterfactual data modelling/sampling, we can train a model that captures a latent representation of confounding factors. Can we combine the idea of a counterfactual model with stochastic variational inference that consumes stochastically-generated minibatch data to update parameters in each epoch? If so, what is the optimal sampling strategy for feeding counterfactual data to the inference engine?
Murray, Adams, and MacKay, Elliptical slice sampling (2010)
Category: Bayesian inference, cell type deconvolution
Idea: Multiple directions are possible. (1) Revisit optimization-based non-conjugate, non-analytical models such as this cell type deconvolution model and quantify the level of uncertainty using elliptical slice sampling (ESS). Test on simulated data, varying the number of putative cell types and the gap between rare and abundant cell types. (2) Another interesting direction is an application to black-box stochastic variational inference. You may design and demonstrate a novel stochastic gradient estimation method that handles a high-dimensional parametric model better than a basic approach.
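For reference, a single ESS update can be sketched as below. The sampler assumes a zero-mean Gaussian prior N(0, Sigma) and an arbitrary log-likelihood. The function name elliptical_slice_step and the conjugate Gaussian toy likelihood are assumptions chosen so that the output can be checked against a known posterior; a deconvolution model would replace log_lik with its own non-conjugate likelihood.

```python
import numpy as np

def elliptical_slice_step(f, log_lik, chol_sigma, rng):
    """One elliptical slice sampling update for a N(0, Sigma) prior.

    f          : current state (1-D array)
    log_lik    : function mapping a state to its log-likelihood
    chol_sigma : lower Cholesky factor of the prior covariance Sigma
    """
    # Auxiliary draw from the prior defines the ellipse through f.
    nu = chol_sigma @ rng.standard_normal(f.shape[0])
    # Slice height; 1 - U lies in (0, 1], which avoids log(0).
    log_y = log_lik(f) + np.log(1.0 - rng.uniform())
    # Initial angle and bracket.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        f_new = f * np.cos(theta) + nu * np.sin(theta)
        if log_lik(f_new) > log_y:
            return f_new
        # Shrink the bracket towards theta = 0 (the current state) and retry.
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)

# Toy check (hypothetical): prior N(0, I), unit-variance Gaussian likelihood.
# The exact posterior is N(y_obs / 2, I / 2).
rng = np.random.default_rng(0)
y_obs = np.array([1.0, -1.0])
log_lik = lambda f: -0.5 * np.sum((y_obs - f) ** 2)
chol_sigma = np.eye(2)

f = np.zeros(2)
samples = np.empty((5000, 2))
for i in range(samples.shape[0]):
    f = elliptical_slice_step(f, log_lik, chol_sigma, rng)
    samples[i] = f

print("posterior mean:", samples.mean(axis=0))
print("posterior variance:", samples.var(axis=0))
```

The update has no step-size parameter to tune, which is part of what makes it attractive for quantifying uncertainty around an optimization-based point estimate.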
Category: statistical genetics, summary statistics-based inference.
Idea: Multiple directions are possible. (1) A new model that allows multiple types of summary statistics matrices. (2) A new inference method that takes advantage of both partially-observed individual-level data and summary statistics. (3) Building on this framework, you may formulate a stratified LD-score regression model with a (Bayesian) sparse prior. (4) Show correspondence with existing polygenic risk prediction methods.
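For direction (3), it may help to recall the standard stratified LD-score regression moment condition, written here in commonly used notation. The symbols are assumptions of this note rather than notation from the paper: N is the GWAS sample size, \tau_c the per-SNP heritability contribution of annotation category c, a a confounding term, and r_{jk} the LD correlation between SNPs j and k.

```latex
\mathbb{E}\left[\chi_j^2\right] = N \sum_{c} \tau_c \, \ell(j, c) + N a + 1,
\qquad
\ell(j, c) = \sum_{k \in c} r_{jk}^2
```

A sparse Bayesian variant would place a prior on the coefficients \tau_c, for example a spike-and-slab prior. The choice of prior and the corresponding inference scheme are left to the proposal.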
Category: single-cell genomics, dynamics, variational inference
Idea: Propose a likelihood-free inference method for the ordinary differential equation (ODE) model used in Bergen et al., Generalizing RNA velocity to transient cell states through dynamical modeling. You are welcome to extend the gene-by-gene ODE model to a finite (or infinite) mixture of ODE models; Rasmussen and Ghahramani is useful reading for this. Alternatively, you can design spatiotemporal partial differential equations and estimate the posterior distributions of their parameters with a black-box inference algorithm. We can then apply your method to spatial transcriptomic data, such as the data in Integrating microarray-based spatial transcriptomics and single-cell RNA-seq reveals tissue architecture in pancreatic ductal adenocarcinomas.
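To make the setting concrete, the sketch below simulates the gene-wise splicing ODE, du/dt = alpha - beta*u and ds/dt = beta*u - gamma*s, for a constant transcription rate, and then recovers the degradation rate gamma with plain rejection ABC. Rejection ABC is used only as a baseline example of a likelihood-free scheme, not as the proposed method. The sketch also makes simplifying assumptions that the real problem does not grant: cell times are treated as known, alpha and beta are fixed at their true values, and only spliced counts are used. All names and numerical values are hypothetical.

```python
import numpy as np

def splicing_solution(t, alpha, beta, gamma, u0=0.0, s0=0.0):
    """Closed-form solution of du/dt = alpha - beta*u, ds/dt = beta*u - gamma*s.

    Valid for constant alpha and beta != gamma; inputs broadcast as arrays.
    """
    eb = np.exp(-beta * t)
    eg = np.exp(-gamma * t)
    u = u0 * eb + alpha / beta * (1.0 - eb)
    s = (s0 * eg + alpha / gamma * (1.0 - eg)
         + (alpha - beta * u0) / (gamma - beta) * (eg - eb))
    return u, s

rng = np.random.default_rng(1)

# Synthetic "observed" spliced abundances at known times (assumption).
t_obs = np.linspace(0.1, 8.0, 40)
alpha_true, beta_true, gamma_true, noise_sd = 2.0, 3.0, 0.5, 0.1
_, s_clean = splicing_solution(t_obs, alpha_true, beta_true, gamma_true)
s_obs = s_clean + rng.normal(scale=noise_sd, size=t_obs.size)

# Rejection ABC: draw gamma from a uniform prior, simulate noisy data,
# and keep the draws whose simulations lie closest to the observations.
n_draws = 20000
gamma_prior = rng.uniform(0.1, 2.0, size=n_draws)  # never equals beta = 3.0
_, s_sim = splicing_solution(t_obs[None, :], alpha_true, beta_true,
                             gamma_prior[:, None])
s_sim = s_sim + rng.normal(scale=noise_sd, size=s_sim.shape)
dist = np.sqrt(np.mean((s_sim - s_obs[None, :]) ** 2, axis=1))

# Accept the closest 1% of draws as an approximate posterior sample.
threshold = np.quantile(dist, 0.01)
gamma_post = gamma_prior[dist <= threshold]
print(f"ABC posterior mean of gamma: {gamma_post.mean():.3f} "
      f"(true value {gamma_true})")
```

Only the forward simulator enters the inference loop, so the same structure carries over when the closed-form solution is replaced by a numerical ODE or PDE solver, or by a mixture of ODE models.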