STAT 548 PhD Qualifying Papers (2022 - 2023)
Introduction
I am interested in almost all problems in computational biology and genomics. I expect a student to propose novel statistical approaches that can address challenges in data analysis and modelling of high-dimensional, large-volume biological problems. In many cases, I can provide relevant data sets and help students find collaborators to facilitate publications in a biological venue. Feel free to contact me (ypp@stat.ubc.ca).
An expected format for the report
You may organize your report into the following sections.
Problem definition (1 page): Extract mathematical/statistical problems from the paper and organize them. What are the input data? What is the expected output?
Significance (1-2 paragraphs): Why is this an interesting problem? What can be learned by studying this problem? Why is it exciting for you?
Author contribution: How did the author(s) find the solution? What was a novel contribution beyond traditional approaches?
Limitations/challenges (1-2 paragraphs): What are the assumptions? Are they realistic? Which technical limitations do the authors acknowledge, and which do they leave unaddressed?
Novel idea/methods (2 pages): Propose your idea and statistical methods. You could interpret the underlying problem in a different formulation. Which related problems or frameworks did the authors not adopt?
Results (1-2 pages): Include one figure that sketches your approaches. Show tables and figures that clearly demonstrate your methods.
Discussion (1 page): Briefly discuss what you have learned and what you would achieve if you were to develop this to a full paper. How would you validate your findings in independent studies, including wet-lab experiments?
Available Papers
Romano, Sesia, and Candes, Deep Knockoffs
Category: variable selection, causal inference
Idea: Implement the method using torch (R or Python) and test on synthetic data. Test a variety of scoring functions and benchmark for genetics applications. Real-world genetics data will be made available upon request.
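Whichever scoring function is tested, its per-variable statistics W_j (large and positive when the original variable looks more important than its knockoff) typically feed the same selection step. The following is a minimal sketch of that step, assuming the standard knockoff+ threshold of Barber and Candes; the function names are hypothetical, and the statistics in the usage note are made up for illustration.

```python
import math


def knockoff_threshold(w, q, offset=1):
    """Smallest t > 0 whose estimated false discovery proportion is <= q.

    w      : list of knockoff statistics W_j (one per variable)
    q      : target false discovery rate
    offset : 1 gives the knockoff+ rule, 0 the more liberal variant
    """
    # Only the observed nonzero magnitudes need to be tried as thresholds.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        n_neg = sum(1 for x in w if x <= -t)  # proxy for false discoveries
        n_pos = sum(1 for x in w if x >= t)   # variables that would be selected
        if (offset + n_neg) / max(1, n_pos) <= q:
            return t
    return math.inf  # no threshold reaches the target level


def knockoff_select(w, q, offset=1):
    """Indices of the variables selected at target FDR level q."""
    tau = knockoff_threshold(w, q, offset)
    return [j for j, x in enumerate(w) if x >= tau]
```

For example, knockoff_select([5, 4, 3, 2.5, 2, 1.5, -1, 0.5, -0.2, 3.5], q=0.2) keeps the seven variables whose statistic is at least 1.5. Keeping the selection step fixed makes it easier to compare scoring functions on equal terms.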
Gu, Blaauw, and Welch, Variational Mixtures of ODEs for Inferring Cellular Gene Expression Dynamics
Category: single-cell genomics, dynamics, variational inference
Idea: The goal is to understand the method and implementation in detail in a broad context of likelihood-free inference methods. One can compare this paper with relevant methods based on variational inference, such as Ryder et al., Black-box variational inference for stochastic differential equations. Several review papers are available like this.
Zheng, Aragam, Ravikumar, and Xing, DAGs with NO TEARS: Continuous Optimization for Structure Learning TAKEN
Category: combinatorial optimization, causal inference
Idea: Implement the method using torch (R or Python) and test on synthetic data (e.g., linear Gaussian or generalized linear models). The current framework was primarily built on observational data. You are welcome to extend the approach to a data set with multiple experimental conditions, so that the method can discover causal relationships implicated by the data-generating scheme.
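As a starting point for an implementation, the paper's central device is a smooth function of the weighted adjacency matrix W, h(W) = tr(exp(W ∘ W)) − d, which is zero exactly when W encodes a DAG. The sketch below evaluates h with a truncated power series in plain Python so that it runs without dependencies; this is an assumption made for illustration only (the function name and the n_terms parameter are hypothetical), and a torch implementation would use a library matrix exponential to obtain gradients.

```python
def notears_acyclicity(w, n_terms=30):
    """Approximate h(W) = tr(exp(W * W)) - d, with * the elementwise product.

    w       : square weighted adjacency matrix as a list of lists
    n_terms : number of power-series terms (adequate for small matrices
              with moderate weights; not a general-purpose expm)
    """
    d = len(w)
    # Elementwise square, so every entry is nonnegative.
    m = [[w[i][j] ** 2 for j in range(d)] for i in range(d)]
    # term holds M^k / k!, starting from the identity (k = 0).
    term = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    trace = float(d)
    for k in range(1, n_terms + 1):
        term = [
            [sum(term[i][l] * m[l][j] for l in range(d)) / k for j in range(d)]
            for i in range(d)
        ]
        trace += sum(term[i][i] for i in range(d))
    return trace - d
```

A strictly upper-triangular W (a DAG) gives h = 0, while a two-node cycle with unit weights gives h = 2 cosh(1) − 2, roughly 1.086. Checks of this kind are useful unit tests before adding the augmented Lagrangian optimization.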
Jung, Kasiviswanathan, Tian, Janzing, Bloebaum, and Bareinboim, On Measuring Causal Contributions via do-interventions
Category: causal inference
Idea: The paper takes a theoretical (axiomatic) approach to a causal inference problem. What are the differences between the do-Shapley and other conventional approaches? Several routes can be taken: (1) Understand the framework and methods; revisit the theorem and prove it in your own words. (2) One can propose an efficient algorithm to handle a large number of variables in do-Shapley computations.
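To see why route (2) matters, it helps to write out exact Shapley values for a generic value function. The sketch below assumes that do-Shapley plugs an interventional quantity such as ν(S) = E[Y | do(X_S = x_S)] into the usual Shapley formula; here value is a hypothetical callable standing in for that quantity, and the toy function in the usage note is not derived from any causal model.

```python
from itertools import combinations
from math import factorial


def shapley_values(n, value):
    """Exact Shapley values for players 0..n-1.

    value : callable mapping a frozenset of player indices to a number,
            e.g. an estimate of E[Y | do(X_S = x_S)] (hypothetical here).
    The loop visits every subset, so the cost grows as 2^n.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_value(s):
    """Toy value function: two main effects plus one interaction."""
    return 2.0 * (0 in s) + 1.0 * (1 in s) + 3.0 * (0 in s and 2 in s)
```

With toy_value, shapley_values(3, toy_value) returns approximately [3.5, 1.0, 1.5]: the interaction between variables 0 and 2 is split equally between them, and the values sum to the value of the full set. The exponential number of value evaluations is what an efficient algorithm would need to avoid, for instance by sampling permutations.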
Hill, Bayesian Nonparametric Modeling for Causal Inference (2012); Louizos et al. Causal Effect Inference with Deep Latent-Variable Models (2017)
Category: causal inference, Bayesian inference
Idea: Using counterfactual data modelling/sampling, we can train a model that captures a latent representation of confounding factors. Can we combine the idea of a counterfactual model with stochastic variational inference that consumes stochastically generated minibatch data to update parameters in each epoch? If so, what is the optimal sampling strategy to feed counterfactual data into the inference engine?
Wang, Sarkar, Carbonetto, and Stephens, A simple new approach to variable selection in regression, with application to genetic fine mapping, 2020
Reference site: SuSiE
Other related papers: Zhu and Stephens, Bayesian large-scale multiple regression with summary statistics from genome-wide association studies (2017). Zou, Carbonetto, Wang, and Stephens, Fine-mapping from summary data with the “Sum of Single Effects” model (2022)
Category: statistical genetics, summary statistics-based inference.
Idea: Multiple directions are possible. Extend the idea to a summary statistics matrix with more than two columns. Show the correspondence with existing polygenic risk prediction methods.
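As a reminder of the building block that such extensions would generalize, the sketch below computes posterior inclusion probabilities for a single effect from per-variant summary statistics (effect estimate and standard error). It assumes a normal prior on the effect and the standard normal-approximation (Wakefield-style) Bayes factor, and it treats variants marginally, so it ignores linkage disequilibrium and the iteration over multiple effects that the full method performs. The function name and arguments are hypothetical.

```python
import math


def single_effect_pips(beta_hat, se, prior_var, prior_weights=None):
    """Posterior inclusion probabilities under a one-causal-variant model.

    beta_hat      : marginal effect estimates, one per variant
    se            : their standard errors
    prior_var     : prior variance of the nonzero effect
    prior_weights : prior probability that each variant is the causal one
                    (uniform if omitted)
    """
    p = len(beta_hat)
    if prior_weights is None:
        prior_weights = [1.0 / p] * p
    log_bf = []
    for b, s in zip(beta_hat, se):
        v = s * s                 # sampling variance of the estimate
        z2 = (b / s) ** 2         # squared z-score
        shrink = prior_var / (prior_var + v)
        # log Bayes factor of "nonzero effect" against "no effect"
        log_bf.append(0.5 * math.log(v / (v + prior_var)) + 0.5 * z2 * shrink)
    # Normalize prior weight times Bayes factor in log space for stability.
    log_post = [math.log(w) + l for w, l in zip(prior_weights, log_bf)]
    top = max(log_post)
    unnorm = [math.exp(l - top) for l in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

With estimates [0.1, 0.5, 0.05], standard errors of 0.1, and prior variance 0.04, nearly all posterior mass falls on the second variant, whose z-score is 5. A summary statistics matrix with more columns would require replacing the scalar Bayes factor with a multivariate one.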