STAT 548 PhD Qualifying Papers (2022 - 2023)

Sep 1, 2022

Introduction

I am interested in almost all problems in computational biology and genomics. I expect a student to propose novel statistical approaches that can address challenges in data analysis and modelling of high-dimensional, large-volume biological problems. In many cases, I can provide relevant data sets and help students find collaborators to facilitate publications in a biological venue. Feel free to contact me (ypp@stat.ubc.ca).

An expected format for the report

You may organize your report into the following sections.

Problem definition (1 page): Extract mathematical/statistical problems from the paper and organize them. What are the input data? What is the expected output?
Significance (1-2 paragraphs): Why is this an interesting problem? What can be learned by studying this problem? Why is it exciting for you?
Author contribution: How did the author(s) find the solution? What was a novel contribution beyond traditional approaches?
Limitations/challenges (1-2 paragraphs): What are the assumptions? Are they realistic? Which technical limitations do the authors acknowledge, and which do they leave unaddressed?
Novel idea/methods (2 pages): Propose your idea and statistical methods. You could interpret the underlying problem in a different formulation. Which related problems or frameworks did the authors not adopt?
Results (1-2 pages): Include one figure that sketches your approaches. Show tables and figures that clearly demonstrate your methods.
Discussion (1 page): Briefly discuss what you have learned and what you would achieve if you were to develop this to a full paper. How would you validate your findings in independent studies, including wet-lab experiments?

Available Papers

Romano, Sesia, and Candes, Deep Knockoffs
- Category: variable selection, causal inference
- Idea: Implement the method using torch (R or Python) and test on synthetic data. Try a variety of scoring functions and benchmark them for genetics applications. Real-world genetics data will be made available upon request.
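As a starting point for comparing scoring functions, here is a minimal Python sketch of the knockoff+ selection step from the knockoff filter literature. It takes a vector of feature statistics W_j (a large positive value favours the original variable over its knockoff) and returns a selected set at a target FDR level q. The function names and the toy statistics are hypothetical, and generating the knockoffs themselves (the deep generative part of the paper) is not shown.

```python
def knockoff_plus_threshold(w, q):
    """Smallest t > 0 with (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q."""
    # Only the observed nonzero magnitudes need to be checked as thresholds.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        n_neg = sum(1 for x in w if x <= -t)
        n_pos = sum(1 for x in w if x >= t)
        if (1 + n_neg) / max(1, n_pos) <= q:
            return t
    return float("inf")  # no threshold meets the target level


def knockoff_select(w, q):
    """Indices of variables whose statistic reaches the knockoff+ threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Toy statistics (made up for illustration only).
toy_w = [5.0, 4.0, 3.5, 3.0, -0.5, 2.5, 0.2, -0.1, 2.0, 1.5]
threshold = knockoff_plus_threshold(toy_w, q=0.2)
selected = knockoff_select(toy_w, q=0.2)
```

Any scoring function that is antisymmetric under swapping a variable with its knockoff (for example, a difference of absolute regression coefficients) can supply the statistics passed to this step.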
Gu, Blaauw, and Welch, Variational Mixtures of ODEs for Inferring Cellular Gene Expression Dynamics
- Category: single-cell genomics, dynamics, variational inference
- Idea: The goal is to understand the method and its implementation in detail, in the broad context of likelihood-free inference methods. One can compare this paper with relevant methods based on variational inference, such as Ryder et al., Black-box variational inference for stochastic differential equations. Several review papers are also available.
Zheng, Aragam, Ravikumar, and Xing, DAGs with NO TEARS: Continuous Optimization for Structure Learning (TAKEN)
- Category: combinatorial optimization, causal inference
- Idea: Implement the method using torch (R or Python) and test on synthetic data (e.g., linear Gaussian or generalized linear models). The current framework was primarily built on observational data. You are welcome to extend the approach to a data set with multiple experimental conditions, so that the method can discover causal relationships implied by the data-generating scheme.
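The continuous formulation in this paper rests on the acyclicity function h(W) = tr(exp(W ∘ W)) - d, which is zero exactly when the weighted adjacency matrix W describes a DAG. The NumPy sketch below evaluates h with a truncated power series instead of a library matrix exponential; the function name, the truncation length, and the toy matrices are assumptions made for illustration, and a torch version would follow the same structure.

```python
import numpy as np


def notears_acyclicity(W, n_terms=30):
    """Approximate h(W) = tr(exp(W * W)) - d with a truncated power series."""
    W = np.asarray(W, dtype=float)
    d = W.shape[0]
    A = W * W              # elementwise square, so every entry is non-negative
    term = np.eye(d)       # holds A^k / k!, starting at k = 0
    total = float(d)       # k = 0 contributes tr(I) = d
    for k in range(1, n_terms + 1):
        term = term @ A / k
        total += np.trace(term)
    return total - d


# Toy graphs: a chain 0 -> 1 -> 2, and the same chain with an added edge 2 -> 0.
dag = np.array([[0.0, 1.5, 0.0],
                [0.0, 0.0, -2.0],
                [0.0, 0.0, 0.0]])
cyclic = dag.copy()
cyclic[2, 0] = 0.5

h_dag = notears_acyclicity(dag)
h_cyclic = notears_acyclicity(cyclic)
```

Here h_dag is zero because no closed walk exists in the chain, while h_cyclic is strictly positive because the extra edge creates a cycle of length three.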
Jung, Kasiviswanathan, Tian, Janzing, Bloebaum, and Bareinboim, On Measuring Causal Contributions via do-interventions
- Category: causal inference
- Idea: The paper takes a theoretical (axiomatic) approach to a causal inference problem. What are the differences between the do-Shapley and other conventional approaches? Several routes can be taken: (1) Understand the framework and methods; revisit the theorem and prove it in your own words. (2) Propose an efficient algorithm to handle a large number of variables in do-Shapley computations.
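To see why route (2) matters, it helps to write out the exact Shapley computation, which enumerates every coalition and therefore scales exponentially in the number of variables. The Python sketch below is generic: it accepts any coalition value function. In the do-Shapley setting one would assume the value of a coalition S is an interventional quantity such as E[Y | do(X_S = x_S)]; the toy value function used here is hypothetical and has no causal meaning.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating all coalitions (exponential cost)."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            # Weight |S|! (n - |S| - 1)! / n! for coalitions S of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


def toy_value(coalition):
    """Hypothetical value: additive effects plus a bonus when 0 and 1 co-occur."""
    base = {0: 1.0, 1: 2.0, 2: 0.5}
    bonus = 1.0 if {0, 1} <= coalition else 0.0
    return sum(base[j] for j in coalition) + bonus


phi = shapley_values([0, 1, 2], toy_value)
```

The interaction bonus is split evenly between variables 0 and 1, and the values sum to the difference between the full coalition and the empty one, which is the efficiency property.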
Hill, Bayesian Nonparametric Modeling for Causal Inference (2012); Louizos et al. Causal Effect Inference with Deep Latent-Variable Models (2017)
- Category: causal inference, Bayesian inference
- Idea: Using counterfactual data modelling/sampling, we can train a model that captures a latent representation of confounding factors. Can we combine the idea of a counterfactual model with stochastic variational inference that consumes stochastically generated minibatch data to update parameters in each epoch? If so, what is the optimal sampling strategy to feed counterfactual data to the inference engine?
Wang, Sarkar, Carbonetto, and Stephens, A simple new approach to variable selection in regression, with application to genetic fine mapping, 2020
- Reference site: SuSiE
- Other related papers: Zhu and Stephens, Bayesian large-scale multiple regression with summary statistics from genome-wide association studies (2017). Zou, Carbonetto, Wang, and Stephens, Fine-mapping from summary data with the “Sum of Single Effects” model (2022)
- Category: statistical genetics, summary statistics-based inference.
- Idea: Multiple directions are possible. Extend the idea to a summary statistics matrix with more than two columns. Show correspondence with existing polygenic risk prediction methods.
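For orientation, the building block of the sum-of-single-effects idea is a single-effect regression, in which exactly one variable is assumed to have a nonzero effect. The Python sketch below computes posterior inclusion probabilities for one such effect directly from summary statistics, assuming a uniform prior over variables, a normal prior on the effect size, and a Wakefield-style approximate Bayes factor. It is not the full iterative SuSiE procedure, and the function name and toy inputs are hypothetical.

```python
import math


def single_effect_pips(z, se, prior_var):
    """Posterior inclusion probabilities under a one-causal-variable model.

    z: z-scores, se: standard errors of the effect estimates,
    prior_var: prior variance of the single nonzero effect (assumed known).
    """
    log_bf = []
    for z_j, s_j in zip(z, se):
        s2 = s_j ** 2
        shrink = prior_var / (prior_var + s2)
        # log approximate Bayes factor for variable j against the null.
        log_bf.append(0.5 * math.log(s2 / (prior_var + s2))
                      + 0.5 * z_j ** 2 * shrink)
    # Normalise in log space for numerical stability (uniform prior cancels).
    m = max(log_bf)
    weights = [math.exp(v - m) for v in log_bf]
    total = sum(weights)
    return [w / total for w in weights]


# Toy summary statistics (made up for illustration only).
pips = single_effect_pips(z=[0.5, 4.0, -1.0], se=[0.1, 0.1, 0.1], prior_var=0.04)
```

Extending this to a summary statistics matrix with several columns would require replacing the scalar Bayes factor with a multivariate one, which is one way to frame the proposed direction.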