On the analysis methods used in the "omnigenic" paper

A recent Quanta Magazine article on the omnigenic hypothesis reignited my interest, so I read the original omnigenic paper again.

I am not against the model itself

First, please do not get me wrong. I would like to agree with the proposed hypothesis that “gene regulatory networks are sufficiently interconnected such that all genes expressed in disease-relevant cells are liable to affect the functions of core disease-related genes and that most heritability can be explained by effects on genes outside core pathways.” I think we should definitely bring in the concept of genetically associated networks to elucidate the overall impact of random assortments of genetic variants on complex traits. Many people have suggested that genetics needs to embrace the tools and ideas of systems biology. I also think systems genetics will be able to make new breakthroughs, especially in risk prediction for complex traits.

However, the more I read through the paper, the less satisfied I became. I just kept discovering more holes.

Miscalculation of the number of causal SNPs

We next used ashR to analyze the distribution of regression coefficients from the set of all SNPs (Stephens, 2017). ashR models the GWAS results as a mixture of SNPs that have a true effect size of exactly zero, with SNPs that have a true effect size that is not zero.

To the best of my knowledge, ashR was developed to calibrate false discovery rates across multiple independent hits (see the "new deal" paper). We all know that SNPs are correlated within haplotype blocks. If the genetic effects of common SNPs are not independently distributed, the variance of the null component in the ashR model can easily be underestimated, which in turn inflates the number of non-zero effects. If I read the description correctly, their subsequent analysis that takes linkage disequilibrium (LD) structure into account already relies on the results of fitting the ashR model.

Therefore, unless I see the same results from a fine-mapping model or a sparse polygenic model, the following arguments will remain less than convincing to me.
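To make the concern concrete, here is a minimal toy sketch. It is not ashR and not the paper's pipeline: I simply assume block-wise AR(1) LD (correlation `rho ** |i - j|` within a block, zero across blocks) and noiseless marginal effects, and all parameter values are made up. It compares the fraction of truly causal SNPs with the fraction of SNPs whose marginal effect is noticeably non-zero just because they tag a causal neighbor.

```python
import random


def marginal_effects(beta, block_size, rho):
    """Marginal effect per SNP under block-wise AR(1) LD (an assumption):
    m_i = sum_j rho^|i-j| * beta_j, summed over causal SNPs j in the same block."""
    m = [0.0] * len(beta)
    for start in range(0, len(beta), block_size):
        end = min(start + block_size, len(beta))
        causal = [j for j in range(start, end) if beta[j] != 0.0]
        for i in range(start, end):
            m[i] = sum(rho ** abs(i - j) * beta[j] for j in causal)
    return m


def simulate(n_snps=10_000, block_size=50, n_causal=100, rho=0.9, seed=0):
    """Return (fraction of causal SNPs, fraction of SNPs with a sizeable marginal effect)."""
    rng = random.Random(seed)
    beta = [0.0] * n_snps
    for j in rng.sample(range(n_snps), n_causal):
        beta[j] = rng.choice([-1.0, 1.0])  # unit-size causal effects (hypothetical)
    m = marginal_effects(beta, block_size, rho)
    # Count SNPs whose marginal effect is at least 10% of a causal effect size.
    tagged = sum(1 for x in m if abs(x) >= 0.1)
    return n_causal / n_snps, tagged / n_snps


causal_frac, marginal_frac = simulate()
print(f"fraction of causal SNPs:             {causal_frac:.3f}")
print(f"fraction with non-zero marginal effect: {marginal_frac:.3f}")
```

A method that treats each marginal estimate as an independent draw sees the second fraction, not the first, which is why I would like to see the count confirmed by a model that conditions on LD.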

Remark 1:

Given that the typical extent of linkage disequilibrium (LD) is around 10–100 kb, this implies that most 100-kb windows in the genome include variants that affect height.

  • If the 10–100 kb blocks were defined by $r^2$ values from a sliding-window approach, we might be missing long-range LD patterns.

  • How many independent signals will be found after fine-mapping / conditional analysis?

Remark 2:

Stratifying the ashR analysis by the LD score for each SNP, we see a clear effect that SNPs with more LD partners are more likely to be associated with height.

  • The stratification model in the supplementary material is merely a post hoc approach to prevent double-counting, not a joint analysis combining ashR and LD score regression.

Enrichment analysis can be interpreted in many different ways

I am not a big fan of enrichment analysis, although it is heavily used in most data analyses. One of my strongest complaints is that the result really depends on the sizes of the sets: positive hits, negative hits, and their overlap. In practice, we rarely rely on a theoretical null distribution based on set sizes to calibrate a p-value, since a high degree of overlap can easily arise from confounding effects alone. A crucial question is: what is your background for the enrichment? We can only reject the specific background distribution that we crafted for the hypothesis test.
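As a small illustration of how much the background matters, the sketch below computes a one-sided hypergeometric p-value for the same overlap under two different backgrounds. The counts (20,000 genes genome-wide, 5,000 expressed genes, 2,000 annotated genes, 200 hits, 40 overlapping) are hypothetical numbers I chose for the example, and I assume every annotated gene is also in the smaller background.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric: N background items, K annotated, n drawn."""
    upper = min(K, n)
    # math.comb returns 0 when the lower index exceeds the upper, so no extra bounds needed.
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / comb(N, n)


hits, overlap, annotated = 200, 40, 2000  # hypothetical counts

# Background 1: all genes in the genome (expected overlap: 200 * 2000 / 20000 = 20).
p_genome = hypergeom_sf(overlap, 20_000, annotated, hits)

# Background 2: only genes expressed in the tissue (expected overlap: 200 * 2000 / 5000 = 80).
p_expressed = hypergeom_sf(overlap, 5_000, annotated, hits)

print(f"genome-wide background: p = {p_genome:.2e}")
print(f"expressed-gene background: p = {p_expressed:.2e}")
```

The same 40 overlapping hits look strongly enriched against one background and not at all against the other, so the p-value only says something about the background we picked.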

Genetic contribution to disease is heavily concentrated in regions that are transcribed or marked by active chromatin in relevant tissues but there is little enrichment for cell-type-specific regulatory elements versus broadly active regions.

Moreover, since their analysis relied heavily on stratified LD score regression, the enrichment effect sizes (and the corresponding p-values) should be interpreted more carefully. Larger effect sizes do not necessarily mean stronger enrichment. I can easily imagine that a broader annotation category would explain a larger fraction of heritability. In some sense, the following sentence supports my argument against their interpretation.

These enrichments [autoimmune disease and schizophrenia GWAS in the relevant categories] were relatively modest, and for all three diseases, we observed a strong linear relationship between the sizes of the functional categories and the proportion of heritability that they contributed.
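To spell out why a linear size relationship matters, recall the usual definition of enrichment in stratified LD score regression: the proportion of heritability in a category divided by the proportion of SNPs in it. The symbols below are my own notation, with $h^2_C$ the heritability attributed to category $C$, $M_C$ the number of SNPs in it, and $h^2$ and $M$ the genome-wide totals.

$$
\mathrm{Enrichment}(C) = \frac{h^2_C / h^2}{M_C / M}
$$

If the proportion of heritability grows linearly with category size, say $h^2_C / h^2 \approx c \cdot M_C / M$ for some constant $c$, then every category has roughly the same enrichment $c$. Under that reading, a category that explains more heritability is mostly a bigger category, not a more enriched one.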

  • We could adjust for the size of annotations with a comprehensive permutation analysis.
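One way such a permutation analysis could look is sketched below. The circular-shift null is my own choice for illustration, not something taken from the paper: the annotation is rotated along a toy genome of `n_bins` bins, which keeps its size and contiguity fixed while breaking its alignment with the hits. The function name and all positions are hypothetical.

```python
import random


def circular_shift_pvalue(hits, annotated, n_bins, n_perm=1000, seed=0):
    """Empirical one-sided p-value for hit/annotation overlap under a circular-shift null.

    Shifting the annotation by `offset` bins is equivalent to asking whether
    (hit - offset) mod n_bins falls in the original annotation.
    """
    rng = random.Random(seed)
    annotated = set(annotated)

    def overlap(offset):
        return sum(1 for h in hits if (h - offset) % n_bins in annotated)

    observed = overlap(0)
    exceed = sum(
        1 for _ in range(n_perm) if overlap(rng.randrange(1, n_bins)) >= observed
    )
    # Add-one correction so the empirical p-value is never exactly zero.
    return observed, (1 + exceed) / (1 + n_perm)


n_bins = 1000
hits = list(range(100, 200, 5)) + [300, 500, 700]  # 20 clustered hits plus 3 scattered

# A narrow annotation covering the cluster versus a broad one covering everything.
narrow_obs, narrow_p = circular_shift_pvalue(hits, range(100, 200), n_bins)
broad_obs, broad_p = circular_shift_pvalue(hits, range(n_bins), n_bins)

print(f"narrow annotation: overlap = {narrow_obs}, p = {narrow_p:.4f}")
print(f"broad annotation:  overlap = {broad_obs}, p = {broad_p:.4f}")
```

The broad annotation captures more hits in absolute terms, yet every shifted copy does equally well, so the size-matched null gives it no credit. That is the kind of adjustment I have in mind.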

Conclusion

Some people may be able to accept the message and gain useful insights despite the potential issues in the analysis methods. Unfortunately, I am not one of them. Of course, I could be wrong as well. It would probably be more useful if the authors disclosed their analysis pipeline for follow-up studies, even though it is a “perspective” paper.

Many people say that the omnigenic hypothesis is a glorified version of the polygenic model originally proposed by R.A. Fisher. I only partially agree with that simplified view. I think genetics is more complex than that, and just as fascinating. This omnigenic paper is one of many systems genetics efforts, yet it never rules out “monogenic” aspects of disease etiology.
