Technical details of LDAK-KVIK

LDAK-KVIK has two steps when used for single-SNP association analysis, and three steps when used for both single-SNP and gene-based association analysis. Below, we summarize the LDAK-KVIK algorithm for each step. For an extensive overview of the LDAK-KVIK algorithm, we refer to the LDAK-KVIK publication (see preprint).

Step 1. Construct LOCO PRS and estimate \(\lambda\)

Test for structure
Estimate \(\alpha\) and \(h^2_j\) using partitioned randomized Haseman-Elston Regression
Revise the estimates of \(h^2_j\) using Monte Carlo REML
Determine suitable elastic net hyperparameters via cross-validation
Contruct LOCO PRS and estimate \(\lambda\)

Step 2. Single-SNP association analysis

Calculate uncalibrated test statistics
Scale test statistics

Step 3. Gene-based association analysis

Run LDAK-GBAT using results from single-SNP association analysis

First we describe each operation in turn, then we explain some implementation details. As a reminder, all regressions use residual genotypes and phenotypes, because this removes the need to include the covariate matrix \(Z\). The residual phenotypes (genotypes) are obtained by computing \(Y = HY^{\top} (X = HX^{\top})\), where \(H = I − Z(Z^{\top}Z)^{−1}Z^{\top}\), then scaling \(Y\) (columns of \(X\)) to have variance one.

Step 1

Operation 1a - Test for structure.

LDAK-KVIK picks 512 SNPs semi-randomly from across the genome, then computes \(\bar \rho^2\), the average squared correlation between pairs of SNPs on different chromosomes. LDAK-KVIK determines there is strong structure if both \(\bar \rho^2\) is significantly greater than 0, and \(n \bar \rho^2>0.1\) (this is an estimate of the maximum average inflation of test statistics due to structure).

Operation 1b - Estimate \(\alpha\) and \(h^2_j\) using partitioned randomized Haseman-Elston Regression.

LDAK-KVIK obtains \(\hat \alpha\) and \(\hat h^2\), estimates of the power parameter and heritability parameter, respectively, by running randomized Haseman-Elston Regression. LDAK-KVIK considers five different values for \(\alpha\) (-1, -0.75, -0.5, -0.25, 0), then sets \(\hat \alpha\) to the value that results in best-fitting covariance matrix, and sets \(\hat h^2\) to the corresponding estimate of \(h^2\). LDAK-KVIK then obtains \(\hat h^2_j\), estimates of the per-SNP heritabilties, by setting \(\hat h^2_j = w_j \hat h^2/W\), where \(w_j=[f_j (1-f_j)]^{1+\hat \alpha}\) and \(W = \sum_j w_j\).

Operation 1c - Operation 1c - Revise the estimates of \(h^2_j\) using Monte Carlo REML.

LDAK-KVIK computes the REML estimate of \(h^2\) by iteratively solving equations involving the heritability estimate, covariance matrix and phenotype (see the LDAK-KVIK publication for more details). The Haseman-Elston estimate of \(h^2\) is used in the first iteration, and updated until converged to an approximate solution of the system of equations. Finally, LDAK-KVIK recomputes the estimates of \(h^2_j\) using the revised estimate of \(h^2\) (continuing to use the estimate of \(\alpha\) from randomized Haseman-Elston Regression).

Operation 1d - Determine suitable elastic net hyperparameters via cross-validation

When constructing Elastic Net PRS, LDAK-KVIK assumes a prior for SNP effect sizes:

\[\gamma_j \sim p DE\left ( \left[\frac{2p}{(1-F)\hat h^2_j} \right ]^{0.5} \right ) + (1-p)N\left(0,\frac{F\hat h^2_j}{1-p}\right )~~~~ \mbox{and} ~~~~ e\sim N(0,1-\hat h^2).\]

The parameter \(p\) determines the contribution of the lasso component, while the parameter \(F\) determines the expected contribution to variance from the ridge regression component. Note that this prior distribution is constructed so that \(E[h^2_j]=\hat h^2_j\) (i.e., the expected per-SNP heritabilities match their estimates from Operation 1c).

LDAK-KVIK uses cross-validation to obtain suitable values for \(p\) and \(F\). By default, it uses 90% of individuals to construct ten genome-wide PRS, for \((p,F)\) equal to (0.5,0.5), (0.5,0.3), (0.5,0.1), (0.1,0.5), (0.1,0.3), (0.1,0.1), (0.01,0.5), (0.01,0.3), (0.01,0.1) and (0,1). It then measures the accuracy of these PRS on the remaining 10% of individuals, selecting the values for \(p\) and \(F\) that result in the PRS with smallest mean-squared error. Note that if \(MSE\) denotes the smallest mean-squared error, then LDAK-KVIK uses \(n/MSE\) as an estimate of the effective sample size for the final association analysis (recall that \(Var(Y)=1\), so \(1/MSE\) estimates the reduction in phenotypic variance when using the LOCO Elastic Net PRS as offsets).

Operation 1e - Construct LOCO PRS and estimate \(\lambda\).

LDAK-KVIK constructs \(P_c\), the \(c\)th LOCO PRS, assuming the elastic net prior distribution with optimal hyperparameters estimated in step 1d. The SNPs on chromosome \(c\) are omitted when computing \(P_c\), such that a different PRS is constructing for each chromosome.

If operation 1a determines there is low structure, LDAK-KVIK sets \(\lambda=1\). If the test in Operation 1a found strong structure, LDAK-KVAK instead constructs a LOCO PRS \(P_c\) assuming the ridge regression prior distribution. It then uses least-squares regression using the model \(E(Y - P_c) = X_j \beta_j\) to compute two sets of unscaled \(\chi^2 (1)\) statistics for 65,000 randomly-picked SNPs: one set of statistics for \(P_c\) constructed using the elastic net prior (denoted by \(U_j\)), and one set of statistics for \(P_c\) constructed using the ridge regression prior (denoted by \(T_j\)). LDAK-KVIK uses the Grammar-Gamma formula to compute \(\lambda'\), the scaling factor corresponding to the set of statistics corresponding to ridge regression. Finally, \(\lambda\) is computed such that the mean of the statistics obtained from the elastic net matches those of the scaled statistics from ridge regression: \(\lambda \sum_{j \in S} T_j == \lambda' \sum_{j \in S} U_j\). Here S is the subset of SNPs such that \(\lambda U_j < 6\) and \(\lambda' T_j < 6\), selecting only weakly-associated SNPs.

Step 2

Operation 2a - Calculate uncalibrated test statistics.

If SNP \(j\) lies on Chromosome \(c\), then LDAK-KVIK tests for association using ordinary least-squares regression with the model \(E[Y-P_c]= X_j \beta_j\), where \(P_c\) is the LOCO PRS constructed in Operation 1e. The estimated effect size is

\[\hat \beta_j = \frac{X^T_j (Y-P_c)}{X^T_jX_j},\]

with estimated variance

\[Var(\hat \beta_j) = \frac{\hat s^2}{X^T_jX_j} ~~~~ \mbox{where} ~~~~ \hat s^2= \frac{(Y-P_c)^T(Y-P_c)}{n-q},\]

and the corresponding \(\chi^2(1)\) test statistic is

\[U_j = \frac{(X^T_j (Y-P_c))^2}{X^T_jX_j \times \hat s^2}.\]

Operation 2b - Scale test statistics.

LDAK-KVIK reports three values for each SNP: an effect size estimate \(\epsilon_1\), an estimate of the variance of the effect size estimate \(\epsilon_2\), and a \(\chi^2(1)\) test statistic \(\epsilon_3\). LDAK-KVIK sets \(\epsilon_1 =\lambda \hat \beta_j\) and \(\epsilon_3=\lambda U_j\), where \(\hat \beta_j\) and \(U_j\) are, respectively, the estimated effect size and test statistic calculated in Operation 2a, while \(\lambda\) is the estimated scaling factor from Operation 1e. LDAK-KVIK then sets \(\epsilon_2=\epsilon_1^2/\epsilon_3\), which ensures that the three reported values are consistent (i.e., that \(\epsilon_3=\epsilon_1^2/\epsilon_2\)).

Step 3

Operation 3a - Run LDAK-GBAT using the results from single-SNP association analysis.

LDAK-KVIK uses the summary statistics of single-SNP analysis to run LDAK-GBAT. LDAK-GBAT performs gene-based association analysis using GWAS summary statistics and a reference panel, and uses REstricted Maximum Likelihood (REML) to solve the model:

\[Y \sim N(0,K_S\sigma^2_S + I(1-\sigma^2_S))\]

Here, \(K_S\) is a “genomic” relatedness matrix constructed using only SNPs within the gene being tested. LDAK-GBAT performs a likelihood ratio test on the alternative hypothesis that \(\sigma^2_S>0\).