## Abstract

Two-sample summary data Mendelian randomisation (MR) is a popular method for assessing causality in epidemiology, by using genetic variants as instrumental variables. If genes exert pleiotropic effects on the outcome not through the exposure of interest, this can lead to heterogeneous and (potentially) biased estimates of causal effect. We investigate the use of Bayesian model averaging (BMA) to preferentially search the space of models with the highest posterior likelihood. We develop a bespoke Metropolis-Hasting algorithm to perform the search using the recently developed Robust Adjusted Profile Likelihood (MR-RAPS) of Zhao et al as the basis for defining a posterior distribution that efficiently accounts for pleiotropic and weak instrument bias. We demonstrate how our general modelling approach can be extended from a standard one-parameter causal model to a two-parameter model, to allow a large proportion of SNPs to violate the Instrument Strength Independent of Direct Effect (InSIDE) assumption. We use Monte Carlo simulations to illustrate our methods and compare it to several related approaches. We finish by applying our approach in practice to investigate the changes in causal effect of their resulting high risk metabolite on the development age-related macular degeneration.

## 1 Introduction

The capacity of traditional observational epidemiology to reliably infer whether a health exposure causally influences a disease rests on its ability to appropriately measure and adjust for factors which jointly predict (or confound) the exposure-outcome relationship. Mendelian randomization (MR) [1] avoids bias from unmeasured confounding by using genetic variants as instrumental variables (IVs) [2]. In order for the approach to be valid for testing causality, each specific IV must be robustly associated with the exposure (assumption IV1), independent of any confounders of the exposure and outcome (IV2) and be independent of the outcome given the exposure and the confounders (IV3). This is illustrated by the causal diagram in Figure 1a.

The recent boom of genome-wide association studies (GWAS) [3] has triggered the development of MR approaches that utilise this widely available summary data source. Specifically, ‘two-sample summary data’ MR is a design that derives causal effect estimates with summary statistics obtained from two separate GWAS - one supplying the Single Nucleotide Polymorphism (SNP)-exposure associations and the other supplying the SNP-outcome associations [4–7] - a SNP being the most common type of genetic variation in the genome. If the chosen SNPs are valid IVs, and the causal effect of a unit increase in *X* on the mean value or risk of *Y* is approximately linear in the local region of *X* predicted by these variants [8] then a simple inverse-variance weighted (IVW) meta-analysis of SNP-specific causal estimates provides an approximately unbiased estimate of this causal effect. If sufficient heterogeneity exists between the MR estimates across a set of variants, this suggests that some of the SNPs may in fact violate the IV assumptions. This could be due to assumption IV1 being only weakly satisfied by the genetic variants (i.e. weak instrument bias), which can easily be accounted for [9, 8]. It is however more problematic when the heterogeneity is caused by violations of assumptions IV2 and IV3 [10, 11]. The latter violation is commonly known as “horizontal pleiotropy” [12], and hereafter referred to as pleiotropy for simplicity. Pleiotropy does not necessarily lead to biased causal estimation if it is balanced, in the sense that the average pleiotropic bias across SNPs is zero and the weight each SNP receives in the analysis is also independent of this bias. This latter condition is referred to as the Instrument Strength Independent of Direct Effect (InSIDE) assumption [13, 14]. However, this assumption is itself unverifiable.

Methods have been developed that are naturally robust to pleiotropy and In-SIDE violation. For example, the weighted median estimator [15] provides a consistent estimate under the assumption that 50% of the SNPs are valid IVs (or not pleiotropic). Similarly, mode-based estimation strategies focus on identifying the largest subset of variants yielding a homogeneous causal estimate, and are consistent when this set is made up of valid IVs [16, 17]. These approaches do not make any assumptions the nature of the pleiotropy for invalid SNPs - it could violate InSIDE or not. Other approaches, such as MR-PRESSO [18] and Radial MR [9] attempt to detect and remove SNPs that are deemed responsible for bias and heterogeneity in an MR-analysis. They can in theory provide consistent estimates for the causal effect if the SNPs responsible for InSIDE violation can be removed from the analysis so that only balanced pleiotropy remains. Finally, the Robust Adjusted Profile Score (MR-RAPS) [8] takes a subtly different approach. It accounts for weak instruments and balanced pleiotropy using an adjusted profile likelihood, which penalizes outlying SNPs that may induce bias in the analysis using a robust loss function.

In this paper we develop a method for pleiotropy robust MR analysis with twosample summary data using the general framework of Bayesian Model Averaging (BMA) [19]. BMA incorporates uncertainty about the effects of pleiotropy and weak instruments into MR estimate by pooling the causal effect estimates from all possible combinations of the genetic instruments with appropriate weights. In this paper, we adapt this general approach to the summary data setting where the SNPs are uncorrelated but potentially pleiotropic. Our approach uses the profile likelihood of MR-RAPS [8] as a basis for efficiently modelling the summary data in the presence of weak instrument bias (IV1 violation) and pleiotropy (IV2-3 violation), but with the addition of an indicator function to denote whether an individual SNP is included or disregarded in the model. We develop a bespoke Metropolis-Hastings BMA algorithm to intelligently search the space models defined by all possible SNP subsets (i.e ≈ 2^{L} in the case of *L* SNPs) in order to decide which SNPs to include in the identified set of valid IVs within a given iteration of the markov chain. For this reason, we call our method BayEsian Set IDE Mendelian randomization (BESIDE-MR). BESIDE-MR naturally up-weights large sets of variants that furnish consistent, homogeneous estimates of causal effect, and down-weights sets of variants that provide heterogeneous estimates of causal effect. It also naturally accounts for uncertainty introduced by SNP selection across models, which we will show is important for preserving the coverage of resulting MR estimates.

In Section 2.1 and 2.2 we introduce the methodology behind our basic approach and in Section 3 assess its performance in Monte-Carlo simulations. In Section 4 we show how the basic one-parameter causal model can be extended to account for the case where a substantial proportion of the SNPs exhibit pleiotropy violating the InSIDE assumption. In Section 5 we apply our approach to investigate the causal role of high density lipoprotein cholestorol (XL.HDL.C) on the risk of age related macular degeneration (AMD) using data from the 2019 MR Data Challenge [20]. We conclude with a discussion and point to further research.

## 2 Motivation and Method

### 2.1 Description of the general model

Suppose that we have data from an MR study consisting of N individuals, where for each subject *k* we measure *L* independent genetic variants (*G*_{k1}…*G*_{kL}), an exposure (*X*_{k}) and an outcome (*Y*_{k}). *U*_{k} represents the shared residual error between *X* and *Y* due to confounding, which we wish to overcome using IV methods. We assume the following linear structural models [21] for *U, X* and *Y* consistent with Figure 1b:
where and are independent error terms for *U, X* and *Y* respectively. From this we can derive the approximate reduced form models for the *G*-*X* and *G*-*Y* associations for SNP *j* :

We use ‘approximate’ here because the error terms and not strictly constant or mutually independent - the *j*th residual error term in fact contains common contributions from all other genetic variants not equal to *j*. This approximation is very accurate in most settings because the genetic variants combined make a very small contribution to the total residual error in each model (e.g. typically of the order of 1-2%). For further justification see Zhao *et al.* [8]. Under this assumption the following models can then be justified for summary data estimates of the *G*-*X* and *G*-*Y* associations gleaned from fitting (1) and (2):

Here, *α*_{j} = *Δ*_{j} + *κ*_{y}*ψ*_{j}, and *γ*_{j} = *δ*_{j} + *κ*_{x}*ψ*_{j}. Model (3) is typically applied in the two-sample summary data setting. Under this design it is assumed that the first study provides the *G*-*X* associations and standard errors *σ*_{Xj}, and the second study provides the *G*-*Y* associations and standard errors *σ*_{Yj}. Both the standard errors are assumed to be fixed and known. The two-sample design implicitly assumes that SNP *j* has an identical association with the outcome in studies 1 and 2. Since the two studies are independent, it is also assumed that the uncertainty in is independent of the uncertainty in . For a detailed description of all the assumptions underlying the two-sample approach see Bowden *et al.* [11] and Zhao *et al.* [22].

The individual Wald ratio estimand for SNP *j* from model 3 is then

From this we see that:

– A SNP is invalid due to pleiotropy if

*α*_{j}≠ 0– A SNP is invalid due to InSIDE respecting pleiotropy if

*α*_{j}≠ 0 but*ψ*_{j}= 0– A SNP is invalid due to InSIDE violating pleiotropy if

*α*_{j}≠ 0 and*ψ*_{j}≠ 0.

InSIDE violation occurs in the last case because instrument strength and pleiotropic effects are functionally related due to a shared *ψ*_{j} component, so that the sample covariance . For the case of InSIDE respecting pleiotropy we are able to assume the sample covariance is approximately zero for a sufficient number of instruments, since *Δ*_{j} and *δ*_{j} are imagined to be themselves generated via independent processes [11]. In Appendix 1, we show, under the simplifying assumption that the SNP-outcome standard errors are approximately constant, that IVW estimator for the causal effect
as the sample size grows large. If all SNPs are pleiotropic, but satisfy the InSIDE assumption and have mean of zero , then numerator of the bias term is zero and the standard IVW estimate provides a reliable way of estimating *β*. MR-Egger regression is an extension of the method that can work under the InSIDE assumption even if , which is referred to as ‘directional’ pleiotropy. It does this by estimating an intercept parameter in addition to the causal slope parameter. However, this approach has several downsides; its estimates are generally very imprecise and it is not invariant to allele recoding [23]. Lastly, it can not separate non-zero mean pleiotropy satisfying the InSIDE assumption from zero mean pleiotropy violating the InSIDE assumption. Its intercept reflects the numerator of the bias term, which is a combination of both. This motivates the use of methods that can attempt to detect and down-weight a small number of variants that may be responsible for either InSIDE violation or directional pleiotropy so that, for the remainder of SNPs left, model (3) holds with only InSIDE respecting balanced pleiotropy remaining. This is the approach taken by Zhao *et al.* [8] and Verbanck *et al.* [18], and is the approach we will initially pursue using BESIDE-MR.

### 2.2 Bayesian Model Averaging over the summary data model

We are interested searching over the space of all possible models defined by each of the 2^{L} subsets in the entire summary data. Let *I* = (*I*_{1},*…, I*_{L}) be the *L*-length indicator vector denoting whether SNP *G*_{j} is included (*I*_{j} = 1) or not (*I*_{j} = 0) in the model. The model we want to ‘force’ our data to conform to is model (3) with the additional assumption that

The parameters of interest are then *θ* = (*β, τ*^{2}, *I*) and with data, *D*, that consists of and , and their standard errors *σ*_{Xj} and *σ*_{Y j} respectively. Then the joint posterior is
where *P*(*D*|*θ*) is the likelihood and *P*(*θ*) is user specified prior for each of the parameters.

As we rarely know the identity of the pleiotropic instruments, BMA offers the uncertainty about their identity and gives the inclusion posterior probability to quantify how likely an instrument is invalid. BMA achieves this by considering all possible models, where the models are different combinations of the potential set of genetic instruments and averages the resulting causal effect estimate with appropriate weights (in the form of posterior probability of how often the instrument is included during MCMC). The selection of instruments is conditional on the likelihood of the data and the given priors. The prior in Bayesian analysis reduces the effect of weak instrument bias, even with noninformative prior. However, we must advice against this, as reasons discussed by Thompson *et al.* [24]. This method is particularly attractive as it has also been found to reduce bias from many weak instruments in econometric [25, 26] and in one-sample MR with highly valid but highly correlated genetic instruments [27]. For a comprehensive tutorial of BMA see Hoeting *et al.* [19].

### 2.3 The profile score likelihood

For *P*(*D*|*θ*), we use the profile log-likelihood score derived by Zhao *et al.* [8]. Specifically this is the likelihood for (*β, τ*^{2}) given the data profiled over the parameters *γ*_{1}, *…, γ*_{L}. After the incorporation of our indicator vector *I*, the likelihood is modified to

This likelihood allows for heterogeneity due to pleiotropy via *τ* ^{2}, and weak instruments, via . Failure to account for weak instrument bias can lead to bias in the standard IVW estimate and inflation in related heterogeneity tests even under balanced pleiotropy [23]. Note that this likelihood is an increasing function of number of instruments included and decreasing function of *τ*^{2}. Hence, our BMA algorithm will naturally give more weight to *I*-vectors that include large sets of instruments with homogeneous causal effect estimates. This property is reminiscent of ZEro Modal Pleiotropy Assumption (ZEMPA) [16] or plurality rule that defines the two-stage hard thresholding (TSHT) approach of Guo *et al.* [28]. However, there is an important distinction. The TSHT approach explicitly aims to isolate the largest set of ‘valid’ instruments and base all inference on this single set, which is equivalent to giving a single *I*-vector a weight of 1 and all other vectors a weight of zero. Our approach is less aggressive, allowing as many distinct *I*-vectors as are supported by the data to be given weight in the analysis. This feature properly accounts for model uncertainty. Indeed, as subsequent simulations will demonstrate, this yields causal estimates and standard errors that are less prone to under-coverage than methods which incorporate instrument selection or penalization.

One such method of penalization, also proposed by Zhao *et al.* [8], is MR-RAPS. Instead of being based on likelihood function (5) which uses standard least squares or *L*_{2} loss plus the addition of our indicator function, it uses a robust *L*_{1} function such as Huber or Tukey loss. They enable the contribution of large outliers to be penalized (i.e. reduced) compared to *L*_{2} loss. Our BMA indicator function implemention of the standard profile likelihood can be viewed as an alternative way to achieving the robustness of MR-RAPS, because it will give more weight to model choices in which outliers are removed.

The profile likelihood is particularly well suited to a Bayesian implementation because it enables heterogeneity due to weak instrument bias and pleiotropy to be accounted for, whilst only having to update three parameters, (*β, τ* and *I*). Weak instrument bias is traditionally addressed using standard (non-profile) likelihood formulae (e.g. see Thompson *et al.* [24]), but this would require the posterior distribution of an additional *L* parameters (*γ*_{1},*…, γ*_{L}) to be estimated, and is far more computationally intensive.

### 2.4 Choice of priors

In general we encourage the construction of priors to be based on previous epidemiological study or biological knowledge. For the purpose of elucidating our approach, we will use priors that ensure efficient mixing and rapid convergence. For the causal effect parameter *β*, we use a zero centered normal prior *P*(*β*). For the pleiotropy variance we use a gamma prior *P*(*Prec*) for the precision, where *Prec* = 1*/τ*^{2}. For the indicator function prior, we will assume an uninformative Bernoulli prior *P*(*I*) with probability for all *I*_{j}.

### 2.5 Metropolis-Hastings algorithm

We use a random walk Metropolis-Hastings (M-H) algorithm for updating the model parameter values. Unlike Gibbs sampling, the M-H algorithm does not directly sample from the conditional posterior distribution, but instead requires a proposal distribution for each parameter. Let be the current *i*th value of the parameter vector *θ. θ*_{i} is updated to *θ*_{i+1} one parameter at a time, by simulating a candidate value *θ*^{∗} from proposal density, until it is accepted. Note that if the proposal density *C*() for a given parameter is ‘symmetric’ - that is if *C*(*θ*_{i}|*θ*_{i+1}) = *C*(*θ*_{i+1}|*θ*_{i}) then the proposal density can be omitted from the calculation of the acceptance probability. This is the case for *β* and *I*, but not *τ*^{2}. In Appendix 2, we give the specific details of the M-H algorithm.

### 2.6 An alternative implementation

It is well known that the estimation of *τ* ^{2} is challenging, even within a classical framework, see Zhao *et al.* [8] for further discussion. Therefore, we propose an alternative implementation of our M-H algorithm in which a plug-in estimate for *τ*^{2} is substituted at each iteration. For simplicity, we chose to use the closed-form DerSimonian-Laird estimate for *τ* ^{2} [29];
where
and respectively. Note that *I*_{j} should not be confused with Higgin’s *I*^{2} statistic used to quantify heterogeneity in Meta-analysis. In Appendix 2, we describe how the M-H algorithm is modified to implement this alternative approach. Hereafter, we will refer to the first method as the ‘full Bayesian’ approach and this latter method as the DerSimonian-Laird (DL) approach.

## 3 Monte Carlo simulation

### 3.1 Simulation strategy

In order to assess the performance of our BMA algorithm we simulate two-sample summary MR data sets with *L*=50 instruments from model (3). The parameters *γ*_{j} were generated from a Uniform *U*(0.34, 1.1) distribution, *σ*_{Xj} was generated from a Uniform *U*(0.06, *UB*) and *σ*_{Yj} was generated from a Uniform *U*(0.015, 0.11) distribution. The upper bound on the G-X association standard error *UB* was used to determine mean instrument strength - with 0.095 ≤ *UB* ≤1 giving mean F-statistics between 10 and 100 respectively. In this setting, the F-statistic for a single SNP can be approximated as .

We generated pleiotropic effects from a *N*(*µ*_{α}, *τ*^{2}) distribution, with the parameter *µ*_{α} being used to determine the mean bias induced by including the invalid instruments in the model. The task of our BMA algorithm in the presence of a non-zero *µ*_{α} is to give large weight to models which include SNPs for which *µ*_{α} ≈ 0. Apart from a potential non-zero mean bias, the simulated pleiotropic effects satisfy the InSIDE assumption.

#### Convergence

We test the convergence of our algorithm in different scenarios. From this we determined that it was necessary to use 50,000 iterations with 10,000 burn-ins for our algorithm to function effectively. Except for rare occasions, we removed results from data where number of iterations chosen was not sufficient for convergence. See details in Appendix 3 of the SM available at *Biostatistics* online.

#### A comparison with IVW, MR-APS and MR-RAPS

We first compare our approach with the standard IVW method, MR-APS and MR-RAPS. The latter two are the classical counterpart that our approach sits between. We monitor the following quantities across our simulations:

– Mean bias of the causal parameter estimate. For our BMA algorithm we use the mean of the posterior distribution of

*β*to assess this;– Coverage: For IVW, MR-APS and MR-RAPS this is based on 95% symmetric confidence intervals assuming normality. For BESIDE-MR this is based on a 95% credibility interval;

– The inclusion probability for each SNP (BESIDE-MR only).

Four scenarios are considered; (1) balanced pleiotropy with strong instruments, (2) balanced pleiotropy with weak instruments, (3) directional pleiotropy with strong instruments and (4) directional pleiotropy with weak instruments (see Table 1 for a summary). Within each scenario, four sub-scenarios are considered where 0% to 100% of the SNPs are simulated as invalid/pleiotropic instruments. In order capture the amount of heterogeneity present in the data, we report the exact *Q*-statistic [9]:

Note that only invalid SNPs which have a non-zero pleiotropic effect make a non-nominal contribution, so that, for a fixed set of pleiotropy parameters *α*_{1},*…, α*_{L}:

Table 2 shows the results.

### 3.2 Results

#### bias and coverage

Under scenario 1, all methods deliver approximately unbiased estimates. The IVW, MR-APS and MR-RAPS estimators achieve nominal coverage when there are no pleiotropic instruments. However, as the proportion of pleiotropic instruments (and hence the heterogeneity) increases, their coverages can drop substantially, with the MR-APS and MR-RAPS estimators most affected. BESIDE-MR approach has conservative coverage under no heterogeneity but maintains far better coverage when this increases. Scenario 2 is the same as Scenario 1, except the SNPs are now weaker. The general pattern is the same, except the coverage of IVW is lower and its estimate is negatively biased. This is as expected because it uses inverse variance weights that assume for all SNPs [9]. The results remains the same with further simulation of many weak instruments (*L*=100), see Appendix 3.

In scenarios 3 and 4, all the approaches deteriorate with increasing number of invalid instruments, but BMA has consistently the least bias and best coverage throughout. In Scenario 4, the IVW estimator is least biased, due to weak instrument bias cancelling out some of the pleiotropic bias. With 40% and 60% invalid instruments, full Bayesian BESIDE-MR struggled to converge within 50,000 iterations in a small number of cases.

#### SNP inclusion

Figure 2a shows the difference between the mean probability of including valid (non-pleiotropic) SNPs and the mean probability of including invalid (pleiotropic) SNPs in our BMA likelihood for Scenarios 1 and 3. This difference is zero when there are no invalid instruments. Under Scenario 1 this difference is maximised (i.e. we get the best discrimination) when there are 20% invalid instruments, this difference steadily decreases to half its value as the number of invalid instruments increases further. Under Scenario 3 we see a smaller and more constant difference across different proportions of invalid instruments, indicating that the BMA likelihood generally struggles to deal with directional pleiotropy. There is still difference in inclusion posterior probability between valid and invalid instruments, however not in the same magnitude for weaker and many weak instruments, see Appendix 3.

Additional simulations were performed to investigate the effect of different patterns of heterogeneity, but a fixed and borderline amount total heterogeneity, on the difference in inclusion probabilities of valid and invalid SNPs. Specifically, we generate summary data under balanced pleiotropy with *Q* fixed at 66, but varying the source of of this heterogeneity from 25 weakly pleiotropic SNPs (with individual *Q* contributions ≈ 3), to 11 strongly pleiotropic SNPs (with individual *Q* contributions ≈ 6). Figure 2b shows the results. As expected, the discrimination is best with small numbers of highly pleiotropic SNPs, and the worst with large numbers of weakly pleiotropic SNPs. However, the algorithm maintains its reliability in even in this case. For further results see Appendix 3.

## 4 An extended two parameter BMA model for substantial InSIDE violation

In Section 2.2 we introduced the BMA framework for summary data MR and applied it directly using the MR-APS likelihood of [8]. This method assumed that most SNPs were valid, but a small proportion could be invalid and directionally pleiotropic under the InSIDE assumption. The simulations in Section 3 showed that it performed well when this was the case, but like all other approaches, it suffered when a large number of SNPs were directionally pleiotropic, thus inducing both heterogeneity and bias into the results. We now consider the use of an extended BMA model to account for the extreme case where large numbers of SNPs may be invalid and in addition, violate the InSIDE assumption (Figure 1b).

Using the same underlying data generating model 3, suppose that we have two different groups of invalid instruments: in the first group, *S*_{1} we have *ψ*_{j} = 0 for all SNPs and , shown in Section 3.1. That is, the SNPs in *S*_{1} exhibit balanced pleiotropy under the InSIDE assumption. For illustrative purposes, suppose now that the remaining instruments are in a set *S*_{2}, defined by *δ*_{j} = 0, *Δ*_{j} = 0 and *κ*_{x} = *κ*_{y} = 1, but *ψ*_{j} have Uniform *U*(0.34, 1.1) distribution. This means that that *α*_{j} = *γ*_{j} = *ψ*_{j}, so that the InSIDE assumption is perfectly violated. Using the bias formulae in equation (4), it follows that
where *β*^{∗} = *β* + 1. The set of SNPs in *S*_{2} therefore identify a distinct, biased version of the causal effect. In the general case where the SNPs could be classified into an InSIDE-respecting set and an InSIDE-violating set, it would be more reasonable to assume that *α*_{j}, *γ*_{j} and *Δ*_{j} could all be non-zero. Although InSIDE would not then be maximally violated in *S*_{2} we would still see two clusters in the data, albeit with a less defined separation. This motivates the development of an extended two-parameter version of our BMA algorithm to look for evidence of two clusters or slopes in the data.

### 4.1 A modified BMA algorithm

Under the generating model in (3), we further assume that the pleiotropic effects for valid SNPs in *S*_{1} are generated from a distribution and the set of invalid SNPs in *S*_{2} are generated from a distribution. Allowing these SNPs to violate InSIDE, and therefore identify a different slope parameter, our total parameter space is modified to , with likelihood:

Where the indicator functions *I*_{1j} and *I*_{2j} denote whether a SNP *j* is included in *S*_{1} or *S*_{2}. We impose the condition that *I*_{1j} + *I*_{2j} ≤ 1, which means that, at a given iteration of our BMA algorithm a SNP is either

– In

*S*_{1}: (*I*_{1j}= 1,*I*_{2j}= 0)– In

*S*_{2}: (*I*_{1j}= 0,*I*_{2j}= 1)– In neither

*S*_{1}or*S*_{2}: (*I*_{1j}=*I*_{2j}= 0)

This gives the model the flexibility to assign a SNP to either set, or remove it from the model completely. In Appendix 4, we give further details on the M-H algorithm to update the parameter space of this extended model.

As with our one-parameter causal model, we propose a simplified implementation of the two-parameter model in which the variance component parameters and are replaced with plug-in data derived estimates using the DL formula.

### 4.2 Simulation study

We conduct a simulation study to test the ability of our new two-parameter model to correctly estimate causal effects allowing for InSIDE violation. Two-sample summary data are simulated with 50 SNPs under balanced pleiotropy but with progressively larger proportion of SNPs maximally violating the InSIDE assumption. This changes the proportion of SNPs that are in set *S*_{1} and *S*_{2}. These data are simulated under a strong instrument scenario (, Scenario 5) and a weak instrument scenario (, Scenario 6). For precise details of the simulation parameters see Table 3. We also explore the performance of our two-parameter model under balanced pleiotropy with weak and strong instruments when there is no InSIDE violation. That is, under Scenario 1 and 2. This means that all SNPs are effectively in set *S*_{1} the data can be described with a single causal slope parameter, not two. The full results are shown in Table 4 where we report the bias, root mean squared error, coverage and mean Q statistic of all approaches across 1000 simulations.

### 4.3 Results

For data generated under Scenario 1 and 2, and so in the complete absence of InSIDE-violating SNPs in set *S*_{2}, our two slope model correctly identifies *β* and does not try to estimate a second effect, i.e. *β*_{1} = *β*_{2}.

When the data are generated under Scenario 5 we see that, when *S*_{1} and *S*_{2} have a similar number of instruments, both *β*_{1} and *β*_{2} can be estimated by the DL implementation of our two-parameter model. If the proportion of SNPs in either set is too small, then our algorithm tends to remove them completely and focus on estimating just one slope.

The full Bayesian implementation of our two slope model is shown to be more challenging to fit. It returns mean posterior estimates that are median unbiased but not mean unbiased. This demonstrates a lack of convergence, and indicates that longer iterations and a more sophisticated procedure for deciding on the tuning parameter may be required to properly fit the model.

When the data are generated with weaker instruments (Scenario 6), we see degrading in the performance of all approaches. In particular, see that the effect is worst for *β*_{2}. This is because, in our specific simulation, *β*_{2} is larger in magnitude than *β*_{1}, which increases both the heterogeneity (as measured by *Q*) and the absolute magnitude of weak instrument bias relative to that experienced when estimating *β*_{1}. This in turn effected the coverage of the estimates. We further reduced the strength of the instruments, to have mean F-statistics of 10, our approach no longer assigns instruments to either set as the heterogeneity from the *β*_{2} is too large to be pooled to a particular effect estimate, see Appendix 5.

When applying the full Bayesian approach in Scenario 6, we noticed an important feature most prominent when there was a large imbalance in the relative sizes of *S*_{1} and *S*_{2}. In this case, the M-H algorithm can switch from estimating the posterior for *β*_{1} to estimating the posterior for *β*_{2}. This problem is referred to as “label switching” [30]. In our applied analysis in Section 5, we discuss this issue in more detail, and our proposal for addressing it.

Figure 3a (left and right) gives further insight into how well the DL and full Bayesian implementations can correctly partition the SNPs into their correct sets. The x-axis shows the true ratio of SNPs in *S*_{1} and *S*_{2} and the y-axis shows the difference in the mean probability of each SNP being assigned to *S*_{1} and *S*_{2}. For the DL implementation, this probability is reassuringly high when the ratio is between 4:1 or 1:4, and is maximised when the ratio is 1:1. By contrast, the difference in mean inclusion probabilities for the full Bayesian approach are much more constant across all ratios and are also consistently lower.

## 5 Applied example: Age-related macular degeneration and cholesterol

Age-related macular degeneration (AMD) is a painless eye-disease that eventually leads to vision loss. Recent GWAS have identified several rare and common variants located in gene regions that are associated with lipid levels [31], fuelling speculation as to whether the relationship was causal [32, 33]. To this end, a multivariable MR analysis provided evidence to support a causal relationship between AMD and HDL cholesterol but not with LDL cholesterol and triglycerides [34]. In follow up work, Zuber *et al.* [35] fitted a multivariable MR model using Bayesian model averaging, with a total of 30 separate lipid fraction metabolites acting as the intermediate exposures. Out of the 30, large particle HDL cholesterol (XL.HDL.C) had the highest inclusion posterior probability as a risk factor for AMD.

Although multivariable MR approaches can remove bias due to pleiotropy via known pleiotropic pathways (in this case, other lipid fractions), they can be much more challenging to fit, especially when the correlation between the included exposures is high. For this reason we now revisit this data and use our univariate MR approaches to probe the causal relationship between XL.HDL.C and AMD.

We selected 54 genetic variants from Kettunen *et al.* [36] as instruments, due to having individual F-statistics for their association with XL.HDL.C greater than 2. Across all instruments this gave a mean F-statistic of 10. The summary scatter plot for these data is shown in Figure 4. The results for our various data analyses are given in Table 5.

When one-parameter causal models are fitted to the data, all methods estimate a positive causal association, with full Bayesian BESIDE-MR and the MR-RAPS estimators giving the largest effects and the IVW method giving the smallest effect. This is not surprising because the IVW estimate is known to be vulnerable to weak instrument bias. Figure 5 shows the inclusion probability for each instrument, using our two implementations. The DL approach is seen to more aggressively select or de-select instruments than the full Bayesian approach.

Next, we fit our two-parameter causal model, which offers robustness to a substantial portion of the SNPs violating the InSIDE assumption. Interestingly, we see that this estimates two distinct causal effects of opposite sign (Table 5). For the DL approach, approximately 17 SNPs have a strong inclusion probability (*>* 0.8) to each of the 2 clusters, with 1 SNP (rs103294) belonging to neither, (see Figure 6). For the full Bayesian approach, 12 instruments have a strong inclusion probability in the set identifying a positive relationship and only 2 instruments for the negative relationship (hence 0 is within the credible interval for this smaller set). SNP rs103294 also have low probability of being in either set. (Figure 7).

Our tentative conclusion here is that a small proportion of InSIDE-violating SNPs act to reduce the apparent causal effect of XL.HDL.C on AMD detectable by a one-parameter model. Once this set has been accounted for within a two-parameter model, this increases the evidence in favour of a causal role of XL.HDL.C on AMD further. Our results are consistent with Zuber *et al.* [35] who also found subsets of SNPs which suggested qualitatively different conclusions about the causal role of XL.HDL.C on AMD.

### 5.1 Detecting and adjusting for label switching in the full Bayesian model

The trace plots in Figure 8a and 8b show that the DL implementation consistently identifies two separate distributions for *β*_{1} and *β*_{2}, which are centered around 0.89 and -0.66 respectively. This is not the case, however, under the full Bayesian implementation. Trace plots 8c and 8d show that the chains for *β*_{1} and *β*_{2} jumping between two distinct values pattern. This is commonly known as ‘label switching’. It has been recommended that, instead of adjusting the MCMC algorithm itself, one can simply re-allocate iteration labels from the output [30] instead. To this end we performed a K-means clustering analysis [37] on the MCMC output. Before K-means correction, the mean posterior distribution of *β*_{1} and *β*_{2} gave 0.21 and 0.16 respectively. K-means analysis clustered 214,882 iterations centred at 0.88 and the second cluster contains 185,119 iterations with mean of -0.56. We re-assigned the estimates (to *β*_{1} and *β*_{2}) accordingly (see Figure 8e) which gave new posterior distribution with mean and credible interval shown in Table 5. This issue further emphasizes the importance of carefully implementing the fully Bayesian approach, and for checking MCMC output for convergence issue.

## 6 Discussion

In this paper we propose a Bayesian Model Averaging approach that offers robustness pleiotropy and weak instruments. Our approach can be viewed as a Bayesian extension of the classical MR-RAPS approach, but with two advantages. The first is that SNPs deemed to be pleiotropic can be effectively down-weighted, without dramatically affecting the coverage of the causal estimate. The second is robustness to large proportion of SNPs violating the InSIDE assumption. Rather than assuming the InSIDE violating SNPs are small in number and can be effectively penalized in the analysis, they can instead include be included using our two-parameter model. We were able to demonstrate the potential utility of this extended model in our applied example to uncover sub-signals in the data that would be missed by conventional methods. We explored two implementations of BESIDE-MR, namely the full Bayesian and the simplified DL implementation. Our simulations showed that the DL implementation generally performed well, and led to a more decisive selection of SNPs as either in or out of the model (or into set *S*_{1} or *S*_{2} in the two slope case) than the full Bayesian approach. It was also much more straightforward to fit and achieve convergence. Despite the fully Bayesian implementation requiring more computational time and careful consideration of the MCMC output, it is far better at detecting small effects and consistently identifying outlying instruments. We will attempt to improve the reliability of the full Bayesian approach. One aspect of this will be to create a label switching algorithm [38] for the output from full Bayesian model. Another would be to specify a more sophisticated procedure for optimising the tuning parameter for each model parameter separately. In the meantime, we urge users of the full Bayesian approach to manually adapt the tuning parameters and carefully monitor the mixing and convergence of the MCMC chains (especially for the latter approach), which are essential aspects of the analysis. As seen in Appendix 3, diagnostic tools such as performing multiple chains with different initial values and trace plots can be used in this regard. For a comprehensive tutorial see Albert [39] and Lunn *et al.* [40]. For our two-parameter model, it is also important to note that in the presence of weak instruments, the results from our approach must be interpreted with care and even more so when most instruments aren’t assigned to either clusters.

A useful additional output from our BMA approach compared to classical approaches is the inclusion probability for each SNP. This of course necessitates the specification of a prior probability of inclusion, which we fixed at a constant value of . Ideally, one should use informative priors where possible. Indeed, there are multiple sources of external information, e.g. epigenetic databases and bioinformatic webtools that could be used to achieve this. For example, a genetic variant that is located in a protein coding gene that is relevant to the pathway between exposure and outcome of interest can be given a higher inclusion prior probability. Conversely, we might give a much lower inclusion prior probability if the variant is located in a gene that is expressed in multiple tissues. This is again a topic for future research.

For individual-level data MR analyses, Berzuini *et al.* [41] have recently suggested a Bayesian approach that uses a horseshoe shrinkage prior on the possible pleiotropic effect of each instrument. When the pleiotropic effect is small, it is shrunk toward zero, thereby increasing the instrument’s influence on the causal effect estimate. Their simulation showed their prior is reasonably robust to directional pleiotropy. Our approach is currently not robust to pure directional pleiotropy, although it is robust to apparent directional pleiotropy caused by violation of the InSIDE. As future work we intend to explore extending our approach to model pure directional pleiotropy using modifications to MR Egger regression Bowden *et al.* [13]. However, for the aforementioned reasons, the resulting estimator is likely to be imprecise and useful only in a limited number of circumstances.

We introduced our two-parameter BMA model for a univariate MR analysis. Zuber *et al.* [35] have already proposed a BMA implementation of multivariable MR [34, 42], which includes an algorithm for selecting both SNPs and exposure traits. Our model can in principle be extended to multivariable MR too. For a model with 10 exposure traits, this would necessitate the estimation of 20 causal parameters to account for InSIDE violation via unmeasured pathways. This is another topic for future research.