Sharpening Occam’s Razor: Using Bayesian Model Averaging in R to Separate the Wheat from the Chaff
Link to the last RSS article here: Bayesian Generalized Linear Models in R -- Ed.
By Dr. Jon Starkweather, Research and Statistical Support Consultant
Bayesian Model Averaging (BMA) is a method of variable selection which quantifies the value of multiple models so that the analyst can select the most appropriate model for a given outcome variable. The metrics used for comparison of competing models are the Bayesian Information Criterion (BIC; Schwarz, 1978) and the posterior probability (of a particular model being the correct model). The best model, displays the lowest BIC (e.g. a BIC of -121.00 is preferred over a BIC of 21.00) and the highest posterior probability. In the simplest situation (linear regression), each model is characterized by a group of predictors for the outcome variable. When BMA is applied to all available predictors, and given an outcome variable of interest, it produces a posterior distribution of the outcome variable which is a weighted average of the posterior distributions of the outcome for each likely model (Raftery, Painter, & Volinsky, 2005). Essentially, BMA is used to determine which predictors should be included in a regression model or general linear model (GLM), or extensions of the GLM (e.g. generalized linear models and survival or event history analysis). BMA is particularly useful when a large number of proposed predictors have been measured (e.g. 20, 30, or 40).
BMA is accomplished in the R programming language environment using the BMA package (Raftery, Hoeting, Volinsky, Painter, & Yeung, 2010). The function bicreg is used in the regression situation while the bic.glm function is used in the GLM and generalized linear modeling situations. The bic.surv function is used for survival or event history analysis; which will not be covered in this article. These functions “do an exhaustive search over the model space using the fast leaps and bounds algorithm” (Raftery, et al., 2005, p. 2). The leaps and bounds algorithm (Furnival & Wilson, 1974) allows these functions to return a set of the best models rather than all possible models.
The first example involves a fictional data set which contains the outcome variable extroversion (extro) and 12 possible predictors; openness (open), agreeableness (agree), social engagement (social), cognitive engagement (cognitive), physical engagement (physical), cultural engagement (cultural), vocabulary (vocab), abstruse analogies (abstruse), block design (block), common analogies (common), letter sets (sets), and letter series (series). All 13 variables are assumed to be interval scaled. There are 750 cases in the data set, with no missing values.
First, import the data from the web using the foreign package, because the data file is in SPSS format. Then get a summary of the data, if desired, using the summary function.
Next, load the BMA package which contains the functions necessary for Bayesian Model Averaging. Note that there are three dependencies.
The bicreg function is used in the linear regression situation. However, the function requires a matrix of the possible predictor variables, so we must first create such a matrix. Using the attach function allows us to reference the variables by name directly (as opposed to using the tedious $ operator, e.g. data.1$open). The head function simply lists the first 6 elements of an object.
Now we can submit the bma function by simply assigning it to a named object (here: bma1) and supplying it with the matrix of predictors and the outcome variable (data.1$extro). We can use the common summary function to summarize the results of the bicreg function.
The column “p!=0” indicates the probability that the coefficient for a given predictor is NOT zero, among the 25 models returned. The column “EV” displays the BMA posterior distribution mean for each coefficient and the column “SD” displays the BMA posterior distribution standard deviation for each coefficient. Only the five best models are displayed. We can see that the first model “model 1” (which includes only open, agree, & series) is the best because it has the lowest BIC and the largest posterior probability (of being the correct model). Notice, at the bottom of each model column, the number of predictors and R² value is displayed. Generally, the first model (Model 1) is the best model; however, it may be the case that theory dictates the inclusion of some variables which were excluded by the first model. For each variable included in a given model, the coefficient (or parameter value) for that variable is given (e.g. Model 1, open coefficient = 0.37028). Remember that the substantive interpretation of each coefficient is, for instance: for a one unit change in open (predictor), there would be a corresponding change of 0.37028 in extro (outcome), based on Model 1.
The Ordinary Least Squares (OLS) part of the output (not printed by default) gives a matrix, with each model as a row and each predictor as a column; listing the estimated (OLS) coefficient for each variable in a given model (of all 25 models returned). The OLS output can be accessed using the $ naming convention (e.g. bma1$ols). The output below has been cut off at the right edge to save space in this article.
Notice, both open and agree display fairly stable estimated coefficient values across all 25 models, this is why they both have a “p!=0” value of 100% (indicating that their coefficient is NOT zero 100% of the time among these models).
The standard errors for the above estimated coefficients can be retrieved using the se argument (e.g. bma1$se). Again, the output below has been cut off at the right edge to save space in this article.
The postmean part of the output (printed with summary in the “EV” column) contains the average posterior coefficient for each predictor and the postsd provides the standard deviation of each average posterior coefficient.
The which part of the output (not provided with the summary) contains a matrix, with each model as a row and each predictor variable as a column; listing (TRUE or FALSE) whether a variable was included in each model.
The BMA package also contains a plot function for displaying the posterior distributions of each coefficient; in this example the density plots are displayed in 5 rows and 3 columns.
Notice, among the density plots, each variable which is of little importance contains a spike at 0.0. These are the variables which are least influential to the outcome variable (e.g. social); their coefficients are centered on, and most likely are, zero.
For a complete description of the bicreg function simply type help(bicreg) in the R console once the BMA package is loaded.
We can use the same data and predictors from above to illustrate the application of BMA to a GLM situation using the bic.glm function. The bic.glm function can accept a matrix of predictors and the outcome variable (as above with the bicreg function), or the formula can be specified directly (e.g. extro ~ open + agree + social + cognitive … series). The bic.glm function also accepts the glm.family argument to specify non-Gaussian models (e.g. Poisson, binomial, etc.).
Notice when using bic.glm and specifying “Gaussian” the estimation of the posterior means and standard deviations are slightly different from what was observed with the bicreg function. Below the means and standard deviations from the bicreg and bic.glm functions are displayed; bma1 and bma2 respectively.
The plot function also works with bic.glm objects; here displaying the density plots in 4 rows and 3 columns.
Notice again, the variables which do not contribute to the outcome variable are shown with spikes at zero in their density plots; indicating that their coefficients are most likely zero.
For a complete description of the bic.glm function simply type help(bic.glm) in the R console once the BMA package is loaded.
Binomial Generalized Linear Model Example
The binomial generalized linear model is the logistic (logit) model. The bic.glm function is used, simply specifying the binomial glm.family argument as would be done with the standard glm function.
This example uses a simulated data set which contains one binary outcome variable (y = 0 or 1) and four interval predictor variables (x1, x2, x3, x4). There are 400 cases of data with no missing values. The data also contains a code variable which simply identifies each case. The data file is a space delimited text (.txt) file, so the foreign package is not necessary for importing it into R.
As mentioned above, when using the bic.glm function, one can either create a matrix of predictor variables or simply specify the formula directly. Above we used the matrix approach; here we will specify the formula directly. Of course, here we also specify the glm.family as binomial.
Although the summary of the bic.glm object here is interpreted the same way as the previous two examples (in terms of model/variable importance using BIC and posterior probability), it is important to remember that the coefficients for each predictor here (binomial setting) are NOT interpreted in the same way as they would be in the Gaussian setting(s).
Remember, when interpreting coefficients in a logistic (binomial) setting, the values are interpreted as changes in the logit. The logistic coefficient is the expected amount of change in the logit for each one unit change in the predictor. The logit is what is being predicted; it is the odds of membership in the category of the outcome variable with the numerically higher value (here a 1, rather than 0). The closer a logistic coefficient is to zero, the less influence it has in predicting the logit.
The plot function works the same with way with binomial models as it did with the above models.
The plots simply confirm what was expressed in the summary function, x1 has virtually nothing to contribute to y and x4 has a moderate influence on y.
For a complete description of the different families available to the glm function (and the bic.glm function), type help(“family”) in the R console.
Keep in mind, there are other packages available for conducting BMA in R. Perhaps most notable, is the mlogitBMA package which offers extensions to the bic.glm function so that BMA can be applied in the multinomial logistic situation. Other packages which incorporate BMA include: BAS, BMS, and ensembleBMA.
An Adobe.pdf version of this article can be found here.
References / Resources
Brown, P. J., & Vannucci, T. F. (2002). Bayes model averaging with selection of regressors. Journal of the Royal Statistical Society (Series B: Statistical Methodology), 64(3), 519 -- 536.
Furnival, G. M., & Wilson, R. W. (1974). Regression by leaps and bounds. Technometrics, 16(4), 499 -- 511.
Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 14(4), 382 -- 401.
Kass, R. E., & Raftery, A. E. (1995). Bayes Factors. Journal of the American Statistical Association, 90(430), 773 -- 795.
Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111 -- 163.
Raftery, A. E. (1996). Approximate Bayes factors and accounting for model uncertainty in generalized linear models. Biometrika, 83(2), 251 -- 266.
Raftery, A. E., Painter, I. S., & Volinsky, C. T. (2005). BMA: An R package for Bayesian Model Averaging. R News, 5(2), 2 - 8. Available at: http://journal.r-project.org/archive.html
Raftery, A. E., Hoeting, J., Volinsky, C. Painter, I., & Yeung, K. Y. (2010). Package ‘BMA’. Available at: http://cran.r-project.org/web/packages/BMA/index.html
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461 -- 464.
Wang, D., Zhang, W., & Bakhai, A. (2004). Comparison of Bayesian model averaging and stepwise methods for model selection in logistic regression. Statistics in Medicine, 23, 3451 -- 3467.
Until next time, It ain’t me, it ain’t me; I ain’t no Senator’s son…