Bayesian Generalized Linear Models in R
Link to the last RSS article here: Linear Mixed Effects Modeling using R -- Ed.
By Dr. Jon Starkweather, Research and Statistical Support Consultant
Bayesian statistical analysis has benefited from the explosion of cheap and powerful desktop computing over the last two decades or so. Bayesian techniques can now be applied to complex modeling problems where they could not have been applied previously. It seems likely that the Bayesian perspective will continue to challenge, and perhaps sub-plant, traditional frequentist statistical methods which have dominated many disciplines of science for so long. The current article will review one function which allows the user to conduct linear regression, general linear modeling, and generalized linear modeling (i.e. non-Gaussian; e.g., Poisson, binomial, etc.). A fairly simple model is specified, then modeled using traditional techniques, and then modeled with a Bayesian approach. Do not implement these methods unless you understand the core principles of the Bayesian perspective (i.e. priors, likelihoods, posteriors, etc., and all they entail).
A complete treatment of the Bayesian perspective is beyond the scope of this article and could fill several books; and has. Interested readers can consult a number of introductory texts focusing on the Bayesian perspective (e.g., Berry, 1996; Bolstad, 2004; Gelman, Carlin, Stern, & Rubin, 2004; Hoff, 2009). Very generally speaking, the Bayesian approach to statistical inference differs from traditional frequentist inference by assuming that the data are fixed and model parameters are random, which sets up problems in the form of; what is the probability of a hypothesis (or parameter), given the data at hand? These types of problems can be stated with symbols as: p(H|D). Traditional frequentist inference assumes that the model parameters are fixed (though unknown) and the data are essentially random; for instance, if the null hypothesis is true, what is the probability of this data? These types of problems can be stated in the general form; what is the probability of the data given a hypothesis? In symbols, this translates to: p(D|H).
Bayesian methods focus on five essential elements. First, the incorporation of prior information (e.g., expert opinion, a thorough literature review of the same or similar variables, and/or prior data). Prior information is generally specified quantitatively in the form of a distribution (e.g., normal/Gaussian, Poisson, binomial, etc.) and represents a probability distribution for a coefficient; meaning, the distribution of probable values for a coefficient we are attempting to model (e.g., β weight). It may help to think of the prior as an educated best guess. Second, the prior is combined with a likelihood function. The likelihood function represents the data (i.e. what is the distribution of the estimate produced by the data). Third, the combination of the prior with the likelihood function results in the creation of a posterior distribution of coefficient values. Fourth, simulates are drawn from the posterior distribution to create an empirical distribution of likely values for the population parameter. Fifth, basic statistics are used to summarize the empirical distribution of simulates from the posterior. The mode (or median or mean) of this empirical distribution represents the maximum likelihood estimate of the true coefficient's population value (i.e. population parameter) and credible intervals can capture the true population value with probability attached.
Keep in mind; priors should be rationally and honestly derived. They can be weak or strong. These terms refer to the strength of belief we have in the prior(s). Weak priors result when we do not have a great deal of evidence or prior information on which to base the prior(s). When the prior is weak, the prior distribution will be wide, reflecting a great many possible values and the likelihood will be more influential in creating the posterior distribution. Strong priors, conversely, result when we have a great deal of evidence on which to base the prior(s). When the prior is strong, the prior distribution will be narrow, reflecting a smaller range of possible values and the likelihood will be less influential in creating the posterior (strong priors will influence the posterior more than the likelihood). It should be clear the one key feature of the prior is the ability to quantify our uncertainty. The posterior can be thought of as a compromise between the prior and the likelihood. If the prior is weak, then it will be less influential in creating the posterior; if the prior is strong, then it will be more influential in creating the posterior.
The example used here is a simple linear regression model with one interval/ratio outcome (extro) and three interval/ratio predictors (open, agree, social). The simulated data set contains 500 cases, each with complete data (i.e. no missing values).
Import the data from the web, get a summary of the data, and take a look at the correlations. We see very little multicollinearity here.
Confirm / take a look at the core Linear Model (lm) -- traditional Ordinary Least Squares (OLS) regression.
Notice in the output, the intercept is approximately -5.0. The unstandardized coefficients for each predictor are listed. The open coefficient is approximately 1.2, the agree coefficient is .90, the social coefficient is approximately 0.40. We could calculate a 95% confidence interval (CI95) for each predictor's coefficient; for instance, the CI95 for open is 1.187 to 1.212. But what does this really tell us? Well, it is interpreted as: if an infinite number of samples were taken from this population, 95% of the open coefficient values would be between 1.187 and 1.212. But it does not tell us the range which contains the true population value.
The same results are below; but, the results below were generated with the Generalized Linear Model (glm) function, specifying the default Gaussian (normal) family distribution. The primary benefit of the `glm’ function is the ability to specify error distributions other than normal.
To conduct the Bayesian GLM, load the package `arm’ which contains the `bayesglm` function (Gelman, et al., 2010). You will notice there are several dependencies.
Conduct the Bayesian Generalized linear model (here family = Gaussian) and get the summary of the output. Notice the specification of the prior mean, scale, and degrees of freedom. Each `family' of distributions requires specific prior specifications (e.g. a binomial distribution would have slightly different prior specification; see the package documentation for details or simply type `help(bayesglm)’ in the R console).
We can see the output matches up with the traditional linear model (OLS regression) as well as the traditional GLM. As sample sizes increase the results should converge to the same values.
One of the benefits of the Bayesian perspective (for any analysis) is that it allows us to make credible interval statements. Credible intervals are similar to confidence intervals, but in the Bayesian framework, the interval REALLY IS believed to contain the true population parameter. For instance: a 95% credible interval for a parameter being estimated is interpreted as; there is a 95% probability that the actual parameter is included in that interval. This is because the interval is based on information from the posterior distribution; of for instance, one of the predictor's coefficient posterior distribution (e.g. the open variable's coefficient posterior distribution).
The `bayesglm’ function represents a kind of short cut of the Bayesian approach to inference. Typically, the posterior is not used directly for making inferences. Instead, an empirical distribution is constructed based on draws from the posterior and that empirical distribution is what informs the inference(s). Here, we are using the `bayesglm’ as a proxy for doing the added empirical distribution. With the `bayesglm’ we get a distribution of `simulates’ which are used in place of an actual empirical distribution (which will be covered further below).
Retrieve the posterior distributions of the coefficients for the intercept and all three predictors. The `head’ function simply lists the first 10 rows of the object on which it is run (the default `head’ is the first 6).
Extract just the posterior distribution of the 'open’ variable's coefficient. Again, the `head’ function simply lists the first 10 items of the object.
Take a look at the posterior distribution of the open variable's coefficient (normally a histogram would not be used, it is used here simply as a graphical reference).
The `density’ plot is normally used to display a posterior. The density plot can be thought of as a highly modified histogram. Imagine a histogram with 100 bins (instead of the 7 as displayed in the histogram above), then imagine plotting a line from the x-axis through each bin's midpoint at the top of each bin, and then back down to the x-axis; the result would be the density plot.
Now we can retrieve the 95% credible interval for the open variable's coefficient.
Recall, this credible interval is interpreted as; there is a 95% probability that the true population value of the open coefficient is between 1.186 and 2.210. Keep in mind, these numbers will fluctuate slightly based on the iterative nature of the function.
To make truly Bayesian inferences about our coefficients, we need to do the extra step of creating the empirical distribution(s) mentioned above. Going further entails actually creating an empirical distribution based on iterative draws from the posterior. The `MCMCregress’ function in the package `MCMCpack’ (Martin, Quinn, \& Park, 2010) provides us with the Markov Chain Monte Carlo simulation method of creating the empirical distribution; which itself allows us to then compute the descriptive statistics used for inference. Meaning, the mode, median, or mean of the empirical MCMC simulates' distribution is the 'maximum likelihood' estimate (i.e. top of a density function) of the population parameter. The `MCMCregress’ function also gives us the credible interval which includes the actual population parameter value. MCMCpack also contains functions for many other types of models and contains other ancillary functions for working with MCMC objects.
First, load the `MCMCpack’ library.
Next, apply the `MCMCregress’ function. Notice, the model formula is the same, but here we have some new options. The 'burnin’ argument is used because MCMC iterates are sensitive to their initial start values, so the first few (i.e. 3000) iterations are discarded. The 'mcmc’ simply issues how many (post-burnin) iterations will be used to build the empirical distribution. The 'thin’ defaults to 1 and represents a control on convergence, such that once approximate convergence has been reached it can be beneficial to keep only a few simulates and discard the rest to conserve computer resources (Gelman, Carlin, Stern, & Rubin, 2004). The verbose option (by default is off) simply does or does not print the iteration history as the function runs. The seed argument simply allows the user to set the random number generator seed. The 'beta.start’ argument allows the user to set a start value for the beta vector.
Notice in the summary, we get the coefficient estimates (“1. Empirical...”) and credible intervals (“2. Quantiles...”). So, we can say there is a 95% probability that the true population value of the open coefficient is between 1.1869 and 2.2120.
An Adobe.pdf copy of this article can be found here.
References & Resources
Albert, J. (2007). Bayesian Computation with R. New York: Springer Science+Business Media, LLC.
Berry, D. A. (1996). Statistics: A Bayesian perspective. Belmont, CA: Wadsworth Publishing Company.
Bolker, B. M. (2008). Ecological Models and Data in R. Princeton, NJ: Princeton University Press.
Bolstad, W. M. (2004). Introduction to Bayesian statistics. Hoboken, NJ: John Wiley & Sons, Inc.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian Data Analysis (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC.
Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y. (2009). A Weakly Informative Default Prior Distribution For Logistic And Other Regression Models. The Annals of Applied Statistics, 2(4), 1360-1383.
Gelman, A., Su, Y., Yajima, M., Hill, J., Pittau, M. G., Kerman, J., & Zheng, T. (2010). Package 'arm'. Available at: http://cran.r-project.org/web/packages/arm
Hoff, P. D. (2009). A First Course in Bayesian Statistical Methods. New York: Springer Science+Business Media, LLC.
Martin, A. D., Quinn, K. M., & Park, J. H. (2010). Package 'MCMCpack'. Available at: http://cran.r-project.org/web/packages/MCMCpack/MCMCpack.pdf
Until next time, ``Stop; hey, what's that sound...''