An alternative modeling strategy: Partial Least Squares
Link to the last RSS article here: Examination of Cross Validation techniques and the biases they reduce. -- Ed.
By Dr. Jon Starkweather, Research and Statistical Support Consultant
Partial Least Squares (PLS) modeling is often used as an alternative to traditional modeling techniques. Unlike traditional techniques, which rely upon covariance decomposition, PLS is a variance-based (or components-based) technique and does not carry with it many of the assumptions of covariance methods (e.g. distributional assumptions). It is sometimes considered an analysis of last resort because it does not require large samples and is less sensitive to multicollinearity. However, PLS is primarily descriptive when used with small samples, and small samples still constrain inferences about parameters. An advantage of this descriptive use is flexibility: PLS can fit models with non-linear relationships and non-Gaussian distributions among the variables, in addition to the traditional linear and Gaussian situations.
PLS is also quite versatile; it can be used as a regression technique, a principal components technique, a canonical correlation technique, or a path modeling (or structural equation modeling) technique. It is well documented that PLS estimates are biased because the optimization is local rather than global; however, as sample size increases, PLS becomes less biased. PLS can be used to make inferences about parameters when sample sizes are large. PLS is often used when other methods fail (i.e. a slightly biased estimate is better than no estimate).
As an example, we will first model a simulated data set using a traditional modeling technique and a popular package. John Fox’s (2010) package ‘sem’ is one of the more established modeling packages in R and will be used here to demonstrate how a specified model can fail to converge on certain data sets.
First, import the data from the internet and run the ubiquitous ‘head’ function to get a look at the data. The example data contains 20 variables (v1 – v20) and 1000 cases. Here we will name the data ‘pls.data’.
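A minimal sketch of the import step follows. The article's data location is not reproduced here, so the block writes a stand-in file of random values to a temporary path (hypothetical values, not the article's data) and reads it back the way a file from the internet would be read. The names ‘stand.in’ and ‘data.file’ are assumptions made for illustration; read.table also accepts a URL string in place of a file path.

```r
# Stand-in for the article's data file: 1000 cases on 20 variables (v1 - v20).
# These random values are hypothetical and are NOT the article's data.
set.seed(42)
stand.in <- as.data.frame(matrix(rnorm(1000 * 20), nrow = 1000, ncol = 20))
names(stand.in) <- paste0("v", 1:20)
data.file <- tempfile(fileext = ".txt")
write.table(stand.in, file = data.file, sep = ",", row.names = FALSE)

# Import step: 'data.file' could equally be a URL given as a character string.
pls.data <- read.table(data.file, header = TRUE, sep = ",")

# Look at the first six rows and confirm the dimensions (cases, variables).
head(pls.data)
dim(pls.data)
```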
Next, create a covariance matrix object which will be passed on to the ‘sem’ function. The covariance object is named ‘cov.m’ (some of the matrix in the image below is not shown).
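As a rough sketch of this step, the base R function cov returns the covariance matrix of the columns of a data frame. The data below are random stand-in values with the same shape as the example data (an assumption made so the block runs on its own), so the numbers will not match the article's output.

```r
# Hypothetical stand-in data: 1000 cases on v1 - v20 (not the article's data).
set.seed(42)
pls.data <- as.data.frame(matrix(rnorm(1000 * 20), ncol = 20))
names(pls.data) <- paste0("v", 1:20)

# Covariance matrix of the 20 variables, to be passed to the 'sem' function.
cov.m <- cov(pls.data)

# A 20 x 20 symmetric matrix; print only the upper-left corner for inspection.
dim(cov.m)
round(cov.m[1:5, 1:5], 3)
```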
Next, load the ‘sem’ package by typing: library(sem) in the R console. Then, specify the sem measurement model (i.e. confirmatory factor model). The model specification syntax is given below (not in an image) due to its length.
measurement.model <- specify.model()
F1 -> v1, lam11, NA
F1 -> v2, lam12, NA
F2 -> v3, lam21, NA
F2 -> v4, lam22, NA
F2 -> v5, lam23, NA
F3 -> v6, lam31, NA
F3 -> v7, lam32, NA
F3 -> v8, lam33, NA
F3 -> v9, lam34, NA
F3 -> v10, lam35, NA
F3 -> v11, lam36, NA
F4 -> v12, lam41, NA
F4 -> v13, lam42, NA
F4 -> v14, lam43, NA
F4 -> v15, lam44, NA
F5 -> v16, lam51, NA
F5 -> v17, lam52, NA
F5 -> v18, lam53, NA
F5 -> v19, lam54, NA
F5 -> v20, lam55, NA
v1 <-> v1, var1, NA
v2 <-> v2, var2, NA
v3 <-> v3, var3, NA
v4 <-> v4, var4, NA
v5 <-> v5, var5, NA
v6 <-> v6, var6, NA
v7 <-> v7, var7, NA
v8 <-> v8, var8, NA
v9 <-> v9, var9, NA
v10 <-> v10, var10, NA
v11 <-> v11, var11, NA
v12 <-> v12, var12, NA
v13 <-> v13, var13, NA
v14 <-> v14, var14, NA
v15 <-> v15, var15, NA
v16 <-> v16, var16, NA
v17 <-> v17, var17, NA
v18 <-> v18, var18, NA
v19 <-> v19, var19, NA
v20 <-> v20, var20, NA
F1 <-> F2, cov1, NA
F1 <-> F3, cov2, NA
F1 <-> F4, cov3, NA
F1 <-> F5, cov4, NA
F2 <-> F3, cov5, NA
F2 <-> F4, cov6, NA
F2 <-> F5, cov7, NA
F3 <-> F4, cov8, NA
F3 <-> F5, cov9, NA
F4 <-> F5, cov10, NA
F1 <-> F1, NA, 1
F2 <-> F2, NA, 1
F3 <-> F3, NA, 1
F4 <-> F4, NA, 1
F5 <-> F5, NA, 1
Next, we run the measurement model; but unfortunately, it does not converge.
So, we detach the ‘sem’ package using the following command: detach("package:sem") and decide to use a PLS strategy. The ‘plspm’ package (PLS Path Modeling; Sanchez & Trinchera, 2010) provides functions for conducting and graphing a variety of PLS techniques, such as PLS regression with a single outcome, PLS canonical correlation, PLS regression with multiple outcomes (similar to canonical correlation, but with directionality implied between the two composite variates), PLS principal components analysis, and PLS path modeling (i.e. SEM).
PLS Path Modeling
Load the package (which has three dependencies: amap, diagram, and shape).
First, we must create a matrix which expresses the inner (structural) model. This model simply shows the relationships among the latent variables, where the column variable 'causes' the row variable if a 'one' is in the intersecting cell (e.g. if f1 and f2 cause f3, then row 3 has a one in columns 1 and 2).
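The inner matrix can be sketched in base R as below. Only the "f1 and f2 cause f3" relationship comes from the text; the remaining paths (f3 to f4, f4 to f5) and the object name ‘inner.matrix’ are assumptions made for illustration, since the article's own matrix is not reproduced here.

```r
# Each row lists which column variables 'cause' that row variable (1 = path).
# Only row f3 follows the text; rows f4 and f5 are hypothetical paths.
f1 <- c(0, 0, 0, 0, 0)
f2 <- c(0, 0, 0, 0, 0)
f3 <- c(1, 1, 0, 0, 0)   # f1 and f2 cause f3
f4 <- c(0, 0, 1, 0, 0)   # assumed: f3 causes f4
f5 <- c(0, 0, 0, 1, 0)   # assumed: f4 causes f5

inner.matrix <- rbind(f1, f2, f3, f4, f5)
colnames(inner.matrix) <- rownames(inner.matrix)
inner.matrix

# Ones appear only below the diagonal, so no variable causes itself
# and no path points back to an earlier variable.
all(inner.matrix[upper.tri(inner.matrix, diag = TRUE)] == 0)
```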
Next, create the list which expresses the outer (measurement) model; this model simply shows the relationships between the manifest variables and the latent variables (e.g. variables v1 and v2 are related to the first factor [f1]). Although we create a list object in R, this is often referred to as the outer matrix in the PLS literature.
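One way to write the outer list is sketched below. It assumes the same assignment of variables to factors as the measurement model specified earlier for ‘sem’ (F1: v1 and v2; F2: v3 to v5; F3: v6 to v11; F4: v12 to v15; F5: v16 to v20); the object name ‘outer.list’ is an assumption.

```r
# Each list element holds the column positions of the manifest variables
# that belong to one latent variable, in the order f1, f2, f3, f4, f5.
outer.list <- list(1:2, 3:5, 6:11, 12:15, 16:20)

# Number of manifest variables per latent variable.
sapply(outer.list, length)

# Every one of the 20 columns should be assigned exactly once.
identical(sort(unlist(outer.list)), 1:20)
```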
Next, create a vector which identifies the "mode" of indicators which were used (i.e. "A" for reflective measurement or "B" for formative measurement). Recall, 'Reflective' measurement is said to occur when each manifest variable is "caused by" a latent variable and 'Formative' measurement is said to occur when each manifest variable "causes" the latent variable. Below, all 5 latent variables in our model are "reflectively" measured (i.e. each latent causes the observed scores on the manifest variables).
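A short sketch of the modes vector, with one entry per latent variable in the same order as the inner matrix rows. The object name ‘modes’ is an assumption.

```r
# One mode per latent variable: "A" = reflective, "B" = formative.
# All five latent variables are measured reflectively here.
modes <- rep("A", 5)
modes

# The vector needs exactly one entry per latent variable (five here).
length(modes) == 5
```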
Finally, we can run the Partial Least Squares Path Model. One of the benefits of using the ‘plspm’ package rather than one of the other PLS packages available in R is that the ‘plspm’ package offers output that is very easy to use and interpret. Each function provides a description of the function’s output items and shows how to extract or reference them.
Using the ‘summary’ function on a ‘plspm’ object provides a well-documented and indexed summary of the analysis’ output. Below you can see that the current summary is very thorough, with labels for each element, which makes interpretation very straightforward. In fact, the output (from the ‘summary’) is so large that it necessitates four screen capture images to display it all here.
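The call might look like the sketch below. It passes the data, inner matrix, outer list, and modes to the ‘plspm’ function by position, because argument names have differed across package versions. The data are simulated stand-ins (not the article's data), the paths other than "f1 and f2 cause f3" are assumed, and all object names are illustrative. The model is fitted only if the ‘plspm’ package is installed.

```r
set.seed(42)
n <- 1000

# Model pieces: hypothetical inner paths, outer blocks, reflective modes.
inner.matrix <- rbind(f1 = c(0, 0, 0, 0, 0), f2 = c(0, 0, 0, 0, 0),
                      f3 = c(1, 1, 0, 0, 0), f4 = c(0, 0, 1, 0, 0),
                      f5 = c(0, 0, 0, 1, 0))
colnames(inner.matrix) <- rownames(inner.matrix)
outer.list <- list(1:2, 3:5, 6:11, 12:15, 16:20)
modes <- rep("A", 5)

# Stand-in data: latent scores follow the assumed paths, and each
# manifest variable is its latent score plus random noise.
f1 <- rnorm(n)
f2 <- rnorm(n)
f3 <- 0.5 * f1 + 0.5 * f2 + rnorm(n, sd = 0.7)
f4 <- 0.6 * f3 + rnorm(n, sd = 0.8)
f5 <- 0.6 * f4 + rnorm(n, sd = 0.8)
scores <- cbind(f1, f2, f3, f4, f5)
pls.data <- as.data.frame(do.call(cbind, lapply(1:5, function(k) {
  sapply(outer.list[[k]], function(j) 0.8 * scores[, k] + rnorm(n, sd = 0.6))
})))
names(pls.data) <- paste0("v", 1:20)

# Fit the path model and print the indexed summary, if 'plspm' is available.
if (requireNamespace("plspm", quietly = TRUE)) {
  pls.fit <- plspm::plspm(pls.data, inner.matrix, outer.list, modes)
  print(summary(pls.fit))
}
```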
Another big advantage to using the ‘plspm’ package (rather than others available for PLS modeling) is the ability to produce a path diagram based on the model fitted.
Another advantage to using the ‘plspm’ package is the ability to conduct bootstrapped cross validation of a PLS path model using the ‘boot.val’ optional argument to the ‘plspm’ function.
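A hedged sketch of the bootstrap option follows. It assumes the ‘boot.val’ argument is paired with ‘br’ for the number of resamples, and that 100 resamples is an acceptable value. The data are simulated stand-ins, the inner paths beyond "f1 and f2 cause f3" are assumed, and the object names are illustrative. The fit runs only if ‘plspm’ is installed.

```r
set.seed(42)
n <- 1000

# Hypothetical inner model, built by switching on individual paths.
lv <- paste0("f", 1:5)
inner.matrix <- matrix(0, 5, 5, dimnames = list(lv, lv))
inner.matrix["f3", c("f1", "f2")] <- 1   # from the text
inner.matrix["f4", "f3"] <- 1            # assumed path
inner.matrix["f5", "f4"] <- 1            # assumed path
outer.list <- list(1:2, 3:5, 6:11, 12:15, 16:20)
modes <- rep("A", 5)

# Stand-in data generated to follow the assumed paths (not the article's data).
scores <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, lv))
scores[, "f3"] <- 0.5 * scores[, "f1"] + 0.5 * scores[, "f2"] + 0.7 * scores[, "f3"]
scores[, "f4"] <- 0.6 * scores[, "f3"] + 0.8 * scores[, "f4"]
scores[, "f5"] <- 0.6 * scores[, "f4"] + 0.8 * scores[, "f5"]
pls.data <- as.data.frame(do.call(cbind, lapply(1:5, function(k) {
  sapply(outer.list[[k]], function(j) 0.8 * scores[, k] + rnorm(n, sd = 0.6))
})))
names(pls.data) <- paste0("v", 1:20)

# Refit with bootstrap validation and show only the "$boot" element.
if (requireNamespace("plspm", quietly = TRUE)) {
  pls.boot <- plspm::plspm(pls.data, inner.matrix, outer.list, modes,
                           boot.val = TRUE, br = 100)
  print(pls.boot$boot)
}
```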
Notice in the above table, there is a “$boot” element in the output. The rest of the output is identical to what was displayed above. The “$boot” element contains the cross validation output, which is the only part of the output displayed below.
Interpretation was excluded from this article because the output of the functions covered is considered fairly intuitive. However, if one would like more information on interpreting PLS models, see Chin (2010).
Until next time, I’ll drive my Chevy to the levee...
References & Resources
Chin, W. W. (2010). How to write up and report PLS analyses. In Esposito, V., et al. (eds.), Handbook of Partial Least Squares (pp. 655 – 688). New York: Springer-Verlag.
Diamantopoulos, A., & Siguaw, J. A. (2006). Formative versus reflective indicators in organizational measure development: A comparison and empirical illustration. British Journal of Management, 17, 263 – 282.
Falk, R. F., & Miller, N. B. (1992). A primer for soft modeling. Akron, OH: University of Akron Press.
Garson, D. (2011). Partial Least Squares. Statnotes. Accessed May 9, 2011; from: http://faculty.chass.ncsu.edu/garson/PA765/pls.htm
Haenlein, M., & Kaplan, A. (2004). A beginner's guide to partial least squares analysis. Understanding Statistics, 3(4), 283 – 297. Available at: http://www.stat.umn.edu/~sandy/courses/8801/articles/pls.pdf
Lohmoller, J. (1989). Latent variable path modeling with partial least squares. New York: Springer-Verlag.
Marcoulides, G. A., & Saunders, C. (2006). PLS: A silver bullet? MIS Quarterly, 30, iii – ix.
Sanchez, G. (2010). Package 'plspm'. Available at CRAN: http://cran.r-project.org/web/packages/plspm/index.html
Tenenhaus, M., Vinzi, V. E., Chatelin, Y., & Lauro, C. (2005). PLS path modeling. Computational Statistics & Data Analysis, 48, 159 – 205. Available at: www.sciencedirect.com
Trinchera, L. (2007). Unobserved heterogeneity in structural equation models: A new approach to latent class detection in PLS path modeling. Doctoral dissertation. Available at: http://www.fedoa.unina.it/view/people/Trinchera,_Laura.html