Multivariate Statistics: Using the Right Tool for the Job

Learning Objectives

Understand when to apply multivariate tools
Understand the difference between supervised/constrained and unsupervised/unconstrained
Interpret score and loading plots
Match research questions to supervised/unsupervised analyses
Install an R package from Bioconductor
Find help for “generic” functions
Implement PCA and PLS-DA with R code

What is “multivariate” data?

Many things measured on the same sample/observation

Multiple response/dependent variables = “multivariate”
Examples:
- How is species composition impacted by burning?
- How are plant metabolites impacted by herbivory?
- How does treatment effect gene expression?

Multiple predictor/independent variables = “multivariable”
Examples:
- How does species composition effect ecosystem productivity?
- How does plant metabolite blend effect herbivory?
- How does gene expression effect disease state?

“Multivariate” is often used to describe both (Hidalgo and Goodman 2013)

Challenges of multivariate data

“Traditional” statistics can’t handle multiple response variables
Variables are often correlated and not fully independent (“multicollinearity”)
More variables than observations (“curse of dimensionality”)
How to determine importance of individual variables

Cupcakes vs. Muffins

Sample of 30 cupcake & muffin recipes scraped from allrecipes.com with frosting ingredients removed¹
40 ingredients, units standardized in US cups (although this is not necessary for multivariate data analysis)
You can think of the ingredients for each recipe like species, gene transcripts, chemicals, OTUs, or whatever fits your discipline
You can think of type (cupcake or muffin) as whatever treatment or site difference fits your discipline

The Data

baked_goods <- read_csv("https://raw.githubusercontent.com/cct-datascience/CALS-workshops/main/202302-multivariate/baked_goods.csv")
head(baked_goods)

# A tibble: 6 × 31
  type    recipe_id baking pow…¹ baking …²  butter butterm…³ chocolate
  <chr>       <dbl>        <dbl>     <dbl>   <dbl>     <dbl>     <dbl>
1 Cupcake    145206     0.00174   0.000868 0          0         0     
2 Cupcake    240140     0.000868  0.000694 0.0167     0         0     
3 Cupcake    161019     0.000868  0.00174  0          0.0417    0.0339
4 Muffin      16945     0         0        0.0741     0         0     
5 Muffin     228562     0.00116   0.00231  0.00694    0         0     
6 Cupcake    215375     0.00130   0.000651 0          0         0     
# … with 24 more variables: cornmeal <dbl>, `cream cheese` <dbl>,
#   eggs <dbl>, flour <dbl>, fruit <dbl>, `fruit juice` <dbl>,
#   honey <dbl>, margarine <dbl>, milk <dbl>, nut <dbl>, oats <dbl>,
#   oil <dbl>, other <dbl>, salt <dbl>, `sour cream` <dbl>,
#   spice <dbl>, starch <dbl>, sugar <dbl>, unitless <dbl>,
#   vanilla <dbl>, vegetable <dbl>, vinegar <dbl>, water <dbl>,
#   yogurt <dbl>, and abbreviated variable names ¹`baking powder`, …

Unsupervised/unconstrained

Exploratory
Finds axes that explain the variation in the data
Single-sided equation (no dependent/independent variables)

Supervised/constrained

Explanatory
Finds axes that best separate multivariate data along some variable (e.g. cupcakes vs. muffins)

Results: unsupervised (PCA)

Results: supervised (PLS-DA)

Use the right tool for the job

Two different questions answered:

Unsupervised: Do muffins and cupcakes differ in the ingredients that vary most among them?
Supervised: Do muffins and cupcakes differ? What ingredients make them different?

Important

Most of the time, if you are looking to test a hypothesis, you have a supervised type question unless the main axis of variation is of particular interest. E.g. “Do muffins and cupcakes tend to have different leavening systems?”

R Packages for Multivariate Analysis

There are many R packages for multivariate data analysis, but not one perfect package
vegan is a good toolkit, with extensive documentation, but the language is very specific to community ecology
ropls a good option if you’re interested in PCA or PLS and its variants (a supervised technique), but the UI is unusual and it uses S4 objects which are harder to work with
ade4 package is a good toolkit, but documentation is terse and written for someone who is already a stats wiz
If you just need principal components analysis (PCA), this can be done in base R with prcomp()

PCA with base R

Extract just the ingredients columns

library(dplyr)
ingredients <- baked_goods |> select(-type, -recipe_id)
head(ingredients)

# A tibble: 6 × 29
  baking po…¹ baking …²  butter butterm…³ chocolate cornmeal cream c…⁴
        <dbl>     <dbl>   <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1    0.00174   0.000868 0          0         0             0    0     
2    0.000868  0.000694 0.0167     0         0             0    0.0333
3    0.000868  0.00174  0          0.0417    0.0339        0    0     
4    0         0        0.0741     0         0             0    0     
5    0.00116   0.00231  0.00694    0         0             0    0     
6    0.00130   0.000651 0          0         0             0    0.0312
# … with 22 more variables: eggs <dbl>, flour <dbl>, fruit <dbl>,
#   `fruit juice` <dbl>, honey <dbl>, margarine <dbl>, milk <dbl>,
#   nut <dbl>, oats <dbl>, oil <dbl>, other <dbl>, salt <dbl>,
#   `sour cream` <dbl>, spice <dbl>, starch <dbl>, sugar <dbl>,
#   unitless <dbl>, vanilla <dbl>, vegetable <dbl>, vinegar <dbl>,
#   water <dbl>, yogurt <dbl>, and abbreviated variable names
#   ¹`baking powder`, ²`baking soda`, ³buttermilk, ⁴`cream cheese`

PCA with base R

Do PCA with prcomp()

baked_pca <- prcomp(ingredients, scale. = TRUE)

Note

It is usually advised to scale and center variables before doing multivariate analyses like PCA. You can do this in prcomp() with the scale. = TRUE argument, or before running PCA using the scale() function.

PCA with base R

Inspect results and decide how many axes to “retain”

summary(baked_pca)

Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     1.9737 1.8379 1.68784 1.57981 1.50967 1.41547 1.37242
Proportion of Variance 0.1343 0.1165 0.09823 0.08606 0.07859 0.06909 0.06495
Cumulative Proportion  0.1343 0.2508 0.34904 0.43511 0.51370 0.58278 0.64773
                          PC8     PC9    PC10    PC11    PC12    PC13    PC14
Standard deviation     1.2947 1.11086 1.08492 1.01494 0.96451 0.91007 0.83833
Proportion of Variance 0.0578 0.04255 0.04059 0.03552 0.03208 0.02856 0.02423
Cumulative Proportion  0.7055 0.74808 0.78867 0.82419 0.85627 0.88483 0.90907
                          PC15   PC16    PC17    PC18   PC19    PC20    PC21
Standard deviation     0.77799 0.6959 0.67063 0.52921 0.4785 0.44377 0.34799
Proportion of Variance 0.02087 0.0167 0.01551 0.00966 0.0079 0.00679 0.00418
Cumulative Proportion  0.92994 0.9466 0.96215 0.97180 0.9797 0.98649 0.99067
                          PC22    PC23    PC24    PC25    PC26    PC27    PC28
Standard deviation     0.32268 0.25521 0.23283 0.16225 0.13071 0.05266 0.03129
Proportion of Variance 0.00359 0.00225 0.00187 0.00091 0.00059 0.00010 0.00003
Cumulative Proportion  0.99426 0.99650 0.99837 0.99928 0.99987 0.99996 1.00000
                           PC29
Standard deviation     0.007944
Proportion of Variance 0.000000
Cumulative Proportion  1.000000

PCA with base R

Visualize results

# biplot
biplot(baked_pca)

#or extract scores and loadings and build your own!
scores <- baked_pca$x
loadings <- baked_pca$rotation

Now you try!

Do PCA on the Palmer penguins dataset using bill length, bill depth, flipper length, and body mass as variables of interest.

library(palmerpenguins)
head(penguins)

# A tibble: 6 × 8
  species island   bill_le…¹ bill_de…² flipper…³ body_ma…⁴ sex    year
  <fct>   <fct>        <dbl>     <dbl>     <int>     <int> <fct> <int>
1 Adelie  Torgers…      39.1      18.7       181      3750 male   2007
2 Adelie  Torgers…      39.5      17.4       186      3800 fema…  2007
3 Adelie  Torgers…      40.3      18         195      3250 fema…  2007
4 Adelie  Torgers…      NA        NA          NA        NA <NA>   2007
5 Adelie  Torgers…      36.7      19.3       193      3450 fema…  2007
6 Adelie  Torgers…      39.3      20.6       190      3650 male   2007
# … with abbreviated variable names ¹bill_length_mm, ²bill_depth_mm,
#   ³flipper_length_mm, ⁴body_mass_g

Tip

You’ll need to remove rows with NA. One way to do this is with the drop_na() function from the tidyr package. E.g.:

data |> select(col1, col2) |> drop_na()

Using the `ropls` package

ropls is a powerful R package for principal component analysis (PCA) and partial least squares regression (PLS)
Includes model diagnostics and permutation test statistics by default
Decent plots
Must install from Bioconductor
User interface is unusual and can be confusing

PCA with `ropls`

Install ropls

# install instructions copied from ropls page:
# https://bioconductor.org/packages/release/bioc/html/ropls.html

# install BiocManager
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# use bioconductor to install ropls
BiocManager::install("ropls")

PCA with `ropls`

Do PCA

There’s just one function, opls(), that does PCA, or various ‘flavors’ of PLS depending on the arguments you give it. If we give it just x data, it’ll do PCA.

baked_pca2 <- opls(ingredients)

PCA with `ropls`

PCA
30 samples x 29 variables
standard scaling of predictors
      R2X(cum) pre ort
Total    0.513   5   0

PCA with `ropls`

Make a better score plot.

How to find help for a generic function like plot()?

#figure out what the object class is
class(baked_pca2)

[1] "opls"
attr(,"package")
[1] "ropls"

#See what methods are available
methods(class = "opls")

 [1] coef         fitted       getEset      getLoadingMN getPcaVarVn 
 [6] getScoreMN   getSubsetVi  getSummaryDF getVipVn     getWeightMN 
[11] plot         predict      print        residuals    show        
[16] tested      
see '?methods' for accessing help and source code

#look for help for that plot method
?plot.opls

PCA with `ropls`

# score plot
plot(baked_pca2, typeVc = "x-score", parAsColFcVn = baked_goods$type)
# correlation plot
plot(baked_pca2, typeVc = "correlation")

PLS-DA with `ropls`

Partial Least Squares regression (PLS or PLSR) is a supervised multivariable technique
Handles multicollinearity (correlated predictors) and more variables than observations exceptionally well
When \(Y\) is categorical (e.g. cupcake or muffin), it’s PLS discriminant analysis (PLS-DA)

PLS-DA with `ropls`

Try ordinary PLS-DA

baked_pls <- opls(ingredients, baked_goods$type)

PLS-DA with `ropls`

PLS-DA
30 samples x 29 variables and 1 response
standard scaling of predictors and response(s)
      R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y  pQ2
Total     0.11    0.742   0.521  0.26   1   0 0.05 0.05

Warning: Single component model: only 'overview' and 'permutation' (in case of
single response (O)PLS(-DA)) plots available

Orthogonal PLS-DA

Orthogonal PLS-DA (OPLS-DA) finds a single predictive axis that separates 2 categories
Other axes are orthogonal—they’re not related to difference between cupcakes & muffins (e.g.)
Usually not recommended, but in the case of a single predictive axis with PLS-DA, it allows us to visualize results

OPLS-DA with `ropls`

Try OPLS-DA

baked_opls <- opls(ingredients, baked_goods$type, predI = 1, orthoI = NA)

OPLS-DA with `ropls`

OPLS-DA
30 samples x 29 variables and 1 response
standard scaling of predictors and response(s)
      R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y  pQ2
Total    0.189    0.832    0.58 0.214   1   1 0.05 0.05

OPLS-DA with `ropls`

Interpret statistics

OPLS-DA
30 samples x 29 variables and 1 response
standard scaling of predictors and response(s)
      R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y  pQ2
Total    0.189    0.832    0.58 0.214   1   1 0.05 0.05

\(R^2_X\)(cum) = How much variation in ingredients is explained by the axes
\(R^2_Y\)(cum) = How much of the response (cupcake vs. muffin) is explained by the axes
\(Q^2\)(cum) = Predictive power of the model—should be close to R2Y, but will always be lower

OPLS-DA with `ropls`

Interpret statistics

OPLS-DA
30 samples x 29 variables and 1 response
standard scaling of predictors and response(s)
      R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y  pQ2
Total    0.189    0.832    0.58 0.214   1   1 0.05 0.05

RMSEE = Square root of the mean error between the actual and the predicted responses
pre, ort = Number of predictive and orthogonal axes
pR2Y, pQ2 = p-values from permutation tests

OPLS-DA with `ropls`

Increase number of permutations

baked_opls <- 
  opls(ingredients, baked_goods$type, predI = 1, orthoI = NA, permI = 999)

OPLS-DA with `ropls`

OPLS-DA
30 samples x 29 variables and 1 response
standard scaling of predictors and response(s)
      R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort    pR2Y   pQ2
Total    0.189    0.832    0.58 0.214   1   1 0.00501 0.001

OPLS-DA with `ropls`

Customize plot

# score plot
plot(baked_opls, typeVc = "x-score")
# correlation plot
plot(baked_opls, typeVc = "correlation")
# Or build your own from scores and loadings
scores <- bind_cols(
  getScoreMN(baked_opls, orthoL = FALSE),
  getScoreMN(baked_opls, orthoL = TRUE)
) |> add_column(type = baked_goods$type)
loadings <- getLoadingMN(baked_opls) |> as_tibble(rownames = "ingredient")

Now you try

With the penguins dataset, try re-doing PCA using the ropls package. In what ways is the result the same? In what ways is it different?
Now try using PLS-DA to determine if penguins differ in morphology by species¹

Getting Help

CCT Data Science drop-in hours: datascience.cct.arizona.edu/drop-in-hours
Email us: cct-datascience@arizona.edu
Book an appointment: datascience.cct.arizona.edu/people
Data Science Incubator program: datascience.cct.arizona.edu/cct-data-science-incubator

References

Hidalgo, Bertha, and Melody Goodman. 2013. “Multivariate or Multivariable Regression?” American Journal of Public Health 103 (1): 39–40. https://doi.org/10.2105/ajph.2012.300897.

Scott, Eric R. 2019. “Cupcakes Vs Muffins: Round 2.” www.ericrscott.com/post/cupcakes-vs-muffins-round-2/.

Multivariate Statistics: Using the Right Tool for the Job

Learning Objectives

What is “multivariate” data?

Challenges of multivariate data

Cupcakes vs. Muffins

The Data

Unsupervised/unconstrained

Supervised/constrained

Results: unsupervised (PCA)

Results: supervised (PLS-DA)

Use the right tool for the job

R Packages for Multivariate Analysis

PCA with base R

PCA with base R

PCA with base R

PCA with base R

Now you try!

Using the ropls package

PCA with ropls

PCA with ropls

PCA with ropls

PCA with ropls

PCA with ropls

PLS-DA with ropls

PLS-DA with ropls

PLS-DA with ropls

Orthogonal PLS-DA

OPLS-DA with ropls

OPLS-DA with ropls

OPLS-DA with ropls

OPLS-DA with ropls

OPLS-DA with ropls

OPLS-DA with ropls

OPLS-DA with ropls

Now you try

Getting Help

Further Reading

References

Using the `ropls` package

PCA with `ropls`

PCA with `ropls`

PCA with `ropls`

PCA with `ropls`

PCA with `ropls`

PLS-DA with `ropls`

PLS-DA with `ropls`

PLS-DA with `ropls`

OPLS-DA with `ropls`

OPLS-DA with `ropls`

OPLS-DA with `ropls`

OPLS-DA with `ropls`

OPLS-DA with `ropls`

OPLS-DA with `ropls`

OPLS-DA with `ropls`