Near Infrared Transmitance Tutorial (part 3)

Let´s check the correlation between the outcomes, between the predictors and between outcomes and predictors
R
NIT Tutorial
Collinearity
Author

José Ramón Cuesta

Published

September 29, 2022

Organizing data

Let´s start loading the libraries we are going to use in this post and the workspace from the previous post.

library(tidyverse)
library(corrplot)
load("C:/BLOG/Workspaces/NIT Tutorial/NIT_ws2.RData")
ls()
[1] "absorp"       "endpoints"    "meats"        "meats_longer" "tecator"     

Correlation between the parameters (outcomes)

It is important to check the correlation between the outcomes variables (Moisture, Fat and Protein in this case). As we remember the outcomes are in the “endpoints” matrix.

endpoints %>% 
  cor()
           Moisture        Fat    Protein
Moisture  1.0000000 -0.9881002  0.8145212
Fat      -0.9881002  1.0000000 -0.8608965
Protein   0.8145212 -0.8608965  1.0000000

The correlation matrix show how the are high correlations between the three parameters, but we can use graphics to see it more clearly:

corrplot(cor(endpoints), method = "ellipse")

Figure 1: Inter-correlation between parameters.

But to check better the trends and the possible outliers, we can use the function “pairs

pairs(endpoints)

Figure 2: Parameters inter-correlation

This way we can see at least six outliers that seems to have some bias versus the linear trend. What are the reason for these outliers? (samples different from the rest, different reference method, bias in the reference method,…). We will try to see along the tutorial.

Correlation between the wavelengths (predictors)

It is known that the NIR or NIT predictor variables are highly correlated, due that we are working with overtones and combination bands, so the correlation matrix in this case show high correlation between all the variables, due to we are in the third overtone and working with very broad bands.For this reason we have to apply math treatments to the spectra to remove the scatter effect and derivatives to improve the resolution of the bands. The correlation between predictors is a long matrix (100.100), so the best way to see it it is graphically. By now we see the correlation matrix of the raw spectra (without any math treatment)

corrplot(cor(absorp), method = "circle", type = "upper", tl.cex = 0.5)

Figure 3: Spectra collinearity.

Correlation between outcomes and predictors

Another point is how the variation in the predictors matrix correlates with the variation of the outcomes. What we do is to see which wavelengths correlate better with a certain parameter getting three correlation spectra (one for every parameter).

cor_rawspec_moi <- cor(tecator$Moisture, tecator$spec)
cor_rawspec_fat <- cor(tecator$Fat, tecator$spec)
cor_rawspec_prot <- cor(tecator$Protein, tecator$spec)

cor_rawspec <- as.data.frame(rbind(cor_rawspec_moi, cor_rawspec_fat, cor_rawspec_prot))
cor_rawspec <- cor_rawspec %>% 
  mutate(Parameter = as.factor(c("Moisture", "Fat", "Protein")))

cor_rawspec %>% 
  pivot_longer(cols = c(1:100), names_to = "Wavelength", values_to = "Correlation") %>% 
  mutate(Wavelength = as.integer(Wavelength)) %>% 
  ggplot(aes(x = Wavelength, y = Correlation, group = Parameter, col = Parameter)) +
  geom_line()

Figure 4: Correlation absorbance vs. parameters.

As we can see there are no wavelengths with high correlations, and if we would auto-scale every correlation spectrum, the spectrum would seem as a meat sample spectra (for moisture and protein inverted). All this is due to the scatter physical effects. So, with some math treatments to remove it the correlation will improve. Anyway, in the third overtone due to the bands overlapping, we would need a multivariate calibration with all or almost all the wavelengths.

As always save the workspace for future use:

save.image("C:/BLOG/Workspaces/NIT Tutorial/NIT_ws3.RData")