Principal component analysis
   Feb 14, 2022 · 21 min read

Principal component analysis (PCA)

PCA is one of the most widely used dimension reduction techniques in data science. It is used extensively to project high-dimensional data onto a lower-dimensional space, which is useful for data visualization (usually in 2D) and for obtaining uncorrelated (orthogonal) features ordered by the amount of variance each of them explains. This article uses PCA to visualize 150 images of digits, each represented as a 28 x 28 pixel image, on a 2D coordinate system. Along the way I explain the theory behind PCA and the different algorithms that can be used to carry out this analysis in R. The goal is to look at PCA from several angles, which should give us a clearer picture of the method.

Theory

PCA relies on several results from linear algebra, so a working knowledge of linear algebra and of how it is used to manipulate a data matrix is a must for understanding PCA. You might want to read my previous post on linear algebra if you need a quick recap of these concepts.

Goal:

When a data set consists of \(p\) features and \(n\) observations, it can be represented as a matrix of dimensions \(n \times p\). Each observation is a vector of \(p\) dimensions and can be plotted in p-dimensional space. \(p\) is usually very large compared to \(n\), and it is desirable to represent the data in fewer dimensions, but we will lose some information whenever the data is represented in fewer than \(p\) dimensions. If we can find new dimensions such that the first explains the most variance in the data, the second the second most, and so on, we can ignore the dimensions towards the end because they contribute very little information. The whole data set can then be represented with only \(k\) dimensions, where \(k\) is much smaller than \(p\), with very little loss of information. This concept is much easier to understand when we see the reduction of a 2-dimensional space to a 1-dimensional space.

The matrix below (\(A\)) has five observations described by 2 features (for example, students measured by length and height), and we want to reduce this to a lower dimension (1D) by first finding two new directions, where the first explains the most variance (the first principal component) and the second explains the remaining variance. If we represent the whole data set with just the first principal component (which is a straight line), there will be minimal loss of information. This is very similar to linear regression, but the goal here is dimension reduction, not finding an equation for a dependent variable in terms of independent variables.

A = matrix(c(100,90,
             120,115,
             122, 140,
             130,150,
             140,155), byrow = T, ncol=2)
A
##      [,1] [,2]
## [1,]  100   90
## [2,]  120  115
## [3,]  122  140
## [4,]  130  150
## [5,]  140  155

The columns in the above matrix represent features and the rows are observations. Let's see the distribution of these five observations by the two features.

#Plot the raw observations and overlay the least-squares fit
plot(A, pch=20, cex=2, xlab= "length", ylab="height")
mod <- lm(A[,2]~A[,1])
abline(mod, lty=2, col="red",lwd=3)

In the plot above we can see that length and height are highly correlated, and we can find a line of best fit (the red dotted line) that captures most of the variance in this data. A direction in this coordinate system along which the variance is maximal lies very close to this regression line (strictly, the first principal component minimizes orthogonal distances to the line rather than vertical residuals, so the two directions are similar but not identical). The line of fit therefore approximates the direction of the first principal component, and a vector orthogonal to the first principal component is the second principal component, along which the remaining variance occurs.
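
To make this concrete, here is a small sketch (using the matrix A defined above) that compares the slope of the regression line with the slope of the first principal component direction, i.e. the leading eigen vector of the covariance matrix; the two are close for this data but not identical.

#Sketch: regression slope vs. slope of the first principal component direction.
#PC1 minimizes orthogonal distances while regression minimizes vertical ones,
#so the slopes are similar but not the same.
pc1 <- eigen(cov(A))$vectors[, 1]
c(regression = unname(coef(lm(A[, 2] ~ A[, 1]))[2]),
  pc1        = pc1[2] / pc1[1])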

Steps:

1. Centering

The first step in PCA is to center the data at the origin. This is done by subtracting the column means from each column. In addition to centering, if the variables differ greatly in their units of measurement, like weight in grams and length in kilometers, each column is also scaled by its standard deviation so that every variable contributes its proper share to the principal components. In this example we will only center the data. The figure below shows the centered plot, obtained using the sweep() function in R. The two blue arrows indicate the approximate directions of the principal components (along and orthogonal to the fitted line).

#Centering data
A_center <- sweep(A,2,colMeans(A))

plot(A_center, pch=20, cex=2, xlab= "length", ylab="height")
mod <- lm(A_center[,2]~A_center[,1])
abline(mod, lty=2, col="red",lwd=3)
arrows(x0=0,y0=0,x1=A_center[4,1],y1= predict(mod)[4],col="blue", lwd=2)
#negative reciprocal of the slope gives the slope of the orthogonal line
arrows(x0=0, y0=0, x1=-10,y1=-1*1/mod$coefficients[2]*-10,col="blue", lwd=2)
abline(h=0)
abline(v=0)
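
For completeness, if the two features were measured on very different scales we would also divide each centered column by its standard deviation. A minimal sketch using the built-in scale() function (not needed for this toy example):

#Sketch: centering and scaling in one step with scale()
A_scaled <- scale(A, center = TRUE, scale = TRUE)
round(colMeans(A_scaled), 10)  #column means are now 0
apply(A_scaled, 2, sd)         #column standard deviations are now 1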

2. Eigen decomposition

The second step is to find the vectors that give the principal directions. These are obtained by eigen decomposition of the covariance matrix of the centered data. For a covariance matrix, the eigen vectors give the directions of maximum variance of the data, ordered by their eigen values. The eigen decomposition of a covariance matrix (\(p\times p\)) can be represented by the equation below.

\[A=Q\Lambda Q^T\]

The matrix \(Q\) also has dimensions \(p \times p\), and its columns are the eigen vectors of the covariance matrix, which are also the loadings for the principal components. The matrix \(Q^T\) likewise has dimensions \(p \times p\) and, because \(Q\) is orthogonal, is equal to the inverse of \(Q\). Transforming our original (centered) matrix by this inverse matrix gives the coordinates of our data in the eigen basis.
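
As a quick sanity check, here is a sketch (using the centered toy matrix A_center from above) verifying that \(Q\Lambda Q^T\) reconstructs the covariance matrix and that \(Q^T\) equals the inverse of \(Q\):

#Sketch: verify the eigen decomposition identity on the toy covariance matrix
S   <- cov(A_center)
eig <- eigen(S)
all.equal(eig$vectors %*% diag(eig$values) %*% t(eig$vectors), S)
#t(Q) equals solve(Q) because Q is orthogonal
all.equal(t(eig$vectors), solve(eig$vectors))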

par(mfrow=c(1,2))
##now we need the unit vectors in each direction, given by the eigen vectors of the covariance matrix
#covariance matrix can be calculated using matrix function 
#t(A_center)%*%A_center/(5-1) 
#or R function
rotation= eigen(cov(A_center))

plot(A_center, pch=20, cex=2, xlab= "length", ylab="height")
mod <- lm(A_center[,2]~A_center[,1])
abline(mod, lty=2, col="red",lwd=3)
arrows(x0=0,y0=0,x1=A_center[4,1],y1= predict(mod)[4],col="blue", lwd=2)
#negative reciprocal of the slope gives the slope of the orthogonal line
arrows(x0=0, y0=0, x1=-10,y1=-1*1/mod$coefficients[2]*-10,col="blue", lwd=2)
abline(h=0)
abline(v=0)
arrows(x0=0, y0=0, x1=rotation$vectors[1,1]*3,
       y1=rotation$vectors[2,1]*3,col="green", lwd=2, angle = 20,length=0.1 )
arrows(x0=0, y0=0, x1=rotation$vectors[1,2]*3,
       y1=rotation$vectors[2,2]*3,col="green", lwd=2, angle = 20,length=0.1 )


##Now rotate the data so that new coordinates are x and y axis
invRot <- t(rotation$vectors)
newCord <- invRot %*% t(A_center)
newCord <- t(newCord)

VarExplained <- rotation$values
plot(newCord, pch=20, cex=2,
     xlab= paste0("PC1 (", round(VarExplained[1]/sum(VarExplained)*100,2),"%)"),
     ylab=paste0("PC2 (", round(VarExplained[2]/sum(VarExplained)*100,2),"%)"),
     ylim=c(-30,30),
     xlim=c(-60,40))
arrows(x0=0,y0=0,x1=10,y1=0, col="green", lwd=2, length=0.1)
arrows(x0=0,y0=0,x1=0,y1=10, col="green", lwd=2, length=0.1)
abline(h=0)
abline(v=0)

3. Projection

Now that we have represented our data in the new coordinate system, the data varies most along the direction of the first component. 98.11% of the variance is explained by the first principal component and only 1.89% by the second. We can therefore represent the data using PC1 alone and lose only 1.89% of the variance. The figure below shows the data projected onto PC1 (one dimension only).

##Now we can reduce dimension to one by taking PC1 only
plot(x=newCord[,1],y=rep(0,5), pch=20, cex=2,
     xlab= paste0("PC1 (", round(VarExplained[1]/sum(VarExplained)*100,2),"%)"),
     ylab="", yaxt="n")
arrows(x0=0,y0=0,x1=5,y1=0, col="green", lwd=2, length=0.1)
abline(h=0)
abline(v=0)

This concept extends easily from 2D to any \(n \times p\) matrix with \(p\) features and \(n\) observations. Let's see how we can use PCA to reduce the dimension of image data and visualize it on 2D coordinates.

#Read image data
Image <- read.csv("DigitMatrix.csv")
#See the frequency of each digits 
label <- Image$y
table(label)
## label
##  0  5  7 
## 50 50 50
#Make image matrix
Image <- Image[,-1]
##define a function to view these images
show_digit = function(arr784, col = gray(12:1/12), ...) {
  image(matrix(as.matrix(arr784), nrow = 28)[, 28:1], col = col, ...)
}
#see image of a random observation
show_digit(Image[109,])

#View average of the three digit types 
par(mfrow=c(1,3))
show_digit(apply(Image[1:50,],2,mean))
show_digit(apply(Image[51:100,],2,mean))
show_digit(apply(Image[101:150,],2,mean))

The figures above show a sample image represented by \(28 \times 28\) pixels, and the three types of digits (0, 5, and 7) in this data set, each shown as the average of its 50 images. Now let's apply PCA to this data using three methods, which will give us the same results.

  • prcomp() function in R
  • Eigen decomposition
  • SVD

prcomp()

The code below is the most commonly used way to perform PCA in R. For this function, the data matrix should be arranged so that the rows are the observations (n rows) and the columns are the features (p columns). The function provides options for centering and scaling, so we don't need to perform those operations beforehand.

##PCA by R function 
#Remove 0 variance columns 
Image_filt <- Image[, apply(Image, 2, var)!=0]
PCA_prcomp <- prcomp(Image_filt, center=TRUE, scale=TRUE)

eigenValues <- PCA_prcomp$sdev^2
library(ggplot2, quietly = T)
ggplot(data.frame(PCA_prcomp$x), aes(x=PCA_prcomp$x[,1], y=PCA_prcomp$x[,2], color=as.factor(label)))+geom_point() + stat_ellipse()+
  labs(x=paste0("PC1 (",round(eigenValues[1]/sum(eigenValues)*100,2), "%)" ),
       y=paste0("PC2 (",round(eigenValues[2]/sum(eigenValues)*100,2) ,"%)"),
       title= "PCA by prcomp() function in R", color= "Digits")

Results

The function prcomp() in R returns five components: sdev, rotation, center, scale, and x. The most important ones are described below.

  • sdev: These are the square roots of the eigen values. Squaring them gives the diagonal values of the \(\Lambda\) matrix in the eigen decomposition, and they correspond to the singular values on the diagonal of the \(\Sigma\) matrix in SVD via \(\lambda_i = d_i^2/(n-1)\). A barplot of the variance explained by each principal component is called a screeplot.
par(mfrow=c(1,2))
screeplot(PCA_prcomp, main = "Screeplot by R function")
#or manually 
barplot((PCA_prcomp$sdev^2)[1:10], main= "Screeplot by barplot function")
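
Beyond the scree plot, the cumulative proportion of variance explained is handy for deciding how many components to keep; a minimal sketch using the eigen values computed above (the 90% cutoff is just an arbitrary example):

#Sketch: cumulative proportion of variance explained by the PCs
varProp <- cumsum(eigenValues) / sum(eigenValues)
round(head(varProp, 10), 3)
#smallest number of PCs needed to explain at least 90% of the variance
which(varProp >= 0.9)[1]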

  • rotation: This is a matrix whose columns are the eigen vectors of the covariance matrix of the input. It has one row per feature, and its values are the loadings of each feature on the corresponding principal component. This forms the basis of the biplot, which, in addition to the new coordinates of the observations, shows the contribution of each feature to the principal components. The default biplot function doesn't work well when there is a large number of features, so let's extract the top 10 variables for PC1 and PC2 and manually plot those 20 variables using ggplot2.
library(gridExtra)
library(tidyverse)
p<-PCA_prcomp$rotation %>%
    as_tibble(rownames = "features")%>%
    select(features, PC1, PC2) %>%
    #keep pixels whose absolute loading ranks in the top 10 for PC1 or PC2
    filter(rank(-abs(PC1)) <= 10 | rank(-abs(PC2)) <= 10)

p2<-ggplot(data.frame(PCA_prcomp$x), aes(x=PCA_prcomp$x[,1], y=PCA_prcomp$x[,2], color=as.factor(label)))+
    geom_point() + stat_ellipse()+
    geom_segment(data=p,aes(x=0,y=0, xend=PC1*200, yend=PC2*200), arrow=arrow(length=unit(0.2,"cm")), color="cadetblue4")+
    geom_text(data=p,aes(x=PC1*200, y=PC2*200,label=features), inherit.aes=FALSE, size=3, vjust=1, hjust=1)+
    scale_y_continuous(name=paste0("PC2 (",round(eigenValues[2]/sum(eigenValues)*100,1),"%)"), sec.axis=sec_axis(~./200, name= "Variable loadings in PC2"))+
    scale_x_continuous(name=paste0("PC1 (",round(eigenValues[1]/sum(eigenValues)*100,1),"%)"), sec.axis=sec_axis(~./200,name="Variable loadings in PC1"))+
  labs(color="Digits")+
    theme(axis.title.y.right= element_text(color="cadetblue4"),
          axis.text.y.right=element_text(color="cadetblue4"),
          axis.text.x.top =element_text(color="cadetblue4"),
          axis.title.x.top= element_text(color="cadetblue4"))
   
p2

The plot above is a biplot showing only the top 10 contributors to PC1 and PC2 (20 in total). The axis labels on the right and top, colored like the arrows, show the loadings of each feature along PC1 and PC2. For example, pixel X497 contributes the most to PC1 and is almost orthogonal to pixel X517, which contributes mostly to PC2. This plot is also useful for identifying the original features that may play a significant role in distinguishing the three types of digits. If we look at the positions of these top 20 pixels in the 28 x 28 pixel frame (figure below), we can see that all of them lie in the bottom half of the image. Comparing this with the average image of each digit, we can see that on average the upper halves of these three digits look very similar while the lower halves differ, which is beautifully captured by PCA.

top<- 1:784
top <- ifelse(top%in%as.numeric(substr(p$features,2,nchar(p$features))), top, "")
Vectors <- data.frame(x= rep(1:28, 28), y=rep(28:1,each=28), z= top)
p3 <- ggplot(Vectors, aes(x,y))+geom_text(aes(label=top), size=3)
p3
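
Because the loading vectors are eigen vectors of a symmetric covariance matrix, they should be orthonormal; a quick sketch checking the first three columns of rotation:

#Sketch: the cross-product of the first three loading vectors should be the
#3 x 3 identity matrix (orthonormal columns)
round(crossprod(PCA_prcomp$rotation[, 1:3]), 10)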

  • x: These are the new coordinates obtained by rotating the original (centered and scaled) matrix. Each observation is a row vector expressed in the principal component coordinates (one column per PC). Since all PCs are orthogonal to each other, the covariance matrix of x is the diagonal matrix \(\Lambda\): the covariance between different PCs is zero and the covariance of each PC with itself is the variance of that PC. The matrix below shows the sub-matrix with only the first 10 PCs.
round(cov(PCA_prcomp$x),3)[1:10,1:10] ##This gives diagonal matrix with eigen values in diagonal.
##         PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8    PC9   PC10
## PC1  64.347  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
## PC2   0.000 39.789  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
## PC3   0.000  0.000 29.925  0.000  0.000  0.000  0.000  0.000  0.000  0.000
## PC4   0.000  0.000  0.000 23.753  0.000  0.000  0.000  0.000  0.000  0.000
## PC5   0.000  0.000  0.000  0.000 19.753  0.000  0.000  0.000  0.000  0.000
## PC6   0.000  0.000  0.000  0.000  0.000 19.101  0.000  0.000  0.000  0.000
## PC7   0.000  0.000  0.000  0.000  0.000  0.000 16.671  0.000  0.000  0.000
## PC8   0.000  0.000  0.000  0.000  0.000  0.000  0.000 14.001  0.000  0.000
## PC9   0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000 12.752  0.000
## PC10  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000 11.763
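
As a sketch, we can also confirm that the scores in x are simply the centered and scaled data multiplied by the rotation (loadings) matrix:

#Sketch: x should equal the scaled data times the loadings matrix
scores_manual <- scale(Image_filt) %*% PCA_prcomp$rotation
max(abs(scores_manual - PCA_prcomp$x))  #should be ~0 (numerical noise only)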

Eigen Decomposition:

PCA can also be computed manually using eigen decomposition. The first step is to center the data by the mean of each feature and scale it by the standard deviation. This is followed by calculating the covariance matrix of all the features (513 x 513). This covariance matrix is symmetric and therefore qualifies for eigen decomposition, which gives the variance of each new principal component as an eigen value, and the inverse of the eigen vector matrix can be used to rotate the original data into the eigen basis.

ImageScaled <- scale(Image_filt, center = TRUE, scale=TRUE)
#get covariance matrix manually 
# covar <- matrix(nrow=513, ncol=513)
# for (i in 1:513) {
#     for (j in 1:513) {
#         covar[i,j] <- cov(ImageScaled[,i], ImageScaled[,j])
#     }
# }

#covar by R function
covar <- cov(ImageScaled)
##eigen decomposition of covariance matrix
rotate <- eigen(covar)

#invRot <- solve(rotate$vectors) #alternative method to get inverse of a matrix
invRot <- t(rotate$vectors)
newCord <- invRot %*% t(ImageScaled)
newCord <- t(newCord)

ggplot(data.frame(newCord), aes(x=newCord[,1], y=newCord[,2], color=as.factor(label)))+geom_point() + stat_ellipse()+
  labs(x=paste0("PC1 (",round(rotate$values[1]/sum(rotate$values)*100,2), "%)" ),
       y=paste0("PC2 (",round(rotate$values[2]/sum(rotate$values)*100,2) ,"%)"),
       title= "PCA by manual eigen decomposition", color= "Digits")

##Scree plot is a barplot of the eigen values
barplot(rotate$values[1:10], main = "Screeplot for manual eigen decomposition")

The plots above show that we get the same result whether we use the prcomp() function or manual eigen decomposition. Under the hood, prcomp() uses another method of matrix decomposition, singular value decomposition (SVD), which is computationally more efficient than eigen decomposition and has wider application because it can also be applied to non-square matrices. R also has a function princomp(), which performs the actual eigen decomposition but can only be used when the number of observations is larger than the number of features.
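
As a sketch, we can also confirm numerically that the leading eigen values from the manual decomposition agree with the variances reported by prcomp():

#Sketch: the first 10 eigen values should match sdev^2 from prcomp()
all.equal(rotate$values[1:10], (PCA_prcomp$sdev^2)[1:10])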

Manual SVD:

SingularVal <- svd(ImageScaled)
##eigen values are given by d^2/(n-1)
eigen_SVD = (SingularVal$d)^2/149

##The principal components (scores) are given by U %*% diag(d)
newCord2 = SingularVal$u %*% diag(SingularVal$d, nrow=150, ncol=150)

ggplot(data.frame(newCord2), aes(x=newCord2[,1], y=newCord2[,2], color=as.factor(label)))+geom_point() + stat_ellipse()
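
A quick sketch to confirm that the right singular vectors (V) from the SVD correspond to the loadings returned by prcomp(), up to possible sign flips of individual columns:

#Sketch: V from the SVD should match prcomp's rotation matrix, up to a
#possible sign flip of each column
max(abs(abs(SingularVal$v[, 1:5]) - abs(PCA_prcomp$rotation[, 1:5])))  #~0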

Principal coordinate analysis

Principal coordinate analysis (PCoA) is essentially PCA, but instead of finding the directions of maximum variance in the features (pixels in this example), PCoA finds the directions of maximum distance between the samples. When the distance measure is Euclidean, the directions of maximum variance of the features and of maximum distance between the samples are the same.

#Distance matrix giving dist between samples
EucDist <-as.matrix(dist(ImageScaled, method="euclidean"))
PCoA<- cmdscale(EucDist, eig=TRUE)
ggplot(data.frame(PCoA$points), aes(x=PCoA$points[,1], y=PCoA$points[,2], color=as.factor(label)))+geom_point() + stat_ellipse()

##Manually perform PCoA by eigen decomposition 
x <- EucDist^2
#Double center this matrix 
R = x*0 + rowMeans(x)
C= t(x*0 + colMeans(x))
x_DC <- x-R-C+mean(x[])
e <- eigen(-x_DC/2, symmetric = T)
newCord3 <- e$vectors*rep(sqrt(e$values), each=150)
## Warning in sqrt(e$values): NaNs produced
ggplot(data.frame(newCord3), aes(x=newCord3[,1], y=-newCord3[,2], color=as.factor(label)))+geom_point() + stat_ellipse()
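
Since we used Euclidean distances on the scaled data, the PCoA coordinates should match the PCA scores up to sign flips of the individual axes; a quick sketch:

#Sketch: PCoA on Euclidean distances reproduces the PCA scores
#(up to a possible sign flip of each axis)
all.equal(abs(PCoA$points[, 1]), abs(PCA_prcomp$x[, 1]),
          tolerance = 1e-6, check.attributes = FALSE)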

##For details see the reference 
#https://www.cs.utah.edu/~jeffp/teaching/cs5140-S15/cs5140/L15-SVD.pdf
#https://stats.stackexchange.com/questions/14002/whats-the-difference-between-principal-component-analysis-and-multidimensional/132731#132731
#https://stackoverflow.com/questions/43639063/double-centering-in-r

Conclusion:

This shows that matrix decomposition is central to dimension reduction: depending on which matrix is used (a covariance matrix or a distance matrix), the appropriate matrix computation finds the PCs, that is, the directions along which the data is spread the most.

References: