Why (a) PCA?
A principal component analysis is a way to reduce the dimensionality of a data set consisting of numeric vectors. Once the data has been reduced to three or fewer dimensions, it can be visualized. Have a look at this use case for an application. I’ll try to explain the motivation using a simple example.
Think of a very flat cuboid (e.g. 1 × 1 × 0.05) in three-dimensional space. What you see of this cuboid when looking at it from a specific angle, ignoring perspective, is basically a two-dimensional projection of it. From one angle it looks like a rhombus; looking straight at the edge, all you see is a line-like shape. So obviously the representation offering the most information about this cuboid is the one you see when looking at it perpendicular to its main surface.
Then you see a square. Because the cuboid is very flat, you have lost information (two separate points might now lie exactly on top of each other), but as little as possible for a projection in this case. The same idea works analogously in higher dimensions, except that there you won’t be able to look at the actual spatial arrangement and make a conscious decision about how to project it into a lower-dimensional space while keeping the relevant information. This is where PCA comes into play, because, as you will see, this projection can often be calculated. I will exemplify the concept using a flat ellipsoid. The first part assumes basic knowledge of linear algebra and statistics.
(The indefinite article in the subtitle is meant to express that there is not a single way to perform a principal component analysis, but several variations.)
The algorithm
- The ellipsoid in our example is represented by 2000 points in three-dimensional space. They are put into a 2000×3 matrix P: one row per point and three columns for each point’s coordinates x, y and z.
- From every element of P we subtract the mean of the column it is located in. The resulting matrix we name P’. This is called mean-correction.
- We calculate the covariance matrix C of P’.
- We calculate the eigenvectors and eigenvalues of C, which we name v1, v2, v3 and e1, e2, e3. Eigenvectors and eigenvalues are paired and numbered such that e1 ≥ e2 ≥ e3.
- You multiply every point with every eigenvector, or, which is effectively the same, you multiply P’ with each eigenvector. This gives you the same point set, but represented in a new coordinate system.
So P’ × v1 gives the new x-coordinate of every point, P’ × v2 the new y-coordinate and so on. In the illustration the eigenvectors are normed to length 1 and shown in red, blue and green.
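Condensed into a few lines of R, the five steps above look roughly like this. This is a minimal sketch on a small random point cloud; the variable names are mine and differ from the full script further down.

# a toy stand-in for the ellipsoid: 200 points in three dimensions
P <- matrix(rnorm(200 * 3), ncol = 3)

# (2) mean-correction: subtract the column means
P_centered <- scale(P, center = TRUE, scale = FALSE)

# (3) covariance matrix of the mean-corrected points
C <- cov(P_centered)

# (4) eigenvectors (columns of $vectors) and eigenvalues ($values),
#     sorted by decreasing eigenvalue
eig <- eigen(C)

# (5) project the points onto the eigenvectors:
#     column 1 holds the new x-coordinates, column 2 the new y-coordinates, ...
scores <- P_centered %*% eig$vectors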
These eigenvectors are, in this context, called “principal components”: v1 is the first principal component, v2 the second and v3 the third. The ordering by the associated eigenvalues matters because the larger the eigenvalue, the more important the component. In our case it might seem odd to call all three components “principal”, but when PCA is applied to a data set with, say, 1000 original dimensions and accordingly many components, only the first N components with comparatively large eigenvalues are of principal importance.
Please note how sensibly the three components are “chosen”. The third and last component points in the direction along which the points spread the least, and it is therefore also the least important. Along the first component the points show the largest spread, which is why this component carries the most information about their relative positions.
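This can be checked numerically: the variance of the new coordinates along each component equals the corresponding eigenvalue (up to floating-point error), and dividing the eigenvalues by their sum shows which share of the total variance each component captures. The snippet continues the small sketch above, with scores and eig as defined there.

# variance along each principal component equals the corresponding eigenvalue
apply(scores, 2, var)
eig$values

# share of the total variance captured by each component
eig$values / sum(eig$values)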
Reducing the dimensionality
Now that we have the new coordinate system formed by the three components, we can display the point cloud using fewer dimensions and different combinations of components if we want. This is best illustrated with a plot:
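For example, keeping only the coordinates along the first two components gives a flat, two-dimensional view of the cloud. This is a minimal sketch continuing the toy example above; the full script below draws the corresponding plots for the ellipsoid.

# drop the third component and plot the first two as a 2D representation
scores2d <- scores[, 1:2]
plot(scores2d, pch = 16, col = rgb(0, 0, 0, .3),
     xlab = "1st principal component", ylab = "2nd principal component")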
And as promised …
… here comes the R code. The script is mostly of academic interest, though, as R already offers a function for performing a PCA: princomp.
library(rgl)
library(geozoo)

# (1) constructing the ellipsoidal point cloud,
#     then shearing and shifting it so it is neither axis-aligned nor centred,
#     plus a little noise on the first coordinate
p <- as.data.frame(ellipsoid(n=2000, a=1, b=0.4, c=0.1)[["points"]])
p$V3 <- p$V3 + p$V1 * .2
p$V3 <- p$V3 + p$V2 * .2
p$V2 <- p$V2 + p$V1 * .5
p$V1 <- p$V1 + 1
p$V2 <- p$V2 + 2
p$V1 <- p$V1 + runif(nrow(p), -.1, .1)

# (2) mean-correcting the point matrix
p0 <- p
p0[] <- lapply(p, function(x) x - mean(x))
m0 <- as.matrix(p0)

# (3) the covariance matrix
covM0 <- cov(m0)

# (4) the eigenvectors / principal components
e <- eigen(covM0)

# (5) transforming the points
pca <- data.frame(
  x = (m0 %*% e$vectors[,1]),
  y = (m0 %*% e$vectors[,2]),
  z = (m0 %*% e$vectors[,3])
)

# 3D plot of the mean-corrected cloud with the three eigenvectors
plot3d(x=p0$V1, y=p0$V2, z=p0$V3, aspect="iso", type="s", size=.3, alpha=.5,
       xlab="X", ylab="Y", zlab="Z")

lines3d(x=c(0,e$vectors[1,1]), y=c(0,e$vectors[2,1]), z=c(0,e$vectors[3,1]), col="red", lwd=2)
lines3d(x=c(0,e$vectors[1,2]), y=c(0,e$vectors[2,2]), z=c(0,e$vectors[3,2]), col="blue", lwd=2)
lines3d(x=c(0,e$vectors[1,3]), y=c(0,e$vectors[2,3]), z=c(0,e$vectors[3,3]), col="darkgreen", lwd=2)

text3d(x=e$vectors[1,1], y=e$vectors[2,1], z=e$vectors[3,1], text="1st", font=2, adj=-1, col="red")
text3d(x=e$vectors[1,2], y=e$vectors[2,2], z=e$vectors[3,2], text="2nd", font=2, adj=-1, col="blue")
text3d(x=e$vectors[1,3], y=e$vectors[2,3], z=e$vectors[3,3], text="3rd", font=2, adj=-1, col="darkgreen")

# plotting the columns of the covariance matrix, normed to length 1
dfCovM0 <- as.data.frame(covM0)
dfCovM0 <- as.data.frame(lapply(dfCovM0, function(x) x/sqrt(sum(x^2))))

lines3d(x=c(0,dfCovM0[1,1]), y=c(0,dfCovM0[2,1]), z=c(0,dfCovM0[3,1]), col="red", alpha=.5)
lines3d(x=c(0,dfCovM0[1,2]), y=c(0,dfCovM0[2,2]), z=c(0,dfCovM0[3,2]), col="blue", alpha=.5)
lines3d(x=c(0,dfCovM0[1,3]), y=c(0,dfCovM0[2,3]), z=c(0,dfCovM0[3,3]), col="darkgreen", alpha=.5)

# 2D projections and histograms of the transformed points
par(mfrow = c(2,3))

plot(pca[,c("x","y")], col=rgb(0,0,0,.3), pch=16, main="Projection against 1st and 2nd eigen vector")
plot(pca[,c("y","z")], col=rgb(0,0,0,.3), pch=16, main="Projection against 2nd and 3rd eigen vector")
plot(pca[,c("z","x")], col=rgb(0,0,0,.3), pch=16, main="Projection against 3rd and 1st eigen vector")

hist(pca[,c("x")], breaks=50, xlab="", main="Projection against 1st eigen vector")
hist(pca[,c("y")], breaks=50, xlab="", main="Projection against 2nd eigen vector")
hist(pca[,c("z")], breaks=50, xlab="", main="Projection against 3rd eigen vector")

# export the 3D scene to WebGL and open it in the browser
browseURL(paste("file://", writeWebGL(dir = file.path(tempdir(), "webGL"), width=1000, font="Arial"), sep=""))
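To relate the script to princomp: the following sketch shows roughly how the built-in function reproduces the manual computation (using p and e from the script above). Note that individual components may come out with flipped signs, which does not change the analysis.

# the same analysis with R's built-in princomp()
fit <- princomp(p)

fit$loadings   # the principal components, i.e. the eigenvectors (possibly with flipped signs)
fit$scores     # the points in the new coordinate system, comparable to pca above
fit$sdev^2     # variances along the components; princomp divides by n rather than n - 1,
               # so these differ from e$values by a factor of (n - 1) / n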
Final word
If my explanations didn’t work well for you, or if you would simply like a different take on the subject, then I recommend this text by Lindsay I. Smith. It starts with an introduction to the mathematical tools used, such as the covariance matrix and eigenvalues.