Clustering and PCA
Clustering Analysis
We can now perform hierarchical clustering to check the replicability of the samples.
First, let’s compute the pairwise correlation between the samples and store it in a new object, the matrix fpkms.cor.
fpkms.cor <- cor(fpkms)
We can check this new matrix.
head(fpkms.cor)
What are the dimensions of the correlation matrix? How is this size determined?
We now compute the Euclidean distance (the default) between the samples.
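One way to explore this question is with a small simulated matrix. The sketch below assumes, as elsewhere in this tutorial, that genes are in rows and samples are in columns; the object toy and its dimensions are made up for illustration.

```r
# Toy expression matrix (hypothetical data): 100 genes in rows, 6 samples in columns
set.seed(1)
toy <- matrix(rexp(600), nrow = 100, ncol = 6,
              dimnames = list(paste0("gene", 1:100), paste0("sample", 1:6)))

# cor() correlates the columns of its input with each other
toy.cor <- cor(toy)

dim(toy)      # genes x samples
dim(toy.cor)  # one row and one column per sample
```

Comparing the two dim() results shows which dimension of the input determines the size of the correlation matrix.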
fpkms.dist <- dist(x = fpkms.cor)
We then perform hierarchical clustering of the samples and plot the resulting dendrogram.
fpkms.hclust <- hclust(d = fpkms.dist)
plot(fpkms.hclust)
How many clustering methods are available in the hclust function? Which is the default method?
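As a starting point for this question, the sketch below shows how the linkage is chosen through the method argument of hclust. The toy matrix is hypothetical; the full list of accepted method names is given in ?hclust.

```r
# Toy data (hypothetical): 6 samples in rows, 4 features in columns
set.seed(1)
toy <- matrix(rnorm(24), nrow = 6,
              dimnames = list(paste0("sample", 1:6), NULL))
toy.dist <- dist(toy)  # Euclidean distance between the rows

# The default linkage is stored in the function's formal arguments
formals(hclust)$method

# Other linkages are selected by name through the method argument
hc.default <- hclust(toy.dist)
hc.average <- hclust(toy.dist, method = "average")

# Draw the two dendrograms side by side to compare them
par(mfrow = c(1, 2))
plot(hc.default, main = hc.default$method)
plot(hc.average, main = hc.average$method)
```

Different linkages can group the same samples differently, so it is worth checking whether the replicate structure is stable across a few of them.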
Another way to visualize the pairwise sample correlation and the output of
hierarchical clustering is to make a heatmap annotated with the clustering
dendrograms. The pheatmap function can perform the clustering and generate
the heatmap for us in a single step. We can also provide extra annotation
to visualize the treatment category of each sample.
forAnnotation <-
  meta %>%
  select(SampleID, Treatment_Duration) %>%
  column_to_rownames("SampleID")
pheatmap(
  mat = fpkms.cor,
  annotation_col = forAnnotation,
  annotation_row = forAnnotation
)
Compare this tree with the dendrogram plotted in the previous step by looking at how the replicates are grouped: the clustering is the same.
The default method for calculating the correlation in the cor function is Pearson correlation. How does the output change if we use Spearman correlation instead?
Does the heatmap show what you would expect?
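For the Pearson versus Spearman question above, the following sketch shows how the method is switched in cor and what Spearman correlation computes. The toy matrix is hypothetical and stands in for fpkms.

```r
# Toy data (hypothetical): 50 genes in rows, 4 samples in columns
set.seed(1)
toy <- matrix(rexp(200), nrow = 50,
              dimnames = list(NULL, paste0("sample", 1:4)))

cor.pearson  <- cor(toy)                       # method = "pearson" is the default
cor.spearman <- cor(toy, method = "spearman")  # rank-based correlation

# Spearman correlation is the Pearson correlation of the per-sample ranks
cor.ranks <- cor(apply(toy, 2, rank))
all.equal(cor.spearman, cor.ranks)
```

Because it works on ranks, Spearman correlation is less sensitive to a few very highly expressed genes than Pearson correlation.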
Principal Component Analysis
Principal Component Analysis (PCA) is a dimensionality reduction technique that identifies and projects data onto orthogonal axes (principal components) to capture the most significant variance in the data while minimizing information loss.
PCA is an alternative approach to clustering that can give us additional insight into our data.
We can perform PCA by using the prcomp function available in base R:
pca <-
  fpkms %>%
  t() %>%
  prcomp(center = TRUE, scale. = TRUE)
What do the center and scale. arguments do?
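One way to investigate these two arguments is to reproduce their effect by hand. The sketch below uses a hypothetical toy matrix with samples in rows, in which one gene is deliberately put on a much larger scale than the others.

```r
# Toy data (hypothetical): 6 samples in rows, 5 genes in columns
set.seed(1)
toy <- matrix(rexp(30), nrow = 6)
toy[, 1] <- toy[, 1] * 1000  # one gene on a much larger scale

# Let prcomp() center and scale each column (gene)
pca.auto <- prcomp(toy, center = TRUE, scale. = TRUE)

# Apply the same transformation by hand with scale(), then switch it off in prcomp()
toy.scaled <- scale(toy, center = TRUE, scale = TRUE)
pca.manual <- prcomp(toy.scaled, center = FALSE, scale. = FALSE)
all.equal(pca.auto$sdev, pca.manual$sdev)

# Without scaling, the large-valued gene dominates the first component
pca.unscaled <- prcomp(toy, center = TRUE, scale. = FALSE)
summary(pca.unscaled)$importance["Proportion of Variance", 1]
```

Note that scaling requires every column to have non-zero variance, so genes with constant values across all samples need to be removed beforehand.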
We can check a summary of the output.
summary(pca)
How many principal components explain 90% of the variation in our data?
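The cumulative proportion of variance can also be computed directly from the standard deviations stored in the prcomp output. This is a minimal sketch on hypothetical toy data; the same steps should apply to the pca object created above.

```r
# Toy data (hypothetical): 8 samples in rows, 20 genes in columns
set.seed(1)
toy <- matrix(rnorm(160), nrow = 8)
toy.pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# The variance explained by each component is proportional to sdev squared
var.explained <- toy.pca$sdev^2 / sum(toy.pca$sdev^2)
cum.var <- cumsum(var.explained)

# The same values appear (rounded) in the "Cumulative Proportion" row of summary()
summary(toy.pca)$importance["Cumulative Proportion", ]

# Index of the first component at which the cumulative proportion reaches 90%
n.pc.90 <- which(cum.var >= 0.9)[1]
n.pc.90
```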
Can you make a scatter plot with the values of the first two principal components?
Give it a shot: you can access the matrix with the values for all principal
components by typing pca$x. Also try to color the points
in the scatter plot according to meta$Treatment_Duration.
Tip
To generate the scatter plot, you may use the plot function from base R, or the ggplot and geom_point functions from the ggplot2 package.
Example solution
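A minimal sketch of one possible solution with base R graphics is shown here. To keep it self-contained it runs on simulated stand-ins (toy.fpkms, toy.meta and the treatment labels are all hypothetical); with the tutorial data, pca and meta would take their place, assuming the rows of meta are in the same sample order as the columns of fpkms.

```r
# Stand-ins for the tutorial objects (hypothetical data): 50 genes, 6 samples
set.seed(1)
toy.fpkms <- matrix(rexp(300), nrow = 50,
                    dimnames = list(NULL, paste0("sample", 1:6)))
toy.meta <- data.frame(
  SampleID = paste0("sample", 1:6),
  Treatment_Duration = rep(c("control", "short", "long"), each = 2)
)

# Samples must be in rows for prcomp(), hence the transpose
toy.pca <- prcomp(t(toy.fpkms), center = TRUE, scale. = TRUE)

# Map each treatment category to an integer color code
groups <- factor(toy.meta$Treatment_Duration)
point.cols <- as.integer(groups)

# Scatter plot of the first two principal components
plot(toy.pca$x[, "PC1"], toy.pca$x[, "PC2"],
     col = point.cols, pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(groups),
       col = seq_along(levels(groups)), pch = 19)
```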
Does the output of the PCA agree with that of the clustering analysis?