Clustering and PCA

Clustering Analysis

We can now perform hierarchical clustering to check how well the replicates agree with each other.

Let’s first compute the pairwise correlation between the samples and store it in a new object, the matrix fpkms.cor.

fpkms.cor <- cor(fpkms)

We can check this new matrix.

head(fpkms.cor)

What are the dimensions of the correlation matrix? How is this size determined?

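We now compute the Euclidean distance (the default of the dist function) between the samples. Note that dist operates on the rows of fpkms.cor, so each sample is compared with the others through its profile of correlations.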

fpkms.dist <- dist(x = fpkms.cor)

We then perform hierarchical clustering of the samples and plot the resulting dendrogram.

fpkms.hclust <- hclust(d = fpkms.dist)
plot(fpkms.hclust)
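The three steps above (correlation, distance, clustering) can be tried on a small simulated matrix where the expected grouping is known in advance. This is only a sketch: the object names (toy, toy.cor, toy.groups, and so on) and the simulated values are hypothetical and are not part of the workshop data.

```r
# Hypothetical toy data: 100 genes (rows) x 6 samples (columns),
# two conditions with three replicates each.
set.seed(1)
profile_a <- rnorm(100, mean = 5)
profile_b <- rnorm(100, mean = 5)
toy <- cbind(
  A1 = profile_a + rnorm(100, sd = 0.2),
  A2 = profile_a + rnorm(100, sd = 0.2),
  A3 = profile_a + rnorm(100, sd = 0.2),
  B1 = profile_b + rnorm(100, sd = 0.2),
  B2 = profile_b + rnorm(100, sd = 0.2),
  B3 = profile_b + rnorm(100, sd = 0.2)
)

# Pairwise correlation between the columns (samples)
toy.cor <- cor(toy)

# Distance between the rows of the correlation matrix
toy.dist <- dist(x = toy.cor)

# Hierarchical clustering with the default linkage method
toy.hclust <- hclust(d = toy.dist)

# Cut the tree into two groups and inspect the assignment per sample
toy.groups <- cutree(toy.hclust, k = 2)
print(toy.groups)
```

Because the replicates of each simulated condition share the same underlying profile, they should fall into the same group. The same check with cutree can be applied to fpkms.hclust.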

How many clustering methods are available in the hclust function? Which is the default method?

Another way to visualize the pairwise sample correlation and the output of hierarchical clustering is to make a heatmap annotated with the clustering dendrograms. The pheatmap function can perform the clustering and generate the heatmap for us in a single step. We can also provide extra annotation to visualize the treatment category of each sample.

forAnnotation <-
  meta %>%
  select(SampleID, Treatment_Duration) %>%
  column_to_rownames("SampleID")

pheatmap(
  mat = fpkms.cor,
  annotation_col = forAnnotation,
  annotation_row = forAnnotation
)

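You can compare this tree with the one plotted in the previous step by looking at how the clustering groups the replicates. The grouping should be the same, since by default pheatmap clusters with the same distance measure and linkage method.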

The default method for calculating the correlation in the cor function is Pearson correlation. How does the output change if we use Spearman correlation instead?

Does the heatmap show what you would expect?

Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction technique that identifies and projects data onto orthogonal axes (principal components) to capture the most significant variance in the data while minimizing information loss.
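The two properties mentioned above, orthogonal axes and components ordered by the variance they capture, can be verified numerically. The following is a minimal sketch on simulated data; the matrix toy and its variables v1, v2 and v3 are hypothetical.

```r
# Hypothetical toy data: 50 observations of 3 variables,
# where v1 and v2 are strongly correlated and v3 is independent noise.
set.seed(42)
x <- rnorm(50)
toy <- cbind(
  v1 = x + rnorm(50, sd = 0.3),
  v2 = 2 * x + rnorm(50, sd = 0.3),
  v3 = rnorm(50)
)

toy.pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# The loading vectors are orthonormal: their cross-product is
# (numerically) the identity matrix.
print(round(crossprod(toy.pca$rotation), 10))

# The scores on different components are uncorrelated.
print(round(cor(toy.pca$x), 10))

# The variance captured by each component, in decreasing order.
print(toy.pca$sdev^2)
```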

PCA is a complementary approach to clustering that can give us further insight into the structure of our data.

We can perform PCA with the prcomp function from base R. Because prcomp expects the samples in rows, we first transpose the matrix with t().

pca <-
  fpkms %>%
  t() %>%
  prcomp(center = TRUE, scale. = TRUE)

What do the center and scale options do?

We can check a summary of the output.

summary(pca)
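The proportions reported by summary can also be derived by hand from the standard deviations stored in the prcomp output, which makes it easy to work with them programmatically. This is a sketch on random toy data; the names toy, var.explained, cum.var and n.pcs are hypothetical, and the 0.9 threshold is only an illustration.

```r
# Hypothetical toy data: 8 samples (rows) x 20 features (columns)
set.seed(7)
toy <- matrix(rnorm(8 * 20), nrow = 8, ncol = 20)
toy.pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
var.explained <- toy.pca$sdev^2 / sum(toy.pca$sdev^2)

# Cumulative proportion, as shown in the last row of summary(toy.pca)
cum.var <- cumsum(var.explained)
print(round(cum.var, 3))

# Index of the first component at which the cumulative proportion
# reaches the chosen threshold
n.pcs <- which(cum.var >= 0.9)[1]
print(n.pcs)
```

Replacing toy.pca with pca applies the same calculation to the workshop data.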

How many principal components explain 90% of the variation in our data?

Can you make a scatter plot with the values of the first two principal components?

Give it a shot! You can access the matrix with the values for all principal components by typing pca$x. Try also to color the points in the scatter plot according to meta$Treatment_Duration.

Tip

In order to generate the scatter plot, you may use the plot function from base R, or even the ggplot and geom_point functions from the ggplot2 package.
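As a rough illustration of the base R route, the sketch below plots the first two principal components of a simulated data set and colors the points by group. The data, the group labels ("control", "treated") and the object names are all hypothetical; with the workshop data you would use pca$x and the treatment column of meta instead.

```r
# Hypothetical toy data: 6 samples (rows) x 10 features (columns),
# the last three samples are shifted to mimic a treatment effect.
set.seed(3)
toy <- rbind(
  matrix(rnorm(3 * 10, mean = 0), nrow = 3),
  matrix(rnorm(3 * 10, mean = 3), nrow = 3)
)
group <- factor(rep(c("control", "treated"), each = 3))

toy.pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Scatter plot of PC1 against PC2, one color per group
plot(
  x = toy.pca$x[, "PC1"],
  y = toy.pca$x[, "PC2"],
  col = as.integer(group),
  pch = 19,
  xlab = "PC1",
  ylab = "PC2"
)
legend(
  "topright",
  legend = levels(group),
  col = seq_along(levels(group)),
  pch = 19
)
```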

Example solution

pca$x %>%
  # Change matrix to a data.frame
  as.data.frame() %>%
  # Add column with annotation about treatment duration
  cbind(Treatment_Duration = meta$Treatment_Duration) %>%
  # Make base ggplot
  ggplot(aes(PC1, PC2,
    col = Treatment_Duration,
    fill = Treatment_Duration
  )) +
  # Add ellipse layer
  stat_ellipse(
    geom = "polygon", level = 0.80,
    col = "black", alpha = 0.5
  ) +
  # Add scatter plot layer
  geom_point(shape = 21, col = "black")

Does the output of the PCA agree with that of the clustering analysis?