Data preparation

Data import

We will perform our primary analysis in R. To start you first need to load the data in the current environment.

raw_counts <- vroom("../quantification/raw_counts.tsv")

vroom is a function that reads the data and saves it in a tibble, which is a R object that contains a data-frame. You can check the head and the dimensions of the data by just calling the object you just created.

raw_counts

Tip

The head and dim functions can also be used to check the head and dimensions of an object.

You will also need the sample metadata, and info about the annotated genes.

meta <- vroom("../quantification/metadata.tsv")

gene_annotation <- vroom("../refs/gene_annotation.tsv",
  col_names = c(
    "chrom", "start", "end", "strand",
    "gene_id", "gene_type", "gene_name"
  )
)

What are the dimensions of the metadata and gene annotation tables?

Moreover, for the analysis is necessary that the columns of expression matrix and the rows of meta are in the same order.

raw_counts <-
  raw_counts %>%
  column_to_rownames("gene_id") %>%
  `[`(, meta$SampleID)

Info

The %>% operator is used to chain together operations similar to the use of the pipe | operator in bash.

If the output of the following command is TRUE, it means they are in the same order.

identical(colnames(raw_counts), meta$SampleID)

Data filtering

Genes with very low counts across all samples provide little information for differential expression. Hence, we usually filter these out. This filtering is usually done after normalizing the raw counts for the library size in each sample. This normalization gives us counts per million of reads (CPM) for every gene.

cpms <-
  raw_counts %>%
  as.matrix() %>%
  cpm()

Now for the filtering let’s keep a gene only if its value is greater than 1 cpm across all samples.

Can you filter the raw_counts table to keep the genes that have more than 1 cpm in all our samples?

Give it a shot, and save the filtered table in an object named raw_counts_filtered. If you find it challenging, there is a possible solution available.

Tip

After filtering 12260 genes should remain.

Example solution

cpm_filter <-
  cpms %>%
  # Test which values pass the filter of 1 cpm
  `>`(1) %>%
  # Sum up in how many samples each gene passes the filter
  rowSums() %>%
  # Test which genes pass the filter for all 12 samples
  `==`(12)

raw_counts_filtered <-
  raw_counts[cpm_filter, ]

Data Normalization

For calculating the normalization factors, we first need to put the counts into an edgeR object of class DGEList. We also put the variable of interest, i.e. the Treatment.

dge <-
  DGEList(counts = raw_counts_filtered,
          group = meta$Treatment)

Print the object dge now.

print(dge)

Inside dge, there are two objects:

dge$counts: the count values.
dge$samples: the description of the samples, with the variable of interest we gave (Treatment), the library size and the normalization factors.

The default value for the normalization factors is 1 for all samples. We can use the TMM (trimmed means of M-values) normalization method to correct for library size and composition.

dge <- calcNormFactors(object = dge, method = "TMM")

You can now print dge object and verify that norm.factors have changed.

FPKM matrix

For the next part of the analysis we will need a matrix with the FPKM (Fragments Per Kilobase of gene per Million of reads) values for all genes. To calculate these, we need to know the gene length for all the genes included in the dge object. We can calculate the length values from the information included in the gene_annotation table.

Can you use the information contained in the gene_annotation table to calculate the length of all the genes included in the dge object?

Give it a shot, and save gene length values in an object named gene_lengths. It should be a vector (an object that contains a sequence of different values in R). Also, the order of the lengths should correspond to the order of the genes in the dge$counts table. If you find it challenging, there is a possible solution available.

Tip

You can use the gene coordinates in the gene_annotation table to calculate gene lengths. The length of a gene is equal to end - start + 1.

Example solution

y_genes <-
  gene_annotation %>%
  filter(gene_id %in% rownames(dge$counts)) %>%
  group_by(gene_id) %>%
  summarise(n = n()) %>%
  filter(n == 2) %>%
  pull(gene_id)

# Get gene lengths
gene_lengths <-
  gene_annotation %>%
  # filter for genes in our dataset
  filter(gene_id %in% rownames(dge$counts) &
    # filter out genes in chrY,
    # whose chrX counterpart is in our dataset
    !(gene_id %in% y_genes & chrom == "chrY")) %>%
  # Order genes the same way they are ordered in our expression matrix
  mutate(order = match(gene_id, rownames(dge$counts))) %>%
  arrange(order) %>%
  # Calculate gene length
  mutate(length = end - start + 1) %>%
  pull(length)

Bonus questions

Check the example solution above. Then, try to answer the following questions:

Why do we have +1 in the length calculation formula?
The length we used for the normalization is based on the genomic coordinates of the genes. Is this the right length to use?
What does the y_genes object contain, and how do we use it for filtering? Tip: check the output of the following command: grep("_", rownames(raw_counts), value = TRUE)

Now we can get the fpkm matrix by using the rpkm function. The rpkm function uses the raw gene counts, the calculated normalization factors, and the gene lengths provided to calculate the fpkm values.

fpkms <- rpkm(dge,
  log = TRUE,
  gene.length = gene_lengths
)

Note

log-transformation is just one type of variance stabilization method.

What is the base of the logarithm used by the rpkm function?

Info

Before log-transforming the data we usually add a small value called “pseudocount” to all values of the expression table. This is done to avoid problems arising from applying the transformation to genes with 0 expression.

Does the rpkm function use a pseudocount? If it does, what is the value it assigns to it?