Data preparation
Data import
We will perform our primary analysis in R
. To start you first need
to load the data in the current environment.
raw_counts <- vroom("../quantification/raw_counts.tsv")
vroom
is a function that reads the data and saves it in a tibble
,
which is a R
object that contains a data-frame. You can check the
head and the dimensions of the data by just calling the object you
just created.
raw_counts
Tip
The head
and dim
functions can also be used to check the head and
dimensions of an object.
You will also need the sample metadata, and info about the annotated genes.
meta <- vroom("../quantification/metadata.tsv")
gene_annotation <- vroom("../refs/gene_annotation.tsv",
col_names = c(
"chrom", "start", "end", "strand",
"gene_id", "gene_type", "gene_name"
)
)
What are the dimensions of the metadata and gene annotation tables?
Moreover, for the analysis is necessary that the columns of expression matrix and the rows of meta are in the same order.
raw_counts <-
raw_counts %>%
column_to_rownames("gene_id") %>%
`[`(, meta$SampleID)
Info
The %>%
operator is used to chain together operations similar
to the use of the pipe |
operator in bash
.
If the output of the following command is TRUE, it means they are in the same order.
identical(colnames(raw_counts), meta$SampleID)
Data filtering
Genes with very low counts across all samples provide little information for differential expression. Hence, we usually filter these out. This filtering is usually done after normalizing the raw counts for the library size in each sample. This normalization gives us counts per million of reads (CPM) for every gene.
cpms <-
raw_counts %>%
as.matrix() %>%
cpm()
Now for the filtering let’s keep a gene only if its value is greater than 1 cpm across all samples.
Can you filter the raw_counts
table to keep the genes that have more than 1 cpm in all our samples?
Give it a shot, and save the filtered table in an object named
raw_counts_filtered
. If you find it challenging, there is a possible
solution available.
Tip
After filtering 12260 genes should remain.
Example solution
1 2 3 4 5 6 7 8 9 10 11 |
|
Data Normalization
For calculating the normalization factors, we first need to put the
counts into an edgeR
object of class DGEList
. We also put the
variable of interest, i.e. the Treatment.
dge <-
DGEList(counts = raw_counts_filtered,
group = meta$Treatment)
Print the object dge now.
print(dge)
Inside dge, there are two objects:
dge$counts
: the count values.dge$samples
: the description of the samples, with the variable of interest we gave (Treatment), the library size and the normalization factors.
The default value for the normalization factors is 1
for all samples. We can
use the TMM
(trimmed means of M-values) normalization method to correct for
library size and composition.
dge <- calcNormFactors(object = dge, method = "TMM")
You can now print dge
object and verify that norm.factors
have
changed.
FPKM matrix
For the next part of the analysis we will need a matrix with the FPKM
(Fragments Per Kilobase of gene per Million of reads) values for all genes. To
calculate these, we need to know the gene length for all the genes included in
the dge
object. We can calculate the length values from the information
included in the gene_annotation
table.
Can you use the information contained in the gene_annotation
table to calculate the length of all the genes included in the dge
object?
Give it a shot, and save gene length values in an object named
gene_lengths
. It should be a vector
(an object that contains a sequence
of different values in R
). Also, the order of the lengths should
correspond to the order of the genes in the dge$counts
table. If you find
it challenging, there is a possible solution available.
Tip
You can use the gene coordinates in the gene_annotation
table to
calculate gene lengths. The length of a gene is equal to end - start + 1
.
Example solution
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
|
Bonus questions
Check the example solution above. Then, try to answer the following questions:
- Why do we have
+1
in the length calculation formula? - The length we used for the normalization is based on the genomic coordinates of the genes. Is this the right length to use?
- What does the
y_genes
object contain, and how do we use it for filtering? Tip: check the output of the following command:grep("_", rownames(raw_counts), value = TRUE)
Now we can get the fpkm matrix by using the rpkm
function. The
rpkm
function uses the raw gene counts, the calculated normalization
factors, and the gene lengths provided to calculate the fpkm values.
fpkms <- rpkm(dge,
log = TRUE,
gene.length = gene_lengths
)
Note
log-transformation is just one type of variance stabilization method.
What is the base of the logarithm used by the rpkm
function?
Info
Before log-transforming the data we usually add a small value called “pseudocount” to all values of the expression table. This is done to avoid problems arising from applying the transformation to genes with 0 expression.
Does the rpkm
function use a pseudocount? If it does, what is the value it assigns to it?