Gene annotation

Attention

Large genomic data should be generally stored in compressed files in order to save disk space.

Info

All standard linux tools like cat, more, grep, etc. have a z version (e.g. zcat) that can be used to read gzip compressed files.

Info

The less tool can read compressed files directly if the LESSOPEN environment variable is set correctly. It specifies an input formatter for less. The less package comes with a sample formatter in /bin/lesspipe; it decompresses gzipped files, shows content listings for many multi-file archive formats, and converts several formatted texts formats to plain text.

First we need to move to the data folder we just created:

cd rnaseq-course

Look at the gene annotation file:

zless refs/gencode.v29.primary_assembly.annotation_UCSC_names.gtf.gz

Type Q to exit the view.

Store information about gene CSDE1 in a separate file:

zgrep CSDE1 refs/gencode.v29.primary_assembly.annotation_UCSC_names.gtf.gz > CSDE1.gff

Count the number of transcripts of this gene:

awk '$3=="transcript"' CSDE1.gff | wc -l

Count the number of exons of this gene:

awk '$3=="exon"' CSDE1.gff | wc -l

What is the biotype of the gene?

What is the biotype of each transcript of this gene?

Advanced: Extract gene identifiers

Sometimes it is useful to get a list of gene identifiers from the annotation. For example, using the gene_type GTF tag:

zcat refs/gencode.v29.primary_assembly.annotation_UCSC_names.gtf.gz | awk -F"\t" '$3=="gene" && $9~/gene_type "protein_coding"/' | grep -Eo 'ENS[A-Z]+[0-9]{11}[.]?[0-9]*' | sort -u > protein_coding_IDs.txt

Note that the above command represents one way of extracting this information. There are more efficient and portable ways of doing it and finding a better solution is left as an exercise to the student.