Gene annotation
Attention
Large genomic data should be generally stored in compressed files in order to save disk space.
Info
All standard linux tools like cat
, more
, grep
, etc. have a z
version (e.g. zcat
) that can be used to read gzip compressed files.
Info
The less
tool can read compressed files directly if the LESSOPEN
environment variable is set correctly. It specifies an input formatter
for less. The less package comes with a sample formatter in
/bin/lesspipe
; it decompresses gzipped files, shows content listings
for many multi-file archive formats, and converts several formatted
texts formats to plain text.
First we need to move to the data folder we just created:
cd rnaseq-course
Look at the gene annotation file:
zless refs/gencode.v29.primary_assembly.annotation_UCSC_names.gtf.gz
Type Q to exit the view.
Store information about gene CSDE1 in a separate file:
zgrep CSDE1 refs/gencode.v29.primary_assembly.annotation_UCSC_names.gtf.gz > CSDE1.gff
Count the number of transcripts of this gene:
awk '$3=="transcript"' CSDE1.gff | wc -l
Count the number of exons of this gene:
awk '$3=="exon"' CSDE1.gff | wc -l
What is the biotype of the gene?
What is the biotype of each transcript of this gene?
Advanced: Extract gene identifiers
Sometimes it is useful to get a list of gene identifiers from the
annotation. For example, using the gene_type
GTF tag:
zcat refs/gencode.v29.primary_assembly.annotation_UCSC_names.gtf.gz | awk -F"\t" '$3=="gene" && $9~/gene_type "protein_coding"/' | grep -Eo 'ENS[A-Z]+[0-9]{11}[.]?[0-9]*' | sort -u > protein_coding_IDs.txt
Note that the above command represents one way of extracting this information. There are more efficient and portable ways of doing it and finding a better solution is left as an exercise to the student.