1. Introduction
The RNA-seq, ChIP-seq and ATAC-seq data used for the analysis come from forebrain, heart and liver of mouse embryos at 12.5 days of development, in 2 bioreplicates (1% of the reads are randomly sampled here). These data are provided by ENCODE: Encyclopedia of DNA Elements (www.encodeproject.org/).
Due to time and memory restrictions, only one sample is used for the data processing. This sample was built from one hundred thousand alignment-based pre-filtered reads. A small sample of the genome and annotation (21 chromosomes, 1 Mb long) is used as the reference.
Browse the content using the left panel. |
A copy of the slides of the course can be found here: Introduction; Basic concepts; RNA-seq processing; Gene level RNA-seq analysis; Isoform level RNA-seq analysis; Multiomics. |
2. Basic concepts and setup
2.1. Hands-on setup (done once)
We are going to run all the commands of the hands-on within a Docker container using basic Linux commands and scripts from Git.
2.1.1. Bash shell
Linux and Mac : The Bash shell is available on Linux and Mac OS.
Windows : Use VirtualBox or VMware Player to import this virtual machine with Ubuntu 18.04 and Docker pre-installed. Follow the instructions provided by Diego Garrido here.
Access your Bash shell console and test the commands below.
2.1.2. Basic Linux commands
Browse the directory structure
pwd | tells you where you are |
ls | list the content of the current directory |
ls <directory> | list the content of a directory |
cd <directory> | go to the specified directory |
cd | go to your home directory |
cd .. | go to the parent directory |
tree <directory> | list the content of a directory in a tree-like format |
mkdir <directory> | creates the specified directory |
View the content of a file
less <file> | view text with paging |
head <file> | prints first lines of a file |
tail <file> | prints last lines of a file |
cat <file> | print content of a file to the screen |
zcat <file.gz> | print content of a gzip-compressed file to the screen |
File manipulations
rm <file> | remove file |
cp <file1> <file2> | copy file1 into file2 |
mv <file1> <file2> | rename file1 to file2 |
Some other useful commands
grep <pattern> <file> | show lines of text containing a given pattern |
grep -v <pattern> <file> | show lines of text not containing a given pattern |
sort <file> | sort lines of text files |
wc <file> | count words, lines and characters |
> (redirection) | allows to redirect the output to a file |
pipe (|) | allows to send output from one program to another |
cut | to extract portions of a file by selecting columns |
echo | input a line of text and display it on standard output |
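As a quick illustration, a few of these commands can be chained together. A minimal sketch, assuming the course annotation file has already been downloaded (see the Data section):
# count the lines of the annotation that mention exons
grep exon /tutorial/refs/gencode.vM4.gtf | wc -l
# extract the chromosome and feature columns, sort them and save the result to a file
cut -f1,3 /tutorial/refs/gencode.vM4.gtf | sort > chrom_features.txt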
AWK programming
AWK is a UNIX text-processing programming language. A fast and stable tool for processing text files.
awk '/www/' file | search for the pattern 'www' in each line of the file |
awk '$3 ~ /www/' file | search for the pattern 'www' in the third column of the file |
awk 'length($0) > 80' file | print every line in the file that is longer than 80 characters |
awk 'NR % 2 == 0' file | print even-numbered lines of the file |
Some built-in variables
NR | Number of records |
NF | Number of fields |
FS | Field separator character |
OFS | Output field separator character |
See www.grymoire.com/Unix/Awk.html and www.tutorialspoint.com/awk/awk_basic_examples.htm for more information |
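A small, hedged example of these variables in action, using the tab-separated course annotation:
# print the number of fields of the first line of a tab-separated file
head -1 /tutorial/refs/gencode.vM4.gtf | awk 'BEGIN{FS="\t"}{print NF}'
# print chromosome and feature type separated by a comma for the first lines
awk 'BEGIN{FS="\t"; OFS=","}{print $1,$3}' /tutorial/refs/gencode.vM4.gtf | head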
Writing and editing files
There are several ways to write and edit files.
The most commonly used text editors are vi, emacs and nano.
Vi
Vi is a text editor with a command mode interface. See this article for a beginner’s guide.
To create a new file in the current folder just call vi with the name of the file:
vi file.txt
Vi has several modes which are specified in the bottom part of screen.
When you first open the editor you cannot write to the file, because Vi starts in command mode.
To start writing, press i, and you will notice that the status at the bottom of the screen changes to INSERT.
Now you can paste the text by right-click with the mouse and paste, or by pressing SHIFT+Insert.
To go back to command mode press ESC.
To save, make sure you are in command mode and type:
:wq
To undo, press u when you are in command mode.
Emacs
Emacs is a text editor driven by keyboard shortcuts rather than editing modes. See this article for a beginner’s guide.
emacs file.txt
When you are done, save the file (by pressing Ctrl+x Ctrl+s) and close Emacs (by pressing Ctrl+x Ctrl+c).
Nano
Nano is a simple text editor that shows its keyboard shortcuts at the bottom of the screen. See this article for a beginner’s guide.
To create a new file in the current folder just call nano with the name of the file:
nano file.txt
When you are done, close Nano by pressing Ctrl+X and press Y to save the changes.
2.1.3. Docker
Access your Bash shell console and test the commands in the cheatsheet below.
General commands
dockerd | start the docker daemon |
docker --help | get help with Docker. Can also use --help on all subcommands |
docker info | display system-wide information |
Images
docker images | list local images |
docker rmi <image_name> | delete an image |
docker image prune | remove all unused images |
Containers
docker run --name <container_name> <image_name> | create and run a container from an image, with a custom name |
docker ps | list currently running containers |
docker ps --all | list all docker containers (running and stopped) |
docker stats | view resource usage stats |
See docs.docker.com/get-started/docker_cheatsheet.pdf for more information |
Pull the image from DockerHub:
docker pull ceciliaklein/teaching:uvic
Run useful commands to check the images available.
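For example, a minimal check, assuming the pull completed successfully:
# list local images; the ceciliaklein/teaching repository should appear with the uvic tag
docker images
# or filter for that specific repository
docker images ceciliaklein/teaching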
2.1.4. Data
All the data needed for the course is stored in the following directory:
https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/data/tutorial.tar.gz
Download and uncompress the data:
wget https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/data/tutorial.tar.gz
tar -zxvf tutorial.tar.gz
We are going to use the folder tutorial
as our working directory. Move to it:
cd tutorial
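A quick sanity check of the extracted data (a sketch; data, refs and palettes are the folders used throughout the hands-on):
# list the working directory and check the size of the main data folders
ls
du -sh data refs palettes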
2.1.5. Scripts from GitHub
The scripts used in this course are available in GitHub:
git clone https://github.com/CeciliaKlein/teaching-utils.git
Review Git hands-on by Diego Garrido here. |
Check which scripts and R wrappers are available and the extension of each file.
ls teaching-utils
Which extensions are available? Check the overall content of some file types by using the basic Linux commands.
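One possible way to list the extensions, using only the basic commands introduced above:
# count files per extension in the cloned repository
ls teaching-utils | awk -F. 'NF>1{print $NF}' | sort | uniq -c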
2.2. Run container (repeat every restart of hands-on)
We are going to run all the commands of the hands-on within a Docker container. Check the path to the folder where you saved the data for the course.
In order to run the container shell you need to run the following command in your terminal:
Replace /path/to/tutorial with the path where the tutorial data is stored on your computer. Use pwd to check your current location. |
docker run -i -t -v /path/to/tutorial:/tutorial -w /tutorial ceciliaklein/teaching:uvic
You should get a similar output:
(main) root@516a2307ed2b:/tutorial#
To access the scripts without indicating the full path, add the /tutorial/teaching-utils folder to the PATH environment variable:
export PATH=$PATH:/tutorial/teaching-utils/;
Confirm the previous commands worked by looking at the options for all the R scripts that we are using in the analysis by typing --help after the command (e.g. edgeR.analysis.R --help).
We need to run the commands in this section Run container (repeat every restart of hands-on) (i.e. docker run and export) every time we restart the hands-on. |
2.3. Reference gene annotation
Look at the mouse gene annotation (ensembl v65, long genes):
more /tutorial/refs/gencode.vM4.gtf
Type q to exit the view.
Store information about gene Sec23a in a separate file:
grep Sec23a /tutorial/refs/gencode.vM4.gtf > Sec23a.gtf
Count the number of transcripts of this gene:
awk '$3=="transcript"' Sec23a.gtf | wc -l
Count the number of exons of this gene:
awk '$3=="exon"' Sec23a.gtf | wc -l
What is the biotype of the gene?
What is the biotype of each transcript of this gene?
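One possible way to answer these questions (a sketch; it simply pulls the gene_type and transcript_type attributes out of the GTF attribute field):
# biotype of the gene
awk '$3=="gene"' Sec23a.gtf | grep -o 'gene_type "[^"]*"'
# biotype of each transcript of the gene
awk '$3=="transcript"' Sec23a.gtf | grep -o 'transcript_type "[^"]*"'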
2.3.1. Extract gene identifiers
Sometimes it is useful to get a list of gene identifiers from the annotation. For example, using the gene_type GTF tag:
# select protein-coding gene ids
awk '$3=="gene" && $0~/gene_type "protein_coding"/{ match($0, /gene_id "([^"]+)/, id); print id[1] }' /tutorial/refs/gencode.vM4.gtf | sort > protein_coding_IDs.txt
# select long non-coding gene ids
awk '$3=="gene" && $0~/gene_type "lincRNA"/{ match($0, /gene_id "([^"]+)/, id); print id[1] }' /tutorial/refs/gencode.vM4.gtf | sort > lincRNA_IDs.txt
2.4. Common file formats
2.4.1. FASTA
FASTA is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes.
Let’s have a look at the genome FASTA file we use in the workshop:
more /tutorial/refs/genome.fa
>chr1 (1)
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN (2)
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
1 | sequence description, can be a long description (see below) |
2 | raw sequence |
Type q to exit the view.
The word following the > symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the > and the first letter of the identifier. It is recommended that all lines of text be shorter than 80 characters. The sequence ends if another line starting with a > appears; this indicates the start of another sequence.
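A couple of quick checks on the FASTA file (a sketch; header lines are the ones starting with >):
# number of sequences in the genome FASTA file
grep -c ">" /tutorial/refs/genome.fa
# list the first sequence identifiers
grep ">" /tutorial/refs/genome.fa | head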
2.4.2. FASTQ
FASTQ is a text-based format for storing biological sequences and their corresponding quality scores.
Have a look at one of the FASTQ files for the workshop:
zcat /tutorial/data/mouse_brain_rep1_1.fastq.gz | head -4
@HWI-ST985:73:C08BWACXX:8:1101:1920:2006 1:N:0: (1)
NTGCTCGGCCTCTTTCAGCTGTTTCTGCAGCTGCTGAATATCACTGTCTCTCTTCTCTACTTCTTTCTCTAAAGCCTGCATTTCGTGGTGAACTTTTCCCT (2)
+ (3)
#1=DDFFFHHHHHJJHIIJJJJJJJJJJJIIJGIJJJGIFIGEIGIGHIIIJJIJJIJIJIJJJJIJHJJHHIIIHHHGHHDFBCDCDBCCDDDDDDDDDD (4)
1 | sequence id - begins with the @ character and is followed by a sequence identifier and an optional description |
2 | raw sequence |
3 | begins with the + character and is optionally followed by the same sequence identifier (and any description) again |
4 | quality - encodes the quality values for the sequence and must contain the same number of symbols |
The fourth line can also begin with @ depending on the quality encoding (see below). |
Quality encoding
A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect). The most commonly used mapping is the Phred quality score, Q = -10 * log10(p), stored as a single ASCII character with a fixed offset:
offset | max Phred score range | max ASCII range | real-world Phred score range | real-world ASCII range |
---|---|---|---|---|
33 | 0 - 93 | 33 - 126 | 0 - 40 | 33 - 73 |
64 | 0 - 62 | 64 - 126 | 0 - 40 | 64 - 104 |
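As a small worked example of the encoding (assuming the Phred+33 offset used by current Illumina data), the quality character J has ASCII code 74, i.e. Q = 74 - 33 = 41:
# convert the quality character 'J' to a Phred score with the +33 offset
printf 'J' | od -An -tu1 | awk '{print $1 - 33}'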
2.4.3. SAM
Sequence Alignment/Map format is a TAB-delimited text format for storing sequence alignment information. See here for the full format specification.
Header
SAM files can have an optional header section. If present, it must come before the alignments.
Header lines start with @, while alignment lines do not.
@HD
The @HD header line describes the header, with the format version (VN) and the sorting order of alignments (SO).
@HD VN:1.4 SO:coordinate
@SQ
The @SQ header lines describe the reference sequence dictionary. The order of @SQ lines defines the alignment sorting order.
@SQ SN:chr1 LN:197195432
@SQ SN:chr10 LN:129993255
...
@PG
The @PG line describes the program used to generate the file. It contains the program id (ID), name (PN), version (VN) and command line (CL).
@PG ID:STAR PN:STAR VN:STAR_2.4.2a CL:STAR --runThreadN 2 --genomeDir refs/mouse_genome_mm9_STAR_index --readFilesIn data/mouse_cns_E14_rep1_1.fastq.gz data/mouse_cns_E14_rep1_2.fastq.gz --readFilesCommand pigz -p2 -dc --outFileNamePrefix mouse_cns_E14_rep1 --outSAMtype BAM SortedByCoordinate --outSAMattributes NH HI AS NM MD --outSAMunmapped Within --outFilterType BySJout --quantMode TranscriptomeSAM
Alignments
Each alignment line represents the linear alignment of a sequence and has 11 mandatory fields. These fields always appear in the same order and must be present, but their values can be 0 or * (depending on the field) if the corresponding information is unavailable. Each line can also have optional extra fields.
HWI-ST985:73:C08BWACXX:6:1103:3691:25179 (1) 99 (2) chr1 (3) 3157636 (4) 255 (5) 101M (6) = (7) 3157698 (8) 163 (9) GTTTGGTAAGAAAAACAACTTAATTACTACTTCAGAATTTATTTCATTTTTTTTCTAAAGAGTAAGAAGGAAGATTCTATGTTCACATATTTTGGTGCTCA (10) BB@FFFDFFHHHGJJIIJJJJIHIIIJJIIJJJCHFGHHJDHIJIJIJIJJJJIIGGHGHAE>EEHGHFFEFFEECCCDDEDEEEDDDDEEEEDBBDCCCC (11) NH:i:1 HI:i:1 AS:i:200 NM:i:0 MD:Z:101 (12)
1 | sequence id |
2 | flag |
3 | reference sequence name |
4 | alignment position (1-based ) |
5 | mapping quality |
6 | CIGAR string |
7 | reference sequence name of the primary alignment of the mate (= if it is the same as the current alignment) |
8 | position of the primary alignment of the mate |
9 | observed fragment length |
10 | sequence |
11 | quality |
12 | optional fields |
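The flag field is a bitwise combination of properties of the alignment, and samtools can decode it. For example, the flag 99 of the record above corresponds to a properly paired, first-in-pair read whose mate maps to the reverse strand:
# decode a SAM flag value
samtools flags 99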
CIGAR string
The most common CIGAR operations are:
1 | M | alignment match |
2 | I | insertion to the reference |
3 | D | deletion from the reference |
4 | N | skipped region from the reference (e.g. an intron) |
5 | S | soft clipping, clipped sequences present in SEQ |
6 | H | hard clipping, clipped sequences NOT present in SEQ |
101M == the entire read was mapped without gaps (continuous mapping)
51M1217N50M == the read was mapped with a 1217 nt gap (split mapping)
BAM
BAM is a BGZF compressed version of the SAM format. BGZF is block compression implemented on top of the standard gzip file format. The goal of BGZF is to provide good compression while allowing efficient random access to the BAM file for indexed queries. The BGZF format is ‘gunzip compatible’, in the sense that a compliant gunzip utility can decompress a BGZF compressed file.
Random access
BGZF files support random access through the BAM file index. Indexing aims to achieve fast retrieval of alignments overlapping a specified region without going through all the alignments. The BAM file must be sorted by reference ID and then by leftmost coordinate before it can be indexed. The standard format for indexing BAM files is the BAI format.
The BAI format is based on an underlying binning scheme. Each bin represents a contiguous genomic region which is either fully contained in or non-overlapping with another bin; each alignment is associated with a bin which represents the smallest region containing the entire alignment. To find the alignments that overlap a specified region, we need to get the bins that overlap the region, and then test each alignment in the bins to check overlap.
Many programs need an accompanying BAI index for every input BAM. They usually assume the index is located next to the input file in the filesystem and has the same name as the input file, with the .bai extension appended. E.g.:
mouse_brain_rep1_m4_n10_toGenome.bam
mouse_brain_rep1_m4_n10_toGenome.bam.bai
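If a BAM file is not yet sorted and indexed, samtools can do both. A generic sketch, where sample.bam is a placeholder file name:
# sort by coordinate and index; the index is written to sample.sorted.bam.bai
samtools sort -o sample.sorted.bam sample.bam
samtools index sample.sorted.bam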
We will have a look at the contents of a BAM file in the [GRAPE] section.
CRAM
CRAM is a framework technology with highly efficient and tunable reference-based compression of sequence data and a data format that is directly available for computational use. It supports different compression formats (gzip, bzip2, CRAM records). CRAM records use a wide range of lossless and lossy encoding strategies. See here for the full format specification.
2.4.4. GTF
GTF (Gene Transfer Format) is a refinement to GFF that tightens the specification. The first eight GTF fields are the same as GFF. The group field has been expanded into a list of attributes. Each attribute consists of a type/value pair. Attributes must end in a semicolon, and be separated from any following attribute by exactly one space.
The attribute list must begin with the two mandatory attributes:
gene_id |
A globally unique identifier for the genomic source of the sequence. |
transcript_id |
A globally unique identifier for the predicted transcript. |
Check the ninth GTF field of the sample annotation of the workshop:
cut -f9 refs/gencode.vM4.gtf | head -1
gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; level 2; havana_gene "OTTMUSG00000049935.1";
Look at the first line:
head -1 refs/gencode.vM4.gtf | tr '\t' '\n'
chr1 (1) HAVANA (2) gene (3) 3073253 (4) 3074322 (5) . (6) + (7) . (8) gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; level 2; havana_gene "OTTMUSG00000049935.1"; (9)
1 | seqname - The name of the sequence. Must be a chromosome or scaffold. |
2 | source - The program that generated this feature. |
3 | feature - The name of this type of feature. Some examples of standard feature types are CDS , start_codon , stop_codon , and exon . |
4 | start - The starting position of the feature in the sequence (1-based ). |
5 | end - The ending position of the feature (inclusive). |
6 | score - A score between 0 and 1000. |
7 | strand - Valid entries include + , - , or . (for don’t know/don’t care). |
8 | frame - If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be . . |
9 | group - All lines with the same group are linked together into a single item. Note the gene_id and transcript_id mandatory attributes. |
Feature types
The third field in the GFF specification represents the feature type of the line. In addition to the standard features mentioned above there can be exon, gene and transcript.
Look for the first of each feature in the sample annotation of the workshop:
awk '$3=="exon"' refs/gencode.vM4.gtf | head -1
chr1 HAVANA exon 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_status "KNOWN"; transcript_name "4933401J01Rik-001"; exon_number 1; exon_id "ENSMUSE00001343744.1"; level 2; transcript_support_level "NA"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";
awk '$3=="gene"' refs/gencode.vM4.gtf | head -1
chr1 HAVANA gene 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; level 2; havana_gene "OTTMUSG00000049935.1";
awk '$3=="transcript"' refs/gencode.vM4.gtf | head -1
chr1 HAVANA transcript 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_status "KNOWN"; transcript_name "4933401J01Rik-001"; level 2; transcript_support_level "NA"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";
2.4.5. BED
BED (Browser Extensible Data) format provides a flexible way to represent genomic regions. It is also widely used to define the data lines that are displayed in an annotation track of the UCSC genome browser. BED lines have three required fields and nine additional optional fields. For this reason there are several BEDn formats (e.g. BED9, BED12) depending on the number of optional fields. The number of fields per line must be consistent throughout any single file.
The first three required BED fields are:
1 | chrom | The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671). |
2 | chromStart | The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0. |
3 | chromEnd | The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99. |
The 9 additional optional BED fields are:
4 | name | Defines the name of the BED line. |
5 | score | A score between 0 and 1000. |
6 | strand | Defines the strand: either + or -. |
7 | thickStart | The starting position at which the feature is drawn thickly (for example, the start codon in gene displays). When there is no thick part, thickStart and thickEnd are usually set to the chromStart position. |
8 | thickEnd | The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays). |
9 | itemRgb | An RGB value of the form R,G,B (e.g. 255,0,0). |
10 | blockCount | The number of blocks (exons) in the BED line. |
11 | blockSizes | A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount. |
12 | blockStarts | A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount. |
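Since GTF coordinates are 1-based and inclusive while BED starts are 0-based, converting between the two requires shifting the start by one. A minimal sketch that writes the gene features of the course annotation as BED6:
# GTF (1-based) to BED (0-based, half-open): start-1, end unchanged
awk 'BEGIN{FS=OFS="\t"} $3=="gene"{print $1, $4-1, $5, ".", 0, $7}' /tutorial/refs/gencode.vM4.gtf | head -3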
BedGraph
The bedGraph format allows display of continuous-valued data. This display type is useful for values like probability scores and transcriptome data. It is also widely used to define the data lines that are displayed in a track of the UCSC genome browser. The bedGraph format is an older format used to display sparse data or data that contains elements of varying size.
The bedGraph format is line-oriented. BedGraph data can be preceded by a track definition line, which adds a number of options for controlling the default display of this track. Following the track definition line are the track data in four-column BED format:
chromA chromStartA chromEndA dataValueA
chromB chromStartB chromEndB dataValueB
BedGraph data values can be integer or real, positive or negative. Chromosome positions are specified as 0-relative: the first chromosome position is 0, and the last position in a chromosome of length N would be N - 1. Only positions specified have data. All positions specified in the input data must be in numerical order.
2.4.6. BigWig
The bigWig format is for display of dense, continuous data. BigWig data can be displayed in the UCSC Genome Browser as a graph. BigWig files are created initially from bedGraph or wiggle type files. The resulting bigWig files are in an indexed binary format. The main advantage of bigWig files is that only the portions of the file needed to display a particular region are transferred to UCSC, so for large data sets bigWig is considerably faster than regular files. The bigWig file remains on your web-accessible server (http, https, or ftp), not on the UCSC server. Only the portion that is needed for the chromosomal position you are currently viewing is locally cached as a "sparse file".
We will see an example of BigWig usage in the section Visualize RNA-seq signal in the UCSC genome browser.
Wiggle
The wiggle (WIG) format is an older format for display of dense, continuous data such as GC percent, probability scores, and transcriptome data. Wiggle data elements must be equally sized.
For speed and efficiency, wiggle data is compressed and stored internally in 128 unique bins. This compression means that there is a minor loss of precision when data is exported from a wiggle track in the UCSC table browser. The bedGraph format should be used if it is important to retain exact data when exporting.
Wiggle format is line-oriented. For wiggle custom tracks in the UCSC browser, the first line must be a track definition line, which designates the track as a wiggle track and adds a number of options for controlling the default display.
Wiggle format is composed of declaration lines and data lines, and requires a separate wiggle track definition line. There are two options for formatting wiggle data: variableStep and fixedStep. These formats were developed to allow the file to be written as compactly as possible.
- variableStep is for data with irregular intervals between new data points and is the more commonly used wiggle format. The variableStep section begins with a declaration line and is followed by two columns containing chromosome positions and data values:
variableStep chrom=chrN [span=windowSize]
chromStartA dataValueA
chromStartB dataValueB
- fixedStep is for data with regular intervals between new data values and is the more compact wiggle format. The fixedStep section begins with a declaration line and is followed by a single column of data values:
fixedStep chrom=chrN start=position step=stepInterval [span=windowSize]
dataValue1
dataValue2
Wiggle track data values can be integer or real, positive or negative. Chromosome positions are specified as 1-relative: for a chromosome of length N, the first position is 1 and the last position is N. Only positions specified have data. All positions specified in the input data must be in numerical order.
Extracting Data from the bigWig Format
Because the bigWig files are indexed binary files, they can be difficult to extract data from. Consequently, the following programs developed at the UCSC can be used to extract data from bigWig files:
bigWigToBedGraph | this program converts a bigWig file to ASCII bedGraph format. |
bigWigToWig | this program converts a bigWig file to wig format. |
bigWigSummary | this program extracts summary information from a bigWig file. |
bigWigAverageOverBed | this program computes the average score of a bigWig over each bed, which may have introns. |
bigWigInfo | this program prints out information about a bigWig file. |
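For example, to dump the signal of a region as text (a sketch; signal.bw is a placeholder for any bigWig file, such as the one generated later in the RNA-seq signal section):
# extract the signal over a region of chr1 in bedGraph format
bigWigToBedGraph -chrom=chr1 -start=3000000 -end=3100000 signal.bw region.bedGraph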
3. RNA-seq data processing
3.1. Fastq files and read QC
Create a folder for the fastqc output:
mkdir -p fastqc
Run fastqc:
fastqc -o fastqc -f fastq /tutorial/data/mouse_brain_rep1_1.fastq.gz
You can view fastqc results on the browser outside the container:
fastqc/mouse_brain_rep1_1_fastqc.html
3.2. Mapping
Create a folder for the alignment steps:
mkdir alignments
Create the genome index for STAR:
mkdir genomeIndex
STAR --runThreadN 2 \
--runMode genomeGenerate \
--genomeDir genomeIndex \
--genomeFastaFiles /tutorial/refs/genome.fa \
--sjdbGTFfile /tutorial/refs/anno.gtf \
--genomeSAindexNbases 11
If you are unable to run the index command, download the pre-built index with wget https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/extra_data/genomeIndex.tar.gz, followed by tar -zxvf genomeIndex.tar.gz. |
Run the following command to perform the mapping:
STAR --runThreadN 2 \
--genomeDir genomeIndex \
--readFilesIn /tutorial/data/mouse_brain_rep1_1.fastq.gz \
/tutorial/data/mouse_brain_rep1_2.fastq.gz \
--outSAMunmapped Within \
--outFilterType BySJout \
--outSAMattributes NH HI AS NM MD \
--readFilesCommand pigz -p2 -dc \
--outSAMtype BAM SortedByCoordinate \
--quantMode TranscriptomeSAM \
--outFileNamePrefix alignments/mouse_brain_rep1_
Remember that, in general, you can always have a look at a program's options by just typing the command name or --help after the command (e.g. STAR --help). |
When finished we can look at the bam file:
samtools view -h alignments/mouse_brain_rep1_Aligned.sortedByCoord.out.bam | more
Type q to exit the view.
or at the mapping statistics that come with STAR:
cat alignments/mouse_brain_rep1_Log.final.out
Started job on | Dec 27 16:01:03
Started mapping on | Dec 27 16:01:04
Finished on | Dec 27 16:02:16
Mapping speed, Million of reads per hour | 5.00
Number of input reads | 100000
Average input read length | 202
UNIQUE READS:
Uniquely mapped reads number | 90162
Uniquely mapped reads % | 90.16%
Average mapped length | 200.46
Number of splices: Total | 37153
Number of splices: Annotated (sjdb) | 1851
Number of splices: GT/AG | 36794
Number of splices: GC/AG | 190
Number of splices: AT/AC | 2
Number of splices: Non-canonical | 167
Mismatch rate per base, % | 0.23%
Deletion rate per base | 0.01%
Deletion average length | 1.59
Insertion rate per base | 0.00%
Insertion average length | 1.39
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 1713
% of reads mapped to multiple loci | 1.71%
Number of reads mapped to too many loci | 100
% of reads mapped to too many loci | 0.10%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 7.96%
% of reads unmapped: other | 0.06%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
3.3. RNA-seq signal files
We use STAR to generate BedGraph files from the alignments. Launch the following command to run it:
STAR --runThreadN 2 \
--runMode inputAlignmentsFromBAM \
--inputBAMfile alignments/mouse_brain_rep1_Aligned.sortedByCoord.out.bam \
--outWigType bedGraph \
--outWigStrand Unstranded \
--outFileNamePrefix alignments/mouse_brain_rep1_ \
--outWigReferencesPrefix chr
In order to simplify this part of the hands-on we are processing the data as unstranded even if it is stranded. |
STAR produces two sets of signal files: one using only uniquely mapped reads (Unique), the other also including multi-mapping reads (UniqueMultiple). We convert the BedGraph file with uniquely mapped reads to BigWig:
bedGraphToBigWig alignments/mouse_brain_rep1_Signal.Unique.str1.out.bg \
/tutorial/refs/genome.fa.fai \
alignments/mouse_brain_rep1_uniq.bw
Remember that, in general, you can always have a look at a program's options by just typing the command name or --help after the command (e.g. STAR --help). |
3.4. Visualize RNA-seq signal in the UCSC genome browser
Add the gene expression track to the genome browser in bigWig format. The bigWig files must be either uploaded or linked (if they are present somewhere online)
Go to the UCSC genome browser web page:
On the main menu, hover over Genomes and click Mouse GRCm38/mm10. Scroll down and click on Add custom track. Make sure the assembly information is as follows:
genome: Mouse
assembly: Dec 2011 (GRCm38/mm10) [mm10]
Paste the track specifications in the box Paste URLs or data:
track autoScale=on name=RNAseq,brain,PRNAembryoBrain1 color=204,204,0 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/PRNAembryoBrain1.bw
track autoScale=on name=RNAseq,brain,PRNAembryoBrain2 color=204,204,0 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/PRNAembryoBrain2.bw
track autoScale=on name=RNAseq,heart,PRNAembryoHeart1 color=128,0,128 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/PRNAembryoHeart1.bw
track autoScale=on name=RNAseq,heart,PRNAembryoHeart2 color=128,0,128 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/PRNAembryoHeart2.bw
track autoScale=on name=RNAseq,liver,PRNAembryoLiver1 color=107,142,35 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/PRNAembryoLiver1.bw
track autoScale=on name=RNAseq,liver,PRNAembryoLiver2 color=107,142,35 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/PRNAembryoLiver2.bw
Click Submit
Go to the genome browser to look at some genes and their RNA-seq signal
3.5. Transcript and gene expression quantification
Create a folder for the quantifications:
mkdir quantifications
Create the transcriptome index for RSEM:
mkdir txIndex
rsem-prepare-reference --gtf /tutorial/refs/anno.gtf \
/tutorial/refs/genome.fa \
txIndex/RSEMref
Run the following command to estimate the expression:
rsem-calculate-expression --num-threads 2 \
--bam \
--paired-end \
--estimate-rspd \
--forward-prob 0 \
--no-bam-output \
--seed 12345 \
alignments/mouse_brain_rep1_Aligned.toTranscriptome.out.bam \
txIndex/RSEMref \
quantifications/mouse_brain_rep1
Here we are correctly treating the data as stranded. |
Remember that, in general, you can always have a look at a program's options by just typing the command name or --help after the command (e.g. STAR --help). |
RSEM generates expression quantification at transcript and gene level. For the data analysis section we are going to use only gene expression data.
To obtain a matrix of gene TPM values:
cat /tutorial/data/gene.quantifications.index.txt \
| retrieve_element_rpkms.py --output encode \
--organism mouse \
--element gene \
--value TPM \
--output_dir quantifications
To obtain a matrix of gene read counts:
cat /tutorial/data/gene.quantifications.index.txt \
| retrieve_element_rpkms.py --output encode \
--organism mouse \
--element gene \
--value expected_count \
--output_dir quantifications
Generate the same two matrices for transcripts and check the output files (e.g. by typing head or more).
Compare the number of quantified genes and transcripts (e.g. by typing wc -l).
How many genes were quantified?
How many transcripts were quantified?
Is it the same number as in the annotation file /tutorial/refs/gencode.vM4.gtf?
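A possible way to answer these questions (a sketch; the matrix file names are the ones produced by the previous commands, run from the /tutorial folder):
# number of quantified genes (excluding the header line of the TPM matrix)
tail -n +2 quantifications/encode.mouse.gene.TPM.idr_NA.tsv | wc -l
# number of quantified transcripts
tail -n +2 quantifications/encode.mouse.transcript.TPM.idr_NA.tsv | wc -l
# number of genes in the annotation
awk '$3=="gene"' /tutorial/refs/gencode.vM4.gtf | wc -l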
4. Gene level RNA-seq data analysis
Create a directory dedicated to the analyses:
mkdir analysis
And move into it:
cd analysis
Remember that you can always have a look at the options for all the R scripts that we are using in the analysis by typing --help after the command (e.g. edgeR.analysis.R --help ).
|
4.1. Protein coding vs long-non coding
For the first part of the analysis section we are going to produce separate plots for protein-coding and long non-coding genes. We will split the quantification matrix, using the gene identifiers selected in section Extract gene identifiers:
# protein-coding
(head -1 ../quantifications/encode.mouse.gene.TPM.idr_NA.tsv; grep -Ff ../protein_coding_IDs.txt ../quantifications/encode.mouse.gene.TPM.idr_NA.tsv) > ../quantifications/encode.mouse.pc.gene.TPM.idr_NA.tsv
# long non-coding
(head -1 ../quantifications/encode.mouse.gene.TPM.idr_NA.tsv; grep -Ff ../lincRNA_IDs.txt ../quantifications/encode.mouse.gene.TPM.idr_NA.tsv) > ../quantifications/encode.mouse.linc.gene.TPM.idr_NA.tsv
4.2. TPM distribution
Have a look at the distribution of TPM values (the wrapper is named rpkm_distribution.R, but it works on any expression matrix):
rpkm_distribution.R --input_matrix ../quantifications/encode.mouse.gene.TPM.idr_NA.tsv \
--log \
--pseudocount 0 \
--metadata ../data/gene.quantifications.index.tsv \
--fill_by tissue \
--palette /tutorial/palettes/palTissue.txt \
--output all_genes
To look at the plot outside the container:
boxplot.log_T.psd_0.all_genes.pdf
Look at the distributions of expression values for protein-coding and lincRNAs, using this command to subset the matrix:
selectMatrixRows.sh /tutorial/protein_coding_IDs.txt /tutorial/quantifications/encode.mouse.gene.TPM.idr_NA.tsv > protein-coding.TPM.tsv
Repeat the generation of the boxplot with the new matrix (change the output label).
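For example, a sketch of the protein-coding version (only the input matrix and the output label change; the other options follow the command above):
rpkm_distribution.R --input_matrix protein-coding.TPM.tsv \
                    --log \
                    --pseudocount 0 \
                    --metadata ../data/gene.quantifications.index.tsv \
                    --fill_by tissue \
                    --palette /tutorial/palettes/palTissue.txt \
                    --output protein_coding_genes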
4.3. Clustering analysis
Perform hierarchical clustering to check replicability:
matrix_to_dist.R --input_matrix ../quantifications/encode.mouse.gene.TPM.idr_NA.tsv \
--log10 \
--cor pearson \
--output stdout \
| ggheatmap.R --input_matrix stdin \
--row_metadata /tutorial/data/gene.quantifications.index.tsv \
--col_dendro \
--row_dendro \
--base_size 10 \
--matrix_palette /tutorial/palettes/palSequential.txt \
--rowSide_palette /tutorial/palettes/palTissue.txt \
--rowSide_by tissue \
--matrix_fill_limits 0.85,1 \
--output samples_clustering.pdf
Look at the clustering.
4.4. Differential gene expression
Run the DE analysis with the edgeR package (be careful: it takes read counts, not TPM values, as input):
edgeR.analysis.R --input_matrix ../quantifications/encode.mouse.gene.expected_count.idr_NA.tsv \
--metadata /tutorial/data/gene.quantifications.index.tsv \
--fields tissue \
--coefficient 2 \
--output brain_X_heart
Write a list of the genes overexpressed in brain compared to heart, according to edgeR analysis:
awk '$NF<0.01 && $2<-10{print $1"\tover_brain_X_heart"}' edgeR.cpm1.n2.brain_X_heart.tsv > edgeR.0.01.over_brain_X_heart.txt
Write a list of the genes overexpressed heart compared to brain, according to edgeR analysis:
awk '$NF<0.01 && $2>10 {print $1"\tover_heart_X_brain"}' edgeR.cpm1.n2.brain_X_heart.tsv > edgeR.0.01.over_heart_X_brain.txt
Count how many overexpressed genes there are in each tissue:
wc -l edgeR.0.01.over*.txt
Show the results in a heatmap:
awk '$3=="gene"{ match($0, /gene_id "([^"]+).+gene_type "([^"]+)/, var); print var[1],var[2] }' OFS="\t" /tutorial/refs/gencode.vM4.gtf \
| join.py --file1 stdin \
--file2 <(cat edgeR.0.01.over*.txt) \
| sed '1igene\tedgeR\tgene_type' > gene.edgeR.tsv
cut -f1 gene.edgeR.tsv \
| tail -n+2 \
| selectMatrixRows.sh - ../quantifications/encode.mouse.gene.TPM.idr_NA.tsv \
| ggheatmap.R --width 5 \
--height 8 \
--col_metadata /tutorial/data/gene.quantifications.index.tsv \
--colSide_by tissue \
--col_labels labExpId \
--row_metadata gene.edgeR.tsv \
--merge_row_mdata_on gene \
--rowSide_by edgeR,gene_type \
--row_labels none \
--log \
--pseudocount 0.1 \
--col_dendro \
--row_dendro \
--matrix_palette /tutorial/palettes/palDiverging.txt \
--colSide_palette /tutorial/palettes/palTissue.txt \
--output heatmap.brain_X_heart.pdf
4.5. GO enrichment
Prepare a file with the list of all the genes in the annotation:
awk '{split($10,a,/\"|\./); print a[2]}' /tutorial/refs/gencode.vM4.gtf | sort -u > universe.txt
Launch the GO enrichment script for the Biological Processes, Molecular Function and Cellular Components in the set of genes overexpressed in brain:
# BP
awk '{split($1,a,"."); print a[1]}' edgeR.0.01.over_brain_X_heart.txt \
| GO_enrichment.R --universe universe.txt \
--genes stdin \
--categ BP \
--output edgeR.over_brain_X_heart \
--species mouse
# MF
awk '{split($1,a,"."); print a[1]}' edgeR.0.01.over_brain_X_heart.txt \
| GO_enrichment.R --universe universe.txt \
--genes stdin \
--categ MF \
--output edgeR.over_brain_X_heart \
--species mouse
# CC
awk '{split($1,a,"."); print a[1]}' edgeR.0.01.over_brain_X_heart.txt \
| GO_enrichment.R --universe universe.txt \
--genes stdin \
--categ CC \
--output edgeR.over_brain_X_heart \
--species mouse
The results can be visualized in a web browser, using the online Gene Ontology visualizer REVIGO. As an example, prepare REVIGO input using Biological Process (BP) results:
awk 'NR==1{$1="% "$1}{print $1,$2}' edgeR.over_brain_X_heart.BP.tsv
Copy the above command output and paste it in the text box on the main page of the REVIGO web site. Then click the button Start Revigo and have a look at the results.
You can repeat the same for the genes overexpressed in heart.
5. Isoform level RNA-seq analysis
Create a directory dedicated to the splicing analyses within the tutorial folder:
mkdir /tutorial/splicing
and move to it:
cd /tutorial/splicing
5.1. Prepare input files
To generate alternative splicing (AS) events, we need the genome annotation restricted to exon features in the third field of the gtf file. In addition to that, we will filter it for protein-coding transcripts.
# List of transcript ids
awk '$3=="transcript" && $0~/gene_type "protein_coding"/{ match($0, /transcript_id "([^"]+)/, id); print id[1] }' /tutorial/refs/gencode.vM4.gtf |sort -u > protein_coding_transcript_IDs.txt
# Genome annotation restricted to exon features and filtered by transcript type
cat /tutorial/refs/gencode.vM4.gtf |awk '$3=="exon"' |grep -Ff protein_coding_transcript_IDs.txt > exon-annot.gtf
Also restrict the transcript expression matrix (TPM) to protein-coding transcripts (see section Transcript and gene expression quantification) and generate individual files per tissue.
# Filter transcript TPM matrix
selectMatrixRows.sh protein_coding_transcript_IDs.txt /tutorial/quantifications/encode.mouse.transcript.TPM.idr_NA.tsv > pc-tx.tsv
# Individual transcript expression matrices
for tissue in Brain Heart Liver; do
selectMatrixColumns.sh PRNAembryo${tissue}1:PRNAembryo${tissue}2 pc-tx.tsv > expr.${tissue}.tsv
done
5.2. Generate local AS events
Extract alternative splicing events from a given genomic annotation of exon-intron gene coordinates. Types of events: Skipping Exon (SE), Alternative 5' (A5) or 3' (A3) splice sites (SS), Mutually Exclusive Exons (MX), Retained intron (RI) and Alternative First (AF)/Last (AL) Exons (FL). Check SUPPA detailed description here.
suppa.py generateEvents -i exon-annot.gtf -e SE SS MX RI FL -o localEvents -f ioe
Check how many local events were generated per type:
wc -l localEvents*ioe
Generate a barplot with the number of events found in the genome annotation. Use the R wrapper ggbarplot.R (see the counting sketch below).
Check parameters by typing ggbarplot.R --help. |
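A hedged sketch for counting the events per type before plotting (each .ioe file has one header line, hence the -1; check ggbarplot.R --help for the exact plotting options):
# build a two-column table: event type, number of events
wc -l localEvents_*_strict.ioe | grep -v total | awk 'BEGIN{OFS="\t"}{split($2,a,"_"); print a[2], $1-1}' > event_counts.tsv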
5.3. Compute percent spliced in index (PSI) values for local events
Compute PSI for local single exon skipping events (SE):
event=SE; suppa.py psiPerEvent --total-filter 10 --ioe-file localEvents_${event}_strict.ioe --expression-file pc-tx.tsv -o PSI-${event}
Create individual files per tissue:
event=SE; for tissue in Brain Heart Liver ;do selectMatrixColumns.sh PRNAembryo${tissue}1:PRNAembryo${tissue}2 PSI-${event}.psi > ${tissue}.${event}.psi;done
Generate a barplot with the number of PSI values of SE events computed for both replicates of each tissue. Why do these values vary per tissue?
Use the R wrapper ggbarplot.R. Color the bars by tissue using the palette palTissue.txt located in the /tutorial/palettes folder.
Check parameters by typing ggbarplot.R --help. |
5.4. Differential splicing analysis for local events
Compute differential splicing analysis for SE events, comparing all tissues against all and performing multiple testing correction:
event=SE; suppa.py diffSplice --method empirical --input localEvents_${event}_strict.ioe --psi Brain.${event}.psi Heart.${event}.psi Liver.${event}.psi --tpm expr.Brain.tsv expr.Heart.tsv expr.Liver.tsv -c -gc -o DS.${event}
Look at top cases and play with thresholds of p-value and delta PSI:
event=SE; awk 'BEGIN{FS=OFS="\t"}NR==1{print }NR>1 && $2!="nan" && ($2>0.7 || $2<-0.7) && $3<0.05{print}' DS.${event}.dpsi
event=SE; awk 'BEGIN{FS=OFS="\t"}NR==1{print }NR>1 && $4!="nan" && ($4>0.7 || $4<-0.7) && $5<0.05{print}' DS.${event}.dpsi
event=SE; awk 'BEGIN{FS=OFS="\t"}NR==1{print }NR>1 && $6!="nan" && ($6>0.7 || $6<-0.7) && $7<0.05{print}' DS.${event}.dpsi
5.4.1. Generate a heatmap with best examples
Use the PSI vector .psivec files and the R wrapper ggheatmap.R (a possible sketch is shown below).
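A possible sketch, assuming the identifiers of the most significant events have been saved to a hypothetical file top_events.txt (built from the .dpsi filtering above); the ggheatmap.R options mirror the ones used in the gene-level section:
selectMatrixRows.sh top_events.txt DS.SE.psivec \
  | ggheatmap.R --col_metadata /tutorial/data/gene.quantifications.index.tsv \
                --colSide_by tissue \
                --col_labels labExpId \
                --row_labels none \
                --col_dendro \
                --row_dendro \
                --matrix_palette /tutorial/palettes/palSequential.txt \
                --colSide_palette /tutorial/palettes/palTissue.txt \
                --output heatmap.SE.top.pdf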
5.4.2. Look at examples at UCSC genome browser
Have a look at the examples in the heatmap on the UCSC genome browser.
Highlight the skipped exon and zoom out to look at the entire gene locus.
5.5. Functional analysis
Search for functional information of differentially spliced transcripts, using ENSEMBL and other databases. Check for domains, use transcript comparison tool, phenotypes, gene expression, etc.
5.6. Perform PSI calculation and differential splicing analysis for other local event types
Repeat every analysis step performed for SE events for the other AS event types. Start by checking MX and AF events.
5.6.1. Generate plots and tables summarizing the analysis performed
Generate a barplot with the number of PSI values computed per tissue (both replicates) per local AS event.
Generate a barplot with the number of top events found per tissue at a chosen threshold of delta PSI (p-value < 0.05).
Plots should be visualized outside the Docker container. |
Perform other summary statistics of the AS events, e.g. number of events found per chromosome or in the different strands.
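For instance, a sketch for the per-chromosome counts (the first column of a SUPPA .ioe file is the seqname):
# number of SE events per chromosome
tail -n +2 localEvents_SE_strict.ioe | cut -f1 | sort | uniq -c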
6. Regulation of gene and isoform expression
6.1. ChIP-seq data processing
6.1.1. Fastq files and read QC
Move back to the main folder:
cd /tutorial
Create a folder for the fastqc output:
mkdir -p fastqc
Run fastqc:
fastqc -o fastqc -f fastq /tutorial/data/mouse_brain_H3K4me3.fastq.gz
You can view fastqc results on the browser outside the container:
fastqc/mouse_brain_H3K4me3_fastqc.html
6.1.2. Mapping
Create a folder for the alignment steps:
mkdir chip-alignments
Create the genome index for STAR:
mkdir chip-genomeIndex
STAR --runThreadN 2 \
--runMode genomeGenerate \
--genomeDir chip-genomeIndex \
--genomeFastaFiles /tutorial/refs/genome.fa \
--genomeSAindexNbases 11
If you are unable to run the index command, download the pre-built index with wget https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/extra_data/chip-genomeIndex.tar.gz, followed by tar -zxvf chip-genomeIndex.tar.gz. |
Run the following command to perform the mapping:
STAR --runThreadN 2 \
--genomeDir chip-genomeIndex \
--readFilesIn /tutorial/data/mouse_brain_H3K4me3.fastq.gz \
--outSAMunmapped Within \
--outFilterMultimapNmax 10 \
--alignSJoverhangMin 1000 \
--alignIntronMax 1 \
--outSAMattributes NH HI AS NM MD \
--readFilesCommand pigz -p2 -dc \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix chip-alignments/mouse_brain_H3K4me3_
Remember that, in general, you can always have a look at a program's options by just typing the command name or --help after the command (e.g. STAR --help). |
Which parameters had to be changed in order to run ChIP-seq data (both for the generation of the index and for the mapping step itself)?
Are the number and the set of input files the same?
When finished we can look at the bam file:
samtools view -h chip-alignments/mouse_brain_H3K4me3_Aligned.sortedByCoord.out.bam | more
Type q to exit the view.
or at the mapping statistics that come with STAR:
cat chip-alignments/mouse_brain_H3K4me3_Log.final.out
Started job on | Dec 27 13:33:10
Started mapping on | Dec 27 13:33:10
Finished on | Dec 27 13:33:15
Mapping speed, Million of reads per hour | 72.00
Number of input reads | 100000
Average input read length | 50
UNIQUE READS:
Uniquely mapped reads number | 94051
Uniquely mapped reads % | 94.05%
Average mapped length | 50.30
Number of splices: Total | 0
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 0
Number of splices: GC/AG | 0
Number of splices: AT/AC | 0
Number of splices: Non-canonical | 0
Mismatch rate per base, % | 1.33%
Deletion rate per base | 0.02%
Deletion average length | 1.00
Insertion rate per base | 0.00%
Insertion average length | 1.08
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 2199
% of reads mapped to multiple loci | 2.20%
Number of reads mapped to too many loci | 3239
% of reads mapped to too many loci | 3.24%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 0.46%
% of reads unmapped: other | 0.05%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
6.1.3. Peak calling
Create a folder for the peak calling output:
mkdir peakOut
Run MACS2:
macs2 callpeak --treatment chip-alignments/mouse_brain_H3K4me3_Aligned.sortedByCoord.out.bam \
--name mouse_brain_H3K4me3 \
--outdir peakOut \
--format BAM \
--gsize mm \
--qvalue 0.01 \
--nomodel \
--extsize=200 \
--bdg \
--SPMR
Check MACS2 options by typing macs2 callpeak --help |
You can view results:
ls -ltr peakOut
Check the content of each file and the explanation of each column at MACS2 github page.
6.1.4. ChIP-seq signal files
MACS2 produces two BedGraph files: one called treat_pileup, the other control_lambda.
Sort the Bedgraph file using bedtools:
bedtools sort -i peakOut/mouse_brain_H3K4me3_treat_pileup.bdg >\
peakOut/mouse_brain_H3K4me3_treat_pileup_sorted.bdg
We convert the treat_pileup BedGraph file to BigWig:
bedGraphToBigWig peakOut/mouse_brain_H3K4me3_treat_pileup_sorted.bdg \
/tutorial/refs/genome.fa.fai \
peakOut/mouse_brain_H3K4me3_pileup.bw
Remember that, in general, you can always have a look at a program's options by just typing the command name or --help after the command (in this case simply type bedGraphToBigWig). |
6.1.5. Visualize ChIP-seq signal in the UCSC genome browser
Add the signal track to the genome browser in bigWig format. The bigWig files must be either uploaded or linked (if they are present somewhere online)
Go to the UCSC genome browser web page:
On the main menu, hover over Genomes and click Mouse GRCm38/mm10. Scroll down and click on Add custom track. Make sure the assembly information is as follows:
genome: Mouse
assembly: Dec 2011 (GRCm38/mm10) [mm10]
Paste the track specifications in the box Paste URLs or data:
track autoScale=on name=ChIPseq,brain,CHIPembryoBrain1 color=204,204,0 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/CHIPembryoBrain1.bw
track autoScale=on name=ChIPseq,brain,CHIPembryoBrain2 color=204,204,0 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/CHIPembryoBrain2.bw
track autoScale=on name=ChIPseq,heart,CHIPembryoHeart1 color=128,0,128 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/CHIPembryoHeart1.bw
track autoScale=on name=ChIPseq,heart,CHIPembryoHeart2 color=128,0,128 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/CHIPembryoHeart2.bw
track autoScale=on name=ChIPseq,liver,CHIPembryoLiver1 color=107,142,35 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/CHIPembryoLiver1.bw
track autoScale=on name=ChIPseq,liver,CHIPembryoLiver2 color=107,142,35 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/CHIPembryoLiver2.bw
Click Submit
Go to the genome browser to look at some genes and their RNA-seq and ChIP-seq signals.
6.2. ChIP-seq data analysis
Create a directory dedicated to the analyses:
mkdir chip-analysis
And move into it:
cd chip-analysis
Remember that you can always have a look at the options for the tools that we are using in the analysis by typing --help after the command (e.g. bedtools --help). |
6.2.1. Peak calling statistics
Have a look at the number of peaks called per tissue:
wc -l /tutorial/results/*narrowPeak
6.2.2. Compare peaks
Use bedtools intersect or bedops to identify peaks specific to some tissue(s) (a sketch with bedtools follows the note below).
Remember that you can always have a look at the options for the tools that we are using in the analysis by typing --help after the command (e.g. bedtools --help). |
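A hedged sketch with bedtools (the narrowPeak file names are assumptions based on the files in /tutorial/results; adjust them to the names you actually see with wc -l above):
# peaks called in brain but not overlapping any heart peak
bedtools intersect -v \
  -a /tutorial/results/CHIPembryoBrain1_peaks.narrowPeak \
  -b /tutorial/results/CHIPembryoHeart1_peaks.narrowPeak > brain_not_heart_peaks.bed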
6.2.3. Genomic location
Create a BED file with 200 bp up- and downstream of the transcription start site (TSS) of every protein-coding transcript (e.g. use the transcript ids from /tutorial/splicing/protein_coding_transcript_IDs.txt generated in section Prepare input files); a possible sketch is shown after the note below.
Check the BED section for further information on the file format.
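A possible sketch using gawk (the match() call with three arguments follows the earlier annotation examples; the 200 bp window is clipped at the chromosome start):
# TSS is the feature start on the + strand and the feature end on the - strand
awk 'BEGIN{FS=OFS="\t"} $3=="transcript"{
  match($0, /transcript_id "([^"]+)/, id);
  tss = ($7=="+") ? $4 : $5;
  start = (tss > 201) ? tss - 201 : 0;   # BED coordinates are 0-based
  print $1, start, tss + 200, id[1], 0, $7
}' /tutorial/refs/gencode.vM4.gtf \
  | grep -Ff /tutorial/splicing/protein_coding_transcript_IDs.txt > tss_plusminus200.bed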
6.2.4. Differential peaks
Intersect the TSS BED file with the differential peaks identified for each tissue.
Use bedtools intersect or bedops for this task. |
Look at the differential peaks you defined in the UCSC genome browser.
Upload the differential peak BED file in the UCSC genome browser. Find instructions here.
6.3. Multi-omics data analysis
6.3.1. Promoter regions of differentially expressed genes
Use bash commands to intersect the list of differentially expressed genes (DEG) with the differential H3K4me3 peaks computed in section Differential peaks.
Look at promoter regions of DEGs with and without differential peaks of H3K4me3 at the UCSC genome browser.
Do you see a relationship between H3K4me3 marking and gene expression?
How frequent is it? Use your differential peaks and DEGs to compute it.
6.3.2. ENCODE portal
Search in the ENCODE portal www.encodeproject.org/ the dataset we have been using in this hands-on.
Refer to Introduction for the information of this dataset.
Refer to getting started to learn about the portal.
Explore the data available using the Data → Search menu or by clicking on the graphics on the home page.
Explore the information available in the Material & Methods menu.
Look at standards and guidelines.
Look at software and pipelines.
Which data formats are available for download?
Which other omics data are available for the exact same tissue and developmental stage we are using in this hands-on?
Find the signal files of ATAC-seq samples.
6.3.3. Visualize ATAC-seq signal in the UCSC genome browser
Add the signal track to the genome browser in bigWig format. The bigWig files must be either uploaded or linked (if they are present somewhere online)
Go to the UCSC genome browser web page:
On the main menu, hover over Genomes and click Mouse GRCm38/mm10. Scroll down and click on Add custom track. Make sure the assembly information is as follows:
genome: Mouse
assembly: Dec 2011 (GRCm38/mm10) [mm10]
Paste the track specifications in the box Paste URLs or data:
track autoScale=on name=ATACseq,brain,ATACembryoBrain1 color=204,204,0 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/ATACembryoBrain1.bw
track autoScale=on name=ATACseq,brain,ATACembryoBrain2 color=204,204,0 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/ATACembryoBrain2.bw
track autoScale=on name=ATACseq,heart,ATACembryoHeart1 color=128,0,128 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/ATACembryoHeart1.bw
track autoScale=on name=ATACseq,heart,ATACembryoHeart2 color=128,0,128 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/ATACembryoHeart2.bw
track autoScale=on name=ATACseq,liver,ATACembryoLiver1 color=107,142,35 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/ATACembryoLiver1.bw
track autoScale=on name=ATACseq,liver,ATACembryoLiver2 color=107,142,35 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/ATACembryoLiver2.bw
Click Submit
Go to the genome browser to look at some genes and their RNA-seq, ChIP-seq and ATAC-seq signals.
Look again at the promoter regions of the most interesting examples of DEGs from section Promoter regions of differentially expressed genes.
Does the ATAC-seq signal vary among tissues?
Do the ATAC-seq and H3K4me3 signals overlap?
Do the signals (ATAC-seq and ChIP-seq) vary accordingly?
6.3.4. Promoter regions of differentially spliced genes
Use bash commands to intersect the list of differentially spliced genes with the differential H3K4me3 peaks computed in section Differential peaks.
Look at promoter regions of alternative first exon events at the UCSC genome browser.
Do you see a relationship between chromatin features (H3K4me3 marking and open chromatin) and isoform expression in the different tissues?
6.3.5. Omics portals
Exploring the ENCODE portal, which other omics data would you include in your analysis to follow up the study of the differentially expressed/spliced genes?
Propose other analyses that could be performed using the ENCODE data, for instance related to enhancers or to evolution.
Explore the portal of other omics projects to find out the datasets available, for instance: