1. Introduction
The RNA-seq, ChIP-seq and ATAC-seq data used for the analysis come from forebrain, heart and liver of mouse embryos at 12.5 days of development, in 2 bioreplicates (1% of the reads are randomly sampled here). These data are provided by ENCODE: Encyclopedia of DNA Elements (www.encodeproject.org/).
Due to time and memory restrictions, only one sample is used for the data processing. This sample was built from one hundred thousand alignment-based pre-filtered reads. A small sample of the genome and annotation (21 chromosomes, 1 Mb long) is used as the reference.
Browse the content using the left panel. |
A copy of the slides of the course can be found here: Introduction; Basic concepts; RNA-seq processing; Gene level RNA-seq analysis; Isoform level RNA-seq analysis; Multiomics. |
2. Basic concepts and setup
2.1. Hands-on setup (done once)
We are going to run all the commands of the hands-on within a Docker container using basic Linux commands and scripts from Git.
2.1.1. Bash shell
Linux and Mac : The Bash shell is available on Linux and Mac OS.
Windows : Use VirtualBox or VMware Player to import this virtual machine with Ubuntu 18.04 and Docker pre-installed. Follow the instructions provided by Diego Garrido here.
Access your Bash shell console and test the commands below.
2.1.2. Basic Linux commands
Browse the directory structure
pwd | tells you where you are |
ls | list the content of the current directory |
ls <directory> | list the content of a directory |
cd <directory> | go to the specified directory |
cd | go to your home directory |
cd .. | go to the parent directory |
tree <directory> | list the content of a directory in a tree-like format |
mkdir <directory> | creates the specified directory |
View the content of a file
less <file> | view text with paging |
head <file> | prints first lines of a file |
tail <file> | prints last lines of a file |
cat <file> | print content of a file to the screen |
zcat <file.gz> | print content of a gzip-compressed file to the screen |
File manipulations
rm <file> | remove file |
cp <file1> <file2> | copy file1 into file2 |
mv <file1> <file2> | rename file1 to file2 |
Some other useful commands
grep <pattern> <file> | show lines of text containing a given pattern |
grep -v <pattern> <file> | show lines of text not containing a given pattern |
sort <file> | sort lines of text files |
wc <file> | count words, lines and characters |
> (redirection) | allows to redirect the output to a file |
pipe (|) | allows to send output from one program to another |
cut | to extract portions of a file by selecting columns |
echo | input a line of text and display it on standard output |
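As a quick illustration, a few of these commands can be chained together. A minimal sketch, assuming the course annotation file has already been downloaded (see the Data section):
# count the lines of the annotation that mention exons
grep exon /tutorial/refs/gencode.vM4.gtf | wc -l
# extract the chromosome and feature columns, sort them and save the result to a file
cut -f1,3 /tutorial/refs/gencode.vM4.gtf | sort > chrom_features.txt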
AWK programming
AWK is a UNIX text-processing programming language. A fast and stable tool for processing text files.
awk '/www/' file | search for the pattern 'www' in each line of the file |
awk '$3 ~ /www/' file | search for the pattern 'www' in the third column of the file |
awk 'length($0) > 80' file | print every line in the file that is longer than 80 characters |
awk 'NR % 2 == 0' file | print even-numbered lines of the file |
Some built-in variables
NR | Number of records |
NF | Number of fields |
FS | Field separator character |
OFS | Output field separator character |
See www.grymoire.com/Unix/Awk.html and www.tutorialspoint.com/awk/awk_basic_examples.htm for more information |
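A small, hedged example of these variables in action, using the tab-separated course annotation:
# print the number of fields of the first line of a tab-separated file
head -1 /tutorial/refs/gencode.vM4.gtf | awk 'BEGIN{FS="\t"}{print NF}'
# print chromosome and feature type separated by a comma for the first lines
awk 'BEGIN{FS="\t"; OFS=","}{print $1,$3}' /tutorial/refs/gencode.vM4.gtf | head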
Writing and editing files
There are several ways to write and edit files.
The most commonly used text editors are vi, emacs and nano.
Vi
Vi is a text editor with a command mode interface. See this article for a beginner’s guide.
To create a new file in the current folder just call vi with the name of the file:
vi file.txt
Vi has several modes which are specified in the bottom part of screen.
When you first open the editor you cannot write to the file, because Vi starts in command mode.
To start writing, press i, and you will notice that the status at the bottom of the screen changes to INSERT.
Now you can paste the text by right-click with the mouse and paste, or by pressing SHIFT+Insert.
To go back to command mode press ESC.
To save, make sure you are in command mode and type:
:wq
To undo, press u when you are in command mode.
Emacs
Emacs is a text editor driven by keyboard shortcuts rather than editing modes. See this article for a beginner’s guide.
emacs file.txt
When you are done, save the file (by pressing Ctrl+x Ctrl+s) and close Emacs (by pressing Ctrl+x Ctrl+c).
Nano
Nano is a simple text editor that shows its keyboard shortcuts at the bottom of the screen. See this article for a beginner’s guide.
To create a new file in the current folder just call nano with the name of the file:
nano file.txt
When you are done, close Nano by pressing Ctrl+X and press Y to save the changes.
2.1.3. Docker
Access your Bash shell console and test the commands in the cheatsheet below.
General commands
dockerd | start the docker daemon |
docker --help | get help with Docker. Can also use --help on all subcommands |
docker info | display system-wide information |
Images
docker images | list local images |
docker rmi <image_name> | delete an image |
docker image prune | remove all unused images |
Containers
docker run --name <container_name> <image_name> | create and run a container from an image, with a custom name |
docker ps | list currently running containers |
docker ps --all | list all docker containers (running and stopped) |
docker stats | view resource usage stats |
See docs.docker.com/get-started/docker_cheatsheet.pdf for more information |
Pull the image from DockerHub:
docker pull ceciliaklein/teaching:uvic
Run useful commands to check the images available.
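For example, a minimal check, assuming the pull completed successfully:
# list local images; the ceciliaklein/teaching repository should appear with the uvic tag
docker images
# or filter for that specific repository
docker images ceciliaklein/teaching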
2.1.4. Data
All the data needed for the course is stored in the following directory:
https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/data/tutorial.tar.gz
Download and uncompress the data:
wget https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/data/tutorial.tar.gz
tar -zxvf tutorial.tar.gz
We are going to use the folder tutorial
as our working directory. Move to it:
cd tutorial
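A quick sanity check of the extracted data (a sketch; data, refs and palettes are the folders used throughout the hands-on):
# list the working directory and check the size of the main data folders
ls
du -sh data refs palettes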
2.1.5. Scripts from GitHub
The scripts used in this course are available in GitHub:
git clone https://github.com/CeciliaKlein/teaching-utils.git
Review Git hands-on by Diego Garrido here. |
Check which scripts and R wrappers are available and the extension of each file.
ls teaching-utils
Which extensions are available? Check the overall content of some file types by using the basic Linux commands.
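One possible way to list the extensions, using only the basic commands introduced above:
# count files per extension in the cloned repository
ls teaching-utils | awk -F. 'NF>1{print $NF}' | sort | uniq -c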
2.2. Run container (repeat every restart of hands-on)
We are going to run all the commands of the hands-on within a Docker container. Check the path to the folder where you saved the data for the course.
In order to run the container shell you need to run the following command in your terminal:
Replace /path/to/tutorial with the path where the tutorial data is stored on your computer. Use pwd to check your current location. |
docker run -i -t -v /path/to/tutorial:/tutorial -w /tutorial ceciliaklein/teaching:uvic
You should get a similar output:
(main) root@516a2307ed2b:/tutorial#
To access the scripts without indicating the full path, add the /tutorial/teaching-utils folder to the PATH environment variable:
export PATH=$PATH:/tutorial/teaching-utils/;
Confirm the previous commands worked by looking at the options for all the R scripts that we are using in the analysis by typing --help after the command (e.g. edgeR.analysis.R --help).
We need to run the commands in this section Run container (repeat every restart of hands-on) (i.e. docker run and export) every time we restart the hands-on. |
2.3. Reference gene annotation
Look at the mouse gene annotation (ensembl v65, long genes):
more /tutorial/refs/gencode.vM4.gtf
Type q to exit the view.
Store information about gene Sec23a in a separate file:
grep Sec23a /tutorial/refs/gencode.vM4.gtf > Sec23a.gtf
Count the number of transcripts of this gene:
awk '$3=="transcript"' Sec23a.gtf | wc -l
Count the number of exons of this gene:
awk '$3=="exon"' Sec23a.gtf | wc -l
What is the biotype of the gene?
What is the biotype of each transcript of this gene?
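One possible way to answer these questions (a sketch; it simply pulls the gene_type and transcript_type attributes out of the GTF attribute field):
# biotype of the gene
awk '$3=="gene"' Sec23a.gtf | grep -o 'gene_type "[^"]*"'
# biotype of each transcript of the gene
awk '$3=="transcript"' Sec23a.gtf | grep -o 'transcript_type "[^"]*"'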
2.3.1. Extract gene identifiers
Sometimes it is useful to get a list of gene identifiers from the annotation. For example, using the gene_type GTF tag:
# select protein-coding gene ids
awk '$3=="gene" && $0~/gene_type "protein_coding"/{ match($0, /gene_id "([^"]+)/, id); print id[1] }' /tutorial/refs/gencode.vM4.gtf | sort > protein_coding_IDs.txt
# select long non-coding gene ids
awk '$3=="gene" && $0~/gene_type "lincRNA"/{ match($0, /gene_id "([^"]+)/, id); print id[1] }' /tutorial/refs/gencode.vM4.gtf | sort > lincRNA_IDs.txt
2.4. Common file formats
2.4.1. FASTA
FASTA is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes.
Let’s have a look at the genome FASTA file we use in the workshop:
more /tutorial/refs/genome.fa
>chr1 (1)
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN (2)
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
1 | sequence description, can be a long description (see below) |
2 | raw sequence |
Type q to exit the view.
The word following the > symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the > and the first letter of the identifier. It is recommended that all lines of text be shorter than 80 characters. The sequence ends if another line starting with a > appears; this indicates the start of another sequence.
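A couple of quick checks on the FASTA file (a sketch; header lines are the ones starting with >):
# number of sequences in the genome FASTA file
grep -c ">" /tutorial/refs/genome.fa
# list the first sequence identifiers
grep ">" /tutorial/refs/genome.fa | head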
2.4.2. FASTQ
FASTQ is a text-based format for storing biological sequences and their corresponding quality scores.
Have a look at one of the FASTQ files for the workshop:
zcat /tutorial/data/mouse_brain_rep1_1.fastq.gz | head -4
@HWI-ST985:73:C08BWACXX:8:1101:1920:2006 1:N:0: (1)
NTGCTCGGCCTCTTTCAGCTGTTTCTGCAGCTGCTGAATATCACTGTCTCTCTTCTCTACTTCTTTCTCTAAAGCCTGCATTTCGTGGTGAACTTTTCCCT (2)
+ (3)
#1=DDFFFHHHHHJJHIIJJJJJJJJJJJIIJGIJJJGIFIGEIGIGHIIIJJIJJIJIJIJJJJIJHJJHHIIIHHHGHHDFBCDCDBCCDDDDDDDDDD (4)
1 | sequence id - begins with the @ character and is followed by a sequence identifier and an optional description |
2 | raw sequence |
3 | begins with the + character and is optionally followed by the same sequence identifier (and any description) again |
4 | quality - encodes the quality values for the sequence and must contain the same number of symbols |
The fourth line can also begin with @ depending on the quality encoding (see below). |
Quality encoding
A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect). The most commonly used mapping is the Phred quality score, Q = -10 * log10(p), stored as a single ASCII character with a fixed offset:
offset | max Phred score range | max ASCII range | real-world Phred score range | real-world ASCII range |
---|---|---|---|---|
33 | 0 - 93 | 33 - 126 | 0 - 40 | 33 - 73 |
64 | 0 - 62 | 64 - 126 | 0 - 40 | 64 - 104 |
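As a small worked example of the encoding (assuming the Phred+33 offset used by current Illumina data), the quality character J has ASCII code 74, i.e. Q = 74 - 33 = 41:
# convert the quality character 'J' to a Phred score with the +33 offset
printf 'J' | od -An -tu1 | awk '{print $1 - 33}'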
2.4.3. SAM
Sequence Alignment/Map format is a TAB-delimited text format for storing sequence alignment information. See here for the full format specification.
Header
SAM files can have an optional header section. If present, it must come before the alignments.
Header lines start with @, while alignment lines do not.
@HD
The @HD header line describes the header, with the format version (VN) and the sorting order of alignments (SO).
@HD VN:1.4 SO:coordinate
@SQ
The @SQ header lines describe the reference sequence dictionary. The order of @SQ lines defines the alignment sorting order.
@SQ SN:chr1 LN:197195432
@SQ SN:chr10 LN:129993255
...
@PG
The @PG line describes the program used to generate the file. It contains the program id (ID), name (PN), version (VN) and command line (CL).
@PG ID:STAR PN:STAR VN:STAR_2.4.2a CL:STAR --runThreadN 2 --genomeDir refs/mouse_genome_mm9_STAR_index --readFilesIn data/mouse_cns_E14_rep1_1.fastq.gz data/mouse_cns_E14_rep1_2.fastq.gz --readFilesCommand pigz -p2 -dc --outFileNamePrefix mouse_cns_E14_rep1 --outSAMtype BAM SortedByCoordinate --outSAMattributes NH HI AS NM MD --outSAMunmapped Within --outFilterType BySJout --quantMode TranscriptomeSAM
Alignments
Each alignment line represents the linear alignment of a sequence and has 11 mandatory fields. These fields always appear in the same order and must be present, but their values can be 0 or * (depending on the field) if the corresponding information is unavailable. Each line can also have optional extra fields.
HWI-ST985:73:C08BWACXX:6:1103:3691:25179 (1) 99 (2) chr1 (3) 3157636 (4) 255 (5) 101M (6) = (7) 3157698 (8) 163 (9) GTTTGGTAAGAAAAACAACTTAATTACTACTTCAGAATTTATTTCATTTTTTTTCTAAAGAGTAAGAAGGAAGATTCTATGTTCACATATTTTGGTGCTCA (10) BB@FFFDFFHHHGJJIIJJJJIHIIIJJIIJJJCHFGHHJDHIJIJIJIJJJJIIGGHGHAE>EEHGHFFEFFEECCCDDEDEEEDDDDEEEEDBBDCCCC (11) NH:i:1 HI:i:1 AS:i:200 NM:i:0 MD:Z:101 (12)
1 | sequence id |
2 | flag |
3 | reference sequence name |
4 | alignment position (1-based ) |
5 | mapping quality |
6 | CIGAR string |
7 | reference sequence name of the primary alignment of the mate (= if it is the same as the current alignment) |
8 | position of the primary alignment of the mate |
9 | observed fragment length |
10 | sequence |
11 | quality |
12 | optional fields |
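The flag field is a bitwise combination of properties of the alignment, and samtools can decode it. For example, the flag 99 of the record above corresponds to a properly paired, first-in-pair read whose mate maps to the reverse strand:
# decode a SAM flag value
samtools flags 99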
CIGAR string
The most common CIGAR operations are:
1 | M | alignment match |
2 | I | insertion to the reference |
3 | D | deletion from the reference |
4 | N | skipped region from the reference (e.g. an intron) |
5 | S | soft clipping, clipped sequences present in SEQ |
6 | H | hard clipping, clipped sequences NOT present in SEQ |
101M == the entire read was mapped without gaps (continuous mapping)
51M1217N50M == the read was mapped with a 1217 nt gap (split mapping)
BAM
BAM is a BGZF compressed version of the SAM format. BGZF is block compression implemented on top of the standard gzip file format. The goal of BGZF is to provide good compression while allowing efficient random access to the BAM file for indexed queries. The BGZF format is ‘gunzip compatible’, in the sense that a compliant gunzip utility can decompress a BGZF compressed file.
Random access
BGZF files support random access through the BAM file index. Indexing aims to achieve fast retrieval of alignments overlapping a specified region without going through all the alignments. The BAM file must be sorted by reference ID and then by leftmost coordinate before it can be indexed. The standard format for indexing BAM files is the BAI format.
The BAI format is based on an underlying binning scheme. Each bin represents a contiguous genomic region which is either fully contained in or non-overlapping with another bin; each alignment is associated with a bin which represents the smallest region containing the entire alignment. To find the alignments that overlap a specified region, we need to get the bins that overlap the region, and then test each alignment in the bins to check overlap.
Many programs need an accompanying BAI index for every input BAM. They usually assume the index is located next to the input file in the filesystem and has the same name as the input file, with the .bai extension appended. E.g.:
mouse_brain_rep1_m4_n10_toGenome.bam
mouse_brain_rep1_m4_n10_toGenome.bam.bai
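If a BAM file is not yet sorted and indexed, samtools can do both. A generic sketch, where sample.bam is a placeholder file name:
# sort by coordinate and index; the index is written to sample.sorted.bam.bai
samtools sort -o sample.sorted.bam sample.bam
samtools index sample.sorted.bam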
We will have a look at the contents of a BAM file in the [GRAPE] section.
CRAM
CRAM is a framework technology with highly efficient and tunable reference-based compression of sequence data and a data format that is directly available for computational use. It supports different compression formats (gzip, bzip2, CRAM records). CRAM records use a wide range of lossless and lossy encoding strategies. See here for the full format specification.
2.4.4. GTF
GTF (Gene Transfer Format) is a refinement to GFF that tightens the specification. The first eight GTF fields are the same as GFF. The group field has been expanded into a list of attributes. Each attribute consists of a type/value pair. Attributes must end in a semicolon, and be separated from any following attribute by exactly one space.
The attribute list must begin with the two mandatory attributes:
gene_id |
A globally unique identifier for the genomic source of the sequence. |
transcript_id |
A globally unique identifier for the predicted transcript. |
Check the ninth GTF field of the sample annotation of the workshop:
cut -f9 refs/gencode.vM4.gtf | head -1
gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; level 2; havana_gene "OTTMUSG00000049935.1";
Look at the first line:
head -1 refs/gencode.vM4.gtf | tr '\t' '\n'
chr1 (1) HAVANA (2) gene (3) 3073253 (4) 3074322 (5) . (6) + (7) . (8) gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; level 2; havana_gene "OTTMUSG00000049935.1"; (9)
1 | seqname - The name of the sequence. Must be a chromosome or scaffold. |
2 | source - The program that generated this feature. |
3 | feature - The name of this type of feature. Some examples of standard feature types are CDS , start_codon , stop_codon , and exon . |
4 | start - The starting position of the feature in the sequence (1-based ). |
5 | end - The ending position of the feature (inclusive). |
6 | score - A score between 0 and 1000. |
7 | strand - Valid entries include + , - , or . (for don’t know/don’t care). |
8 | frame - If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be . . |
9 | group - All lines with the same group are linked together into a single item. Note the gene_id and transcript_id mandatory attributes. |
Feature types
The third field in the GFF specification represents the feature type of the line. In addition to the standard features mentioned above there can be exon, gene and transcript.
Look for the first of each feature in the sample annotation of the workshop:
awk '$3=="exon"' refs/gencode.vM4.gtf | head -1
chr1 HAVANA exon 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_status "KNOWN"; transcript_name "4933401J01Rik-001"; exon_number 1; exon_id "ENSMUSE00001343744.1"; level 2; transcript_support_level "NA"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";
awk '$3=="gene"' refs/gencode.vM4.gtf | head -1
chr1 HAVANA gene 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; level 2; havana_gene "OTTMUSG00000049935.1";
awk '$3=="transcript"' refs/gencode.vM4.gtf | head -1
chr1 HAVANA transcript 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_status "KNOWN"; transcript_name "4933401J01Rik-001"; level 2; transcript_support_level "NA"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";
2.4.5. BED
BED (Browser Extensible Data) format provides a flexible way to represent genomic regions. It is also widely used to define the data lines that are displayed in an annotation track of the UCSC genome browser. BED lines have three required fields and nine additional optional fields. For this reason there are several BEDn formats (e.g. BED9, BED12) depending on the number of optional fields. The number of fields per line must be consistent throughout any single file.
The first three required BED fields are:
1 | chrom | The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671). |
2 | chromStart | The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0. |
3 | chromEnd | The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99. |
The 9 additional optional BED fields are:
4 | name | Defines the name of the BED line. |
5 | score | A score between 0 and 1000. |
6 | strand | Defines the strand: either + or -. |
7 | thickStart | The starting position at which the feature is drawn thickly (for example, the start codon in gene displays). When there is no thick part, thickStart and thickEnd are usually set to the chromStart position. |
8 | thickEnd | The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays). |
9 | itemRgb | An RGB value of the form R,G,B (e.g. 255,0,0). |
10 | blockCount | The number of blocks (exons) in the BED line. |
11 | blockSizes | A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount. |
12 | blockStarts | A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount. |
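Since GTF coordinates are 1-based and inclusive while BED starts are 0-based, converting between the two requires shifting the start by one. A minimal sketch that writes the gene features of the course annotation as BED6:
# GTF (1-based) to BED (0-based, half-open): start-1, end unchanged
awk 'BEGIN{FS=OFS="\t"} $3=="gene"{print $1, $4-1, $5, ".", 0, $7}' /tutorial/refs/gencode.vM4.gtf | head -3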
BedGraph
The bedGraph format allows display of continuous-valued data. This display type is useful for values like probability scores and transcriptome data. It is also widely used to define the data lines that are displayed in a track of the UCSC genome browser. The bedGraph format is an older format used to display sparse data or data that contains elements of varying size.
The bedGraph format is line-oriented. BedGraph data can be preceded by a track definition line, which adds a number of options for controlling the default display of this track. Following the track definition line are the track data in four-column BED format:
chromA chromStartA chromEndA dataValueA
chromB chromStartB chromEndB dataValueB
BedGraph data values can be integer or real, positive or negative. Chromosome positions are specified as 0-relative: the first chromosome position is 0, and the last position in a chromosome of length N would be N - 1. Only positions specified have data. All positions specified in the input data must be in numerical order.
2.4.6. BigWig
The bigWig format is for display of dense, continuous data. BigWig data can be displayed in the UCSC Genome Browser as a graph. BigWig files are created initially from bedGraph or wiggle type files. The resulting bigWig files are in an indexed binary format. The main advantage of bigWig files is that only the portions of the file needed to display a particular region are transferred to UCSC, so for large data sets bigWig is considerably faster than regular files. The bigWig file remains on your web-accessible server (http, https, or ftp), not on the UCSC server. Only the portion that is needed for the chromosomal position you are currently viewing is locally cached as a "sparse file".
We will see an example of BigWig usage in the section Visualize RNA-seq signal in the UCSC genome browser.
Wiggle
The wiggle (WIG) format is an older format for display of dense, continuous data such as GC percent, probability scores, and transcriptome data. Wiggle data elements must be equally sized.
For speed and efficiency, wiggle data is compressed and stored internally in 128 unique bins. This compression means that there is a minor loss of precision when data is exported from a wiggle track in the UCSC table browser. The bedGraph format should be used if it is important to retain exact data when exporting.
Wiggle format is line-oriented. For wiggle custom tracks in the UCSC browser, the first line must be a track definition line, which designates the track as a wiggle track and adds a number of options for controlling the default display.
Wiggle format is composed of declaration lines and data lines, and requires a separate wiggle track definition line. There are two options for formatting wiggle data: variableStep and fixedStep. These formats were developed to allow the file to be written as compactly as possible.
- variableStep is for data with irregular intervals between new data points and is the more commonly used wiggle format. The variableStep section begins with a declaration line and is followed by two columns containing chromosome positions and data values:
variableStep chrom=chrN [span=windowSize]
chromStartA dataValueA
chromStartB dataValueB
- fixedStep is for data with regular intervals between new data values and is the more compact wiggle format. The fixedStep section begins with a declaration line and is followed by a single column of data values:
fixedStep chrom=chrN start=position step=stepInterval [span=windowSize]
dataValue1
dataValue2
Wiggle track data values can be integer or real, positive or negative. Chromosome positions are specified as 1-relative: for a chromosome of length N, the first position is 1 and the last position is N. Only positions specified have data. All positions specified in the input data must be in numerical order.
Extracting Data from the bigWig Format
Because the bigWig files are indexed binary files, they can be difficult to extract data from. Consequently, the following programs developed at the UCSC can be used to extract data from bigWig files:
bigWigToBedGraph | this program converts a bigWig file to ASCII bedGraph format. |
bigWigToWig | this program converts a bigWig file to wig format. |
bigWigSummary | this program extracts summary information from a bigWig file. |
bigWigAverageOverBed | this program computes the average score of a bigWig over each bed, which may have introns. |
bigWigInfo | this program prints out information about a bigWig file. |
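For example, to dump the signal of a region as text (a sketch; signal.bw is a placeholder for any bigWig file, such as the one generated later in the RNA-seq signal section):
# extract the signal over a region of chr1 in bedGraph format
bigWigToBedGraph -chrom=chr1 -start=3000000 -end=3100000 signal.bw region.bedGraph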
3. RNA-seq data processing
3.1. Fastq files and read QC
Create a folder for the fastqc output:
mkdir -p fastqc
Run fastqc:
fastqc -o fastqc -f fastq /tutorial/data/mouse_brain_rep1_1.fastq.gz
You can view fastqc results on the browser outside the container:
fastqc/mouse_brain_rep1_1_fastqc.html
3.2. Mapping
Create a folder for the alignment steps:
mkdir alignments
Create the genome index for STAR:
mkdir genomeIndex
STAR --runThreadN 2 \
--runMode genomeGenerate \
--genomeDir genomeIndex \
--genomeFastaFiles /tutorial/refs/genome.fa \
--sjdbGTFfile /tutorial/refs/anno.gtf \
--genomeSAindexNbases 11
If you are unable to run the index command, download the pre-built index with wget https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/extra_data/genomeIndex.tar.gz, followed by tar -zxvf genomeIndex.tar.gz. |
Run the following command to perform the mapping:
STAR --runThreadN 2 \
--genomeDir genomeIndex \
--readFilesIn /tutorial/data/mouse_brain_rep1_1.fastq.gz \
/tutorial/data/mouse_brain_rep1_2.fastq.gz \
--outSAMunmapped Within \
--outFilterType BySJout \
--outSAMattributes NH HI AS NM MD \
--readFilesCommand pigz -p2 -dc \
--outSAMtype BAM SortedByCoordinate \
--quantMode TranscriptomeSAM \
--outFileNamePrefix alignments/mouse_brain_rep1_
Remember that, in general, you can always have a look at a program's options by just typing the command name or --help after the command (e.g. STAR --help). |
When finished we can look at the bam file:
samtools view -h alignments/mouse_brain_rep1_Aligned.sortedByCoord.out.bam | more
Type q to exit the view.
or at the mapping statistics that come with STAR:
cat alignments/mouse_brain_rep1_Log.final.out
Started job on | Dec 27 16:01:03
Started mapping on | Dec 27 16:01:04
Finished on | Dec 27 16:02:16
Mapping speed, Million of reads per hour | 5.00
Number of input reads | 100000
Average input read length | 202
UNIQUE READS:
Uniquely mapped reads number | 90162
Uniquely mapped reads % | 90.16%
Average mapped length | 200.46
Number of splices: Total | 37153
Number of splices: Annotated (sjdb) | 1851
Number of splices: GT/AG | 36794
Number of splices: GC/AG | 190
Number of splices: AT/AC | 2
Number of splices: Non-canonical | 167
Mismatch rate per base, % | 0.23%
Deletion rate per base | 0.01%
Deletion average length | 1.59
Insertion rate per base | 0.00%
Insertion average length | 1.39
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 1713
% of reads mapped to multiple loci | 1.71%
Number of reads mapped to too many loci | 100
% of reads mapped to too many loci | 0.10%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 7.96%
% of reads unmapped: other | 0.06%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
3.3. RNA-seq signal files
We use STAR to generate BedGraph files from the alignments. Launch the following command to run it:
STAR --runThreadN 2 \
--runMode inputAlignmentsFromBAM \
--inputBAMfile alignments/mouse_brain_rep1_Aligned.sortedByCoord.out.bam \
--outWigType bedGraph \
--outWigStrand Unstranded \
--outFileNamePrefix alignments/mouse_brain_rep1_ \
--outWigReferencesPrefix chr
In order to simplify this part of the hands-on we are processing the data as unstranded even if it is stranded. |
STAR produces two sets of signal files: one using only uniquely mapped reads (Unique), the other also including multi-mapping reads (UniqueMultiple). We convert the BedGraph file with uniquely mapped reads to BigWig:
bedGraphToBigWig alignments/mouse_brain_rep1_Signal.Unique.str1.out.bg \
/tutorial/refs/genome.fa.fai \
alignments/mouse_brain_rep1_uniq.bw
Remember that, in general, you can always have a look at a program's options by just typing the command name or --help after the command (e.g. STAR --help). |
3.4. Visualize RNA-seq signal in the UCSC genome browser
Add the gene expression track to the genome browser in bigWig format. The bigWig files must be either uploaded or linked (if they are present somewhere online)
Go to the UCSC genome browser web page:
On the main menu, hover over Genomes and click Mouse GRCm38/mm10. Scroll down and click on Add custom track. Make sure the assembly information is as follows:
genome: Mouse
assembly: Dec 2011 (GRCm38/mm10) [mm10]
Paste the track specifications in the box Paste URLs or data:
track autoScale=on name=RNAseq,brain,PRNAembryoBrain1 color=204,204,0 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/PRNAembryoBrain1.bw
track autoScale=on name=RNAseq,brain,PRNAembryoBrain2 color=204,204,0 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/PRNAembryoBrain2.bw
track autoScale=on name=RNAseq,heart,PRNAembryoHeart1 color=128,0,128 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/PRNAembryoHeart1.bw
track autoScale=on name=RNAseq,heart,PRNAembryoHeart2 color=128,0,128 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/PRNAembryoHeart2.bw
track autoScale=on name=RNAseq,liver,PRNAembryoLiver1 color=107,142,35 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/PRNAembryoLiver1.bw
track autoScale=on name=RNAseq,liver,PRNAembryoLiver2 color=107,142,35 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/PRNAembryoLiver2.bw
Click Submit
Go to the genome browser to look at some genes and their RNA-seq signal
3.5. Transcript and gene expression quantification
Create a folder for the quantifications:
mkdir quantifications
Create the transcriptome index for RSEM:
mkdir txIndex
rsem-prepare-reference --gtf /tutorial/refs/anno.gtf \
/tutorial/refs/genome.fa \
txIndex/RSEMref
Run the following command to estimate the expression:
rsem-calculate-expression --num-threads 2 \
--bam \
--paired-end \
--estimate-rspd \
--forward-prob 0 \
--no-bam-output \
--seed 12345 \
alignments/mouse_brain_rep1_Aligned.toTranscriptome.out.bam \
txIndex/RSEMref \
quantifications/mouse_brain_rep1
Here we are correctly treating the data as stranded. |
Remember that, in general, you can always have a look at a program's options by just typing the command name or --help after the command (e.g. STAR --help). |
RSEM generates expression quantification at transcript and gene level. For the data analysis section we are going to use only gene expression data.
To obtain a matrix of gene TPM values:
cat /tutorial/data/gene.quantifications.index.txt \
| retrieve_element_rpkms.py --output encode \
--organism mouse \
--element gene \
--value TPM \
--output_dir quantifications
To obtain a matrix of gene read counts:
cat /tutorial/data/gene.quantifications.index.txt \
| retrieve_element_rpkms.py --output encode \
--organism mouse \
--element gene \
--value expected_count \
--output_dir quantifications
Generate the same two matrices for transcripts and check the output files (e.g. by typing head or more).
Compare the number of quantified genes and transcripts (e.g. by typing wc -l).
How many genes were quantified?
How many transcripts were quantified?
Is it the same number as in the annotation file /tutorial/refs/gencode.vM4.gtf?
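A possible way to answer these questions (a sketch; the matrix file names are the ones produced by the previous commands, run from the /tutorial folder):
# number of quantified genes (excluding the header line of the TPM matrix)
tail -n +2 quantifications/encode.mouse.gene.TPM.idr_NA.tsv | wc -l
# number of quantified transcripts
tail -n +2 quantifications/encode.mouse.transcript.TPM.idr_NA.tsv | wc -l
# number of genes in the annotation
awk '$3=="gene"' /tutorial/refs/gencode.vM4.gtf | wc -l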
4. Gene level RNA-seq data analysis
Create a directory dedicated to the analyses:
mkdir analysis
And move into it:
cd analysis
Remember that you can always have a look at the options for all the R scripts that we are using in the analysis by typing --help after the command (e.g. edgeR.analysis.R --help ).
|
4.1. Protein coding vs long-non coding
For the first part of the analysis section we are going to produce separate plots for protein-coding and long non-coding genes. We will split the quantification matrix, using the gene identifiers selected in section Extract gene identifiers:
# protein-coding
(head -1 ../quantifications/encode.mouse.gene.TPM.idr_NA.tsv; grep -Ff ../protein_coding_IDs.txt ../quantifications/encode.mouse.gene.TPM.idr_NA.tsv) > ../quantifications/encode.mouse.pc.gene.TPM.idr_NA.tsv
# long non-coding
(head -1 ../quantifications/encode.mouse.gene.TPM.idr_NA.tsv; grep -Ff ../lincRNA_IDs.txt ../quantifications/encode.mouse.gene.TPM.idr_NA.tsv) > ../quantifications/encode.mouse.linc.gene.TPM.idr_NA.tsv
4.2. TPM distribution
Have a look at the distribution of TPM values (the wrapper is named rpkm_distribution.R, but it works on any expression matrix):
rpkm_distribution.R --input_matrix ../quantifications/encode.mouse.gene.TPM.idr_NA.tsv \
--log \
--pseudocount 0 \
--metadata ../data/gene.quantifications.index.tsv \
--fill_by tissue \
--palette /tutorial/palettes/palTissue.txt \
--output all_genes
To look at the plot outside the container:
boxplot.log_T.psd_0.all_genes.pdf
Look at the distributions of expression values for protein-coding and lincRNAs, using this command to subset the matrix:
selectMatrixRows.sh /tutorial/protein_coding_IDs.txt /tutorial/quantifications/encode.mouse.gene.TPM.idr_NA.tsv > protein-coding.TPM.tsv
Repeat the generation of the boxplot with the new matrix (change the output label).
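For example, a sketch of the protein-coding version (only the input matrix and the output label change; the other options follow the command above):
rpkm_distribution.R --input_matrix protein-coding.TPM.tsv \
                    --log \
                    --pseudocount 0 \
                    --metadata ../data/gene.quantifications.index.tsv \
                    --fill_by tissue \
                    --palette /tutorial/palettes/palTissue.txt \
                    --output protein_coding_genes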
4.3. Clustering analysis
Perform hierarchical clustering to check replicability:
matrix_to_dist.R --input_matrix ../quantifications/encode.mouse.gene.TPM.idr_NA.tsv \
--log10 \
--cor pearson \
--output stdout \
| ggheatmap.R --input_matrix stdin \
--row_metadata /tutorial/data/gene.quantifications.index.tsv \
--col_dendro \
--row_dendro \
--base_size 10 \
--matrix_palette /tutorial/palettes/palSequential.txt \
--rowSide_palette /tutorial/palettes/palTissue.txt \
--rowSide_by tissue \
--matrix_fill_limits 0.85,1 \
--output samples_clustering.pdf
Look at the clustering.
4.4. Differential gene expression
Run the DE analysis with the edgeR package (be careful: it takes read counts, not TPM values, as input):
edgeR.analysis.R --input_matrix ../quantifications/encode.mouse.gene.expected_count.idr_NA.tsv \
--metadata /tutorial/data/gene.quantifications.index.tsv \
--fields tissue \
--coefficient 2 \
--output brain_X_heart
Write a list of the genes overexpressed in brain compared to heart, according to edgeR analysis:
awk '$NF<0.01 && $2<-10{print $1"\tover_brain_X_heart"}' edgeR.cpm1.n2.brain_X_heart.tsv > edgeR.0.01.over_brain_X_heart.txt
Write a list of the genes overexpressed heart compared to brain, according to edgeR analysis:
awk '$NF<0.01 && $2>10 {print $1"\tover_heart_X_brain"}' edgeR.cpm1.n2.brain_X_heart.tsv > edgeR.0.01.over_heart_X_brain.txt
Count how many overexpressed genes there are in each tissue:
wc -l edgeR.0.01.over*.txt
Show the results in a heatmap:
awk '$3=="gene"{ match($0, /gene_id "([^"]+).+gene_type "([^"]+)/, var); print var[1],var[2] }' OFS="\t" /tutorial/refs/gencode.vM4.gtf \
| join.py --file1 stdin \
--file2 <(cat edgeR.0.01.over*.txt) \
| sed '1igene\tedgeR\tgene_type' > gene.edgeR.tsv
cut -f1 gene.edgeR.tsv \
| tail -n+2 \
| selectMatrixRows.sh - ../quantifications/encode.mouse.gene.TPM.idr_NA.tsv \
| ggheatmap.R --width 5 \
--height 8 \
--col_metadata /tutorial/data/gene.quantifications.index.tsv \
--colSide_by tissue \
--col_labels labExpId \
--row_metadata gene.edgeR.tsv \
--merge_row_mdata_on gene \
--rowSide_by edgeR,gene_type \
--row_labels none \
--log \
--pseudocount 0.1 \
--col_dendro \
--row_dendro \
--matrix_palette /tutorial/palettes/palDiverging.txt \
--colSide_palette /tutorial/palettes/palTissue.txt \
--output heatmap.brain_X_heart.pdf
4.5. GO enrichment
Prepare a file with the list of all the genes in the annotation:
awk '{split($10,a,/\"|\./); print a[2]}' /tutorial/refs/gencode.vM4.gtf | sort -u > universe.txt
Launch the GO enrichment script for the Biological Processes, Molecular Function and Cellular Components in the set of genes overexpressed in brain:
# BP
awk '{split($1,a,"."); print a[1]}' edgeR.0.01.over_brain_X_heart.txt \
| GO_enrichment.R --universe universe.txt \
--genes stdin \
--categ BP \
--output edgeR.over_brain_X_heart \
--species mouse
# MF
awk '{split($1,a,"."); print a[1]}' edgeR.0.01.over_brain_X_heart.txt \
| GO_enrichment.R --universe universe.txt \
--genes stdin \
--categ MF \
--output edgeR.over_brain_X_heart \
--species mouse
# CC
awk '{split($1,a,"."); print a[1]}' edgeR.0.01.over_brain_X_heart.txt \
| GO_enrichment.R --universe universe.txt \
--genes stdin \
--categ CC \
--output edgeR.over_brain_X_heart \
--species mouse
The results can be visualized in a web browser, using the online Gene Ontology visualizer REVIGO. As an example, prepare REVIGO input using Biological Process (BP) results:
awk 'NR==1{$1="% "$1}{print $1,$2}' edgeR.over_brain_X_heart.BP.tsv
Copy the above command output and paste it in the text box on the main page of the REVIGO web site. Then click the button Start Revigo and have a look at the results.
You can repeat the same for the genes overexpressed in heart.
5. Isoform level RNA-seq analysis
Create a directory dedicated to the splicing analyses within the tutorial folder:
mkdir /tutorial/splicing
and move to it:
cd /tutorial/splicing
5.1. Prepare input files
To generate alternative splicing (AS) events, we need the genome annotation restricted to exon features in the third field of the gtf file. In addition to that, we will filter it for protein-coding transcripts.
# List of transcript ids
awk '$3=="transcript" && $0~/gene_type "protein_coding"/{ match($0, /transcript_id "([^"]+)/, id); print id[1] }' /tutorial/refs/gencode.vM4.gtf |sort -u > protein_coding_transcript_IDs.txt
# Genome annotation restricted to exon features and filtered by transcript type
cat /tutorial/refs/gencode.vM4.gtf |awk '$3=="exon"' |grep -Ff protein_coding_transcript_IDs.txt > exon-annot.gtf
Also restrict the transcript expression matrix (TPM) to protein-coding transcripts (see section Transcript and gene expression quantification) and generate individual files per tissue.
# Filter transcript TPM matrix
selectMatrixRows.sh protein_coding_transcript_IDs.txt /tutorial/quantifications/encode.mouse.transcript.TPM.idr_NA.tsv > pc-tx.tsv
# Individual transcript expression matrices
for tissue in Brain Heart Liver; do
selectMatrixColumns.sh PRNAembryo${tissue}1:PRNAembryo${tissue}2 pc-tx.tsv > expr.${tissue}.tsv
done
5.2. Generate local AS events
Extract alternative splicing events from a given genomic annotation of exon-intron gene coordinates. Types of events: Skipping Exon (SE), Alternative 5' (A5) or 3' (A3) splice sites (SS), Mutually Exclusive Exons (MX), Retained intron (RI) and Alternative First (AF)/Last (AL) Exons (FL). Check SUPPA detailed description here.
suppa.py generateEvents -i exon-annot.gtf -e SE SS MX RI FL -o localEvents -f ioe
Check how many local events were generated per type:
wc -l localEvents*ioe
Generate a barplot with the number of events found in the genome annotation. Use the R wrapper ggbarplot.R (see the counting sketch below).
Check parameters by typing ggbarplot.R --help. |
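A hedged sketch for counting the events per type before plotting (each .ioe file has one header line, hence the -1; check ggbarplot.R --help for the exact plotting options):
# build a two-column table: event type, number of events
wc -l localEvents_*_strict.ioe | grep -v total | awk 'BEGIN{OFS="\t"}{split($2,a,"_"); print a[2], $1-1}' > event_counts.tsv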
5.3. Compute percent spliced in index (PSI) values for local events
Compute PSI for local single exon skipping events (SE):
event=SE; suppa.py psiPerEvent --total-filter 10 --ioe-file localEvents_${event}_strict.ioe --expression-file pc-tx.tsv -o PSI-${event}
Create individual files per tissue:
event=SE; for tissue in Brain Heart Liver ;do selectMatrixColumns.sh PRNAembryo${tissue}1:PRNAembryo${tissue}2 PSI-${event}.psi > ${tissue}.${event}.psi;done
Generate a barplot with the number of PSI values of SE events computed for both replicates of each tissue. Why do these values vary per tissue?
Use the R wrapper ggbarplot.R. Color the bars by tissue using the palette palTissue.txt located in the /tutorial/palettes folder.
Check parameters by typing ggbarplot.R --help. |
5.4. Differential splicing analysis for local events
Compute differential splicing analysis for SE events, comparing all tissues against all and performing multiple testing correction:
event=SE; suppa.py diffSplice --method empirical --input localEvents_${event}_strict.ioe --psi Brain.${event}.psi Heart.${event}.psi Liver.${event}.psi --tpm expr.Brain.tsv expr.Heart.tsv expr.Liver.tsv -c -gc -o DS.${event}
Look at top cases and play with thresholds of p-value and delta PSI:
event=SE; awk 'BEGIN{FS=OFS="\t"}NR==1{print }NR>1 && $2!="nan" && ($2>0.7 || $2<-0.7) && $3<0.05{print}' DS.${event}.dpsi
event=SE; awk 'BEGIN{FS=OFS="\t"}NR==1{print }NR>1 && $4!="nan" && ($4>0.7 || $4<-0.7) && $5<0.05{print}' DS.${event}.dpsi
event=SE; awk 'BEGIN{FS=OFS="\t"}NR==1{print }NR>1 && $6!="nan" && ($6>0.7 || $6<-0.7) && $7<0.05{print}' DS.${event}.dpsi
5.4.1. Generate a heatmap with best examples
Use the PSI vector .psivec files and the R wrapper ggheatmap.R (a possible sketch is shown below).
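A possible sketch, assuming the identifiers of the most significant events have been saved to a hypothetical file top_events.txt (built from the .dpsi filtering above); the ggheatmap.R options mirror the ones used in the gene-level section:
selectMatrixRows.sh top_events.txt DS.SE.psivec \
  | ggheatmap.R --col_metadata /tutorial/data/gene.quantifications.index.tsv \
                --colSide_by tissue \
                --col_labels labExpId \
                --row_labels none \
                --col_dendro \
                --row_dendro \
                --matrix_palette /tutorial/palettes/palSequential.txt \
                --colSide_palette /tutorial/palettes/palTissue.txt \
                --output heatmap.SE.top.pdf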
5.4.2. Look at examples at UCSC genome browser
Have a look at the examples in the heatmap on the UCSC genome browser.
Highlight the skipped exon and zoom out to look at the entire gene locus.
5.5. Functional analysis
Search for functional information of differentially spliced transcripts, using ENSEMBL and other databases. Check for domains, use transcript comparison tool, phenotypes, gene expression, etc.
5.6. Perform PSI calculation and differential splicing analysis for other local event types
Repeat every analysis step performed for SE events for the other AS event types. Start by checking MX and AF events.
5.6.1. Generate plots and tables summarizing the analysis performed
Generate a barplot with the number of PSI values computed per tissue (both replicates) per local AS event.
Generate a barplot with the number of top events found per tissue at a chosen threshold of delta PSI (p-value < 0.05).
Plots should be visualized outside the Docker container. |
Perform other summary statistics of the AS events, e.g. number of events found per chromosome or in the different strands.
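For instance, a sketch for the per-chromosome counts (the first column of a SUPPA .ioe file is the seqname):
# number of SE events per chromosome
tail -n +2 localEvents_SE_strict.ioe | cut -f1 | sort | uniq -c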
6. Regulation of gene and isoform expression
6.1. ChIP-seq data processing
6.1.1. Fastq files and read QC
Move back to the main folder:
cd /tutorial
Create a folder for the fastqc output:
mkdir -p fastqc
Run fastqc:
fastqc -o fastqc -f fastq /tutorial/data/mouse_brain_H3K4me3.fastq.gz
You can view fastqc results on the browser outside the container:
fastqc/mouse_brain_H3K4me3_fastqc.html
6.1.2. Mapping
Create a folder for the alignment steps:
mkdir chip-alignments
Create the genome index for STAR:
mkdir chip-genomeIndex
STAR --runThreadN 2 \
--runMode genomeGenerate \
--genomeDir chip-genomeIndex \
--genomeFastaFiles /tutorial/refs/genome.fa \
--genomeSAindexNbases 11
If you are unable to run the index command, download the pre-built index with wget https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/extra_data/chip-genomeIndex.tar.gz, followed by tar -zxvf chip-genomeIndex.tar.gz. |
Run the following command to perform the mapping:
STAR --runThreadN 2 \
--genomeDir chip-genomeIndex \
--readFilesIn /tutorial/data/mouse_brain_H3K4me3.fastq.gz \
--outSAMunmapped Within \
--outFilterMultimapNmax 10 \
--alignSJoverhangMin 1000 \
--alignIntronMax 1 \
--outSAMattributes NH HI AS NM MD \
--readFilesCommand pigz -p2 -dc \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix chip-alignments/mouse_brain_H3K4me3_
Remember that, in general, you can always have a look at a program's options by just typing the command name or --help after the command (e.g. STAR --help). |
Which parameters had to be changed in order to run ChIP-seq data (both for the generation of the index and for the mapping step itself)?
Are the number and the set of input files the same?
When finished we can look at the bam file:
samtools view -h chip-alignments/mouse_brain_H3K4me3_Aligned.sortedByCoord.out.bam | more
Type q to exit the view.
or at the mapping statistics that come with STAR:
cat chip-alignments/mouse_brain_H3K4me3_Log.final.out
Started job on | Dec 27 13:33:10
Started mapping on | Dec 27 13:33:10
Finished on | Dec 27 13:33:15
Mapping speed, Million of reads per hour | 72.00
Number of input reads | 100000
Average input read length | 50
UNIQUE READS:
Uniquely mapped reads number | 94051
Uniquely mapped reads % | 94.05%
Average mapped length | 50.30
Number of splices: Total | 0
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 0
Number of splices: GC/AG | 0
Number of splices: AT/AC | 0
Number of splices: Non-canonical | 0
Mismatch rate per base, % | 1.33%
Deletion rate per base | 0.02%
Deletion average length | 1.00
Insertion rate per base | 0.00%
Insertion average length | 1.08
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 2199
% of reads mapped to multiple loci | 2.20%
Number of reads mapped to too many loci | 3239
% of reads mapped to too many loci | 3.24%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 0.46%
% of reads unmapped: other | 0.05%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
6.1.3. Peak calling
Create a folder for the peak calling output:
mkdir peakOut
Run MACS2:
macs2 callpeak --treatment chip-alignments/mouse_brain_H3K4me3_Aligned.sortedByCoord.out.bam \
--name mouse_brain_H3K4me3 \
--outdir peakOut \
--format BAM \
--gsize mm \
--qvalue 0.01 \
--nomodel \
--extsize=200 \
--bdg \
--SPMR
Check MACS2 options by typing macs2 callpeak --help |
You can view results:
ls -ltr peakOut
Check the content of each file and the explanation of each column at MACS2 github page.
6.1.4. ChIP-seq signal files
MACS2 produces two BedGraph files: one called treat_pileup, the other control_lambda.
Sort the Bedgraph file using bedtools:
bedtools sort -i peakOut/mouse_brain_H3K4me3_treat_pileup.bdg >\
peakOut/mouse_brain_H3K4me3_treat_pileup_sorted.bdg
We convert the treat_pileup BedGraph file to BigWig:
bedGraphToBigWig peakOut/mouse_brain_H3K4me3_treat_pileup_sorted.bdg \
/tutorial/refs/genome.fa.fai \
peakOut/mouse_brain_H3K4me3_pileup.bw
Remember that, in general, you can always have a look at a program's options by just typing the command name or --help after the command (in this case simply type bedGraphToBigWig). |
6.1.5. Visualize ChIP-seq signal in the UCSC genome browser
Add the signal track to the genome browser in bigWig format. The bigWig files must be either uploaded or linked (if they are present somewhere online)
Go to the UCSC genome browser web page:
On the main menu, hover over Genomes and click Mouse GRCm38/mm10. Scroll down and click on Add custom track. Make sure the assembly information is as follows:
genome: Mouse
assembly: Dec 2011 (GRCm38/mm10) [mm10]
Paste the track specifications in the box Paste URLs or data:
track autoScale=on name=ChIPseq,brain,CHIPembryoBrain1 color=204,204,0 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/CHIPembryoBrain1.bw
track autoScale=on name=ChIPseq,brain,CHIPembryoBrain2 color=204,204,0 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/CHIPembryoBrain2.bw
track autoScale=on name=ChIPseq,heart,CHIPembryoHeart1 color=128,0,128 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/CHIPembryoHeart1.bw
track autoScale=on name=ChIPseq,heart,CHIPembryoHeart2 color=128,0,128 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/CHIPembryoHeart2.bw
track autoScale=on name=ChIPseq,liver,CHIPembryoLiver1 color=107,142,35 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/CHIPembryoLiver1.bw
track autoScale=on name=ChIPseq,liver,CHIPembryoLiver2 color=107,142,35 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/CHIPembryoLiver2.bw
Click Submit
Go to the genome browser to look at some genes and their RNA-seq and ChIP-seq signals.
6.2. ChIP-seq data analysis
Create a directory dedicated to the analyses:
mkdir chip-analysis
And move into it:
cd chip-analysis
Remember that you can always have a look at the options for the tools that we are using in the analysis by typing --help after the command (e.g. bedtools --help). |
6.2.1. Peak calling statistics
Have a look at the number of peaks called per tissue:
wc -l /tutorial/results/*narrowPeak
6.2.2. Compare peaks
Use bedtools intersect or bedops to identify peaks specific to some tissue(s) (a sketch with bedtools follows the note below).
Remember that you can always have a look at the options for the tools that we are using in the analysis by typing --help after the command (e.g. bedtools --help). |
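A hedged sketch with bedtools (the narrowPeak file names are assumptions based on the files in /tutorial/results; adjust them to the names you actually see with wc -l above):
# peaks called in brain but not overlapping any heart peak
bedtools intersect -v \
  -a /tutorial/results/CHIPembryoBrain1_peaks.narrowPeak \
  -b /tutorial/results/CHIPembryoHeart1_peaks.narrowPeak > brain_not_heart_peaks.bed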
6.2.3. Genomic location
Create a BED file with 200 bp up- and downstream of the transcription start site (TSS) of every protein-coding transcript (e.g. use the transcript ids from /tutorial/splicing/protein_coding_transcript_IDs.txt generated in section Prepare input files); a possible sketch is shown after the note below.
Check the BED section for further information on the file format.
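A possible sketch using gawk (the match() call with three arguments follows the earlier annotation examples; the 200 bp window is clipped at the chromosome start):
# TSS is the feature start on the + strand and the feature end on the - strand
awk 'BEGIN{FS=OFS="\t"} $3=="transcript"{
  match($0, /transcript_id "([^"]+)/, id);
  tss = ($7=="+") ? $4 : $5;
  start = (tss > 201) ? tss - 201 : 0;   # BED coordinates are 0-based
  print $1, start, tss + 200, id[1], 0, $7
}' /tutorial/refs/gencode.vM4.gtf \
  | grep -Ff /tutorial/splicing/protein_coding_transcript_IDs.txt > tss_plusminus200.bed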
6.2.4. Differential peaks
Intersect the TSS BED file with the differential peaks identified for each tissue.
Use bedtools intersect or bedops for this task. |
Look at the differential peaks you defined in the UCSC genome browser.
Upload the differential peak BED file in the UCSC genome browser. Find instructions here.
6.3. Multi-omics data analysis
6.3.1. Promoter regions of differentially expressed genes
Use bash commands to intersect the list of differentially expressed genes (DEG) with the differential H3K4me3 peaks computed in section Differential peaks.
Look at promoter regions of DEGs with and without differential peaks of H3K4me3 at the UCSC genome browser.
Do you see a relationship between H3K4me3 marking and gene expression?
How frequent is it? Use your differential peaks and DEGs to compute it.
6.3.2. ENCODE portal
Search in the ENCODE portal www.encodeproject.org/ the dataset we have been using in this hands-on.
Refer to Introduction for the information of this dataset.
Refer to getting started to learn about the portal.
Explore the data available using the Data → Search menu or by clicking on the graphics on the home page.
Explore the information available in the Material & Methods menu.
Look at standards and guidelines.
Look at software and pipelines.
Which data formats are available for download?
Which other omics data are available for the exact same tissue and developmental stage we are using in this hands-on?
Find the signal files of ATAC-seq samples.
6.3.3. Visualize ATAC-seq signal in the UCSC genome browser
Add the signal track to the genome browser in bigWig format. The bigWig files must be either uploaded or linked (if they are present somewhere online)
Go to the UCSC genome browser web page:
On the main menu, hover over Genomes and click Mouse GRCm38/mm10. Scroll down and click on Add custom track. Make sure the assembly information is as follows:
genome: Mouse
assembly: Dec 2011 (GRCm38/mm10) [mm10]
Paste the track specifications in the box Paste URLs or data:
track autoScale=on name=ATACseq,brain,ATACembryoBrain1 color=204,204,0 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/ATACembryoBrain1.bw
track autoScale=on name=ATACseq,brain,ATACembryoBrain2 color=204,204,0 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/ATACembryoBrain2.bw
track autoScale=on name=ATACseq,heart,ATACembryoHeart1 color=128,0,128 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/ATACembryoHeart1.bw
track autoScale=on name=ATACseq,heart,ATACembryoHeart2 color=128,0,128 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/ATACembryoHeart2.bw
track autoScale=on name=ATACseq,liver,ATACembryoLiver1 color=107,142,35 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/ATACembryoLiver1.bw
track autoScale=on name=ATACseq,liver,ATACembryoLiver2 color=107,142,35 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/ATACembryoLiver2.bw
Click Submit
Go to the genome browser to look at some genes and their RNA-seq, ChIP-seq and ATAC-seq signals.
Look again at the promoter regions of the most interesting examples of DEGs from section Promoter regions of differentially expressed genes.
Does the ATAC-seq signal vary among tissues?
Do the ATAC-seq and H3K4me3 signals overlap?
Do the signals (ATAC-seq and ChIP-seq) vary accordingly?
6.3.4. Promoter regions of differentially spliced genes
Use bash commands to intersect the list of differentially spliced genes with the differential H3K4me3 peaks computed in section Differential peaks.
Look at promoter regions of alternative first exon events at the UCSC genome browser.
Do you see a relationship between chromatin features (H3K4me3 marking and open chromatin) and isoform expression in the different tissues?
6.3.5. Omics portals
Exploring the ENCODE portal, which other omics data would you include in your analysis to follow up the study of the differentially expressed/spliced genes?
Propose other analyses that could be performed using the ENCODE data, for instance related to enhancers or to evolution.
Explore the portal of other omics projects to find out the datasets available, for instance: