{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction\n",
    "\n",
    "Author: JL Villanueva (joseluis.villanueva@crg.eu)\n",
    "\n",
    "This report describes the differential gene expression analysis comparing WT and sh samples (NUDIX5). We will use Kallisto (v 0.43.0) for quantification at the transcript level and Sleuth (v 0.30.0) for testing differential expression (DE) aggregating transcripts into genes.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Initial samples\n",
    "<table>\n",
    " <thead>\n",
    "  <tr>\n",
    "   <th style=\"text-align:right;\"> num </th>\n",
    "   <th style=\"text-align:left;\"> sample </th>\n",
    "   <th style=\"text-align:left;\"> condition </th>\n",
    "   <th style=\"text-align:left;\"> sequencing </th>\n",
    "  </tr>\n",
    " </thead>\n",
    "<tbody>\n",
    "  <tr>\n",
    "   <td style=\"text-align:right;\"> 1 </td>\n",
    "   <td style=\"text-align:left;\"> rw_026_01_01_rnaseq </td>\n",
    "   <td style=\"text-align:left;\"> 2D_wt </td>\n",
    "   <td style=\"text-align:left;\"> 06/27/2018 </td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "   <td style=\"text-align:right;\"> 2 </td>\n",
    "   <td style=\"text-align:left;\"> rw_030_01_01_rnaseq </td>\n",
    "   <td style=\"text-align:left;\"> 2D_wt </td>\n",
    "   <td style=\"text-align:left;\"> 06/27/2018 </td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "   <td style=\"text-align:right;\"> 3 </td>\n",
    "   <td style=\"text-align:left;\"> rw_031_01_01_rnaseq </td>\n",
    "   <td style=\"text-align:left;\"> 2D_sh </td>\n",
    "   <td style=\"text-align:left;\"> 06/27/2018 </td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "   <td style=\"text-align:right;\"> 4 </td>\n",
    "   <td style=\"text-align:left;\"> rw_032_01_01_rnaseq </td>\n",
    "   <td style=\"text-align:left;\"> 2D_sh </td>\n",
    "   <td style=\"text-align:left;\"> 06/27/2018 </td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "   <td style=\"text-align:right;\"> 5 </td>\n",
    "   <td style=\"text-align:left;\"> rw_039_01_01_rnaseq </td>\n",
    "   <td style=\"text-align:left;\"> 3D_wt </td>\n",
    "   <td style=\"text-align:left;\"> 07/24/2018 </td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "   <td style=\"text-align:right;\"> 6 </td>\n",
    "   <td style=\"text-align:left;\"> rw_039_02_01_rnaseq </td>\n",
    "   <td style=\"text-align:left;\"> 3D_wt </td>\n",
    "   <td style=\"text-align:left;\"> 07/24/2018 </td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "   <td style=\"text-align:right;\"> 7 </td>\n",
    "   <td style=\"text-align:left;\"> rw_040_01_01_rnaseq </td>\n",
    "   <td style=\"text-align:left;\"> 3D_sh </td>\n",
    "   <td style=\"text-align:left;\"> 07/24/2018 </td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "   <td style=\"text-align:right;\"> 8 </td>\n",
    "   <td style=\"text-align:left;\"> rw_040_02_01_rnaseq </td>\n",
    "   <td style=\"text-align:left;\"> 3D_sh </td>\n",
    "   <td style=\"text-align:left;\"> 07/24/2018 </td>\n",
    "  </tr>\n",
    "</tbody>\n",
    "</table>\n",
    "\n",
    "# Mapping statistics\n",
    "<table>\n",
    " <thead>\n",
    "  <tr>\n",
    "   <th style=\"text-align:right;\"> X </th>\n",
    "   <th style=\"text-align:left;\"> sample </th>\n",
    "   <th style=\"text-align:right;\"> reads_mapped </th>\n",
    "   <th style=\"text-align:right;\"> reads_proc </th>\n",
    "   <th style=\"text-align:right;\"> frac_mapped </th>\n",
    "   <th style=\"text-align:right;\"> bootstraps_present </th>\n",
    "   <th style=\"text-align:right;\"> bootstraps_used </th>\n",
    "   <th style=\"text-align:left;\"> condition </th>\n",
    "  </tr>\n",
    " </thead>\n",
    "<tbody>\n",
    "  <tr>\n",
    "   <td style=\"text-align:right;\"> 1 </td>\n",
    "   <td style=\"text-align:left;\"> rw_026_01_01_rnaseq </td>\n",
    "   <td style=\"text-align:right;\"> 44375965 </td>\n",
    "   <td style=\"text-align:right;\"> 47925975 </td>\n",
    "   <td style=\"text-align:right;\"> 0.9259 </td>\n",
    "   <td style=\"text-align:right;\"> 100 </td>\n",
    "   <td style=\"text-align:right;\"> 100 </td>\n",
    "   <td style=\"text-align:left;\"> 2D_wt </td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "   <td style=\"text-align:right;\"> 2 </td>\n",
    "   <td style=\"text-align:left;\"> rw_030_01_01_rnaseq </td>\n",
    "   <td style=\"text-align:right;\"> 55177695 </td>\n",
    "   <td style=\"text-align:right;\"> 59107148 </td>\n",
    "   <td style=\"text-align:right;\"> 0.9335 </td>\n",
    "   <td style=\"text-align:right;\"> 100 </td>\n",
    "   <td style=\"text-align:right;\"> 100 </td>\n",
    "   <td style=\"text-align:left;\"> 2D_wt </td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "   <td style=\"text-align:right;\"> 3 </td>\n",
    "   <td style=\"text-align:left;\"> rw_031_01_01_rnaseq </td>\n",
    "   <td style=\"text-align:right;\"> 44576517 </td>\n",
    "   <td style=\"text-align:right;\"> 48206868 </td>\n",
    "   <td style=\"text-align:right;\"> 0.9247 </td>\n",
    "   <td style=\"text-align:right;\"> 100 </td>\n",
    "   <td style=\"text-align:right;\"> 100 </td>\n",
    "   <td style=\"text-align:left;\"> 2D_sh </td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "   <td style=\"text-align:right;\"> 4 </td>\n",
    "   <td style=\"text-align:left;\"> rw_032_01_01_rnaseq </td>\n",
    "   <td style=\"text-align:right;\"> 56474041 </td>\n",
    "   <td style=\"text-align:right;\"> 61169209 </td>\n",
    "   <td style=\"text-align:right;\"> 0.9232 </td>\n",
    "   <td style=\"text-align:right;\"> 100 </td>\n",
    "   <td style=\"text-align:right;\"> 100 </td>\n",
    "   <td style=\"text-align:left;\"> 2D_sh </td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "   <td style=\"text-align:right;\"> 5 </td>\n",
    "   <td style=\"text-align:left;\"> rw_039_01_01_rnaseq </td>\n",
    "   <td style=\"text-align:right;\"> 52429267 </td>\n",
    "   <td style=\"text-align:right;\"> 57390135 </td>\n",
    "   <td style=\"text-align:right;\"> 0.9136 </td>\n",
    "   <td style=\"text-align:right;\"> 100 </td>\n",
    "   <td style=\"text-align:right;\"> 100 </td>\n",
    "   <td style=\"text-align:left;\"> 3D_wt </td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "   <td style=\"text-align:right;\"> 6 </td>\n",
    "   <td style=\"text-align:left;\"> rw_039_02_01_rnaseq </td>\n",
    "   <td style=\"text-align:right;\"> 41912042 </td>\n",
    "   <td style=\"text-align:right;\"> 44742750 </td>\n",
    "   <td style=\"text-align:right;\"> 0.9367 </td>\n",
    "   <td style=\"text-align:right;\"> 100 </td>\n",
    "   <td style=\"text-align:right;\"> 100 </td>\n",
    "   <td style=\"text-align:left;\"> 3D_wt </td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "   <td style=\"text-align:right;\"> 7 </td>\n",
    "   <td style=\"text-align:left;\"> rw_040_01_01_rnaseq </td>\n",
    "   <td style=\"text-align:right;\"> 47846169 </td>\n",
    "   <td style=\"text-align:right;\"> 51477759 </td>\n",
    "   <td style=\"text-align:right;\"> 0.9295 </td>\n",
    "   <td style=\"text-align:right;\"> 100 </td>\n",
    "   <td style=\"text-align:right;\"> 100 </td>\n",
    "   <td style=\"text-align:left;\"> 3D_sh </td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "   <td style=\"text-align:right;\"> 8 </td>\n",
    "   <td style=\"text-align:left;\"> rw_040_02_01_rnaseq </td>\n",
    "   <td style=\"text-align:right;\"> 47094242 </td>\n",
    "   <td style=\"text-align:right;\"> 50976488 </td>\n",
    "   <td style=\"text-align:right;\"> 0.9238 </td>\n",
    "   <td style=\"text-align:right;\"> 100 </td>\n",
    "   <td style=\"text-align:right;\"> 100 </td>\n",
    "   <td style=\"text-align:left;\"> 3D_sh </td>\n",
    "  </tr>\n",
    "</tbody>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Considerations\n",
    "Initially we wanted to test wt vs sh in 2D and 3D and also 2Dwt vs 3Dwt / 2Dsh vs 3Dsh. However there is a batch effect as can be seen in the sequencing date, making it impossible to do the latter comparison.\n",
    "\n",
    "Therefore we will only test wt vs sh (2D and 3D).\n",
    "\n",
    "Code for the whole analysis is available in *scripts* folder. Figures in *Figures* and tables in *tables* folders respectively.\n",
    "\n",
    "#### PCA With all samples #####"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/pca_tpm_all_samples.png\">\n",
    "<P>PCAs were generated using TPMs as calculated by Kallisto. It looks like one of the replicas from 3D_wt might have been swapped with 3D_sh. We will explore this later on."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2D_wt vs 2D_sh\n",
    "\n",
    "\n",
    "<img src=\"figures/pca_tpm_2D.png\">\n",
    "<P>PCA looks good.\n",
    "    \n",
    "<P>Summary table with mean counts (per condition) per gene is named: <A HREF=\"tables/sleuth_table_wTPM_2D_wt_vs_sh.csv\">sleuth_table_wTPM_2D_wt_vs_sh.csv</A>\n",
    "\n",
    "## DE analysis\n",
    "We do the DE analysis aggregating transcripts into genes. We use the likelihood ratio test (lrt). For this comparison we obtain 350 genes.\n",
    "\n",
    "Summary table for significant genes is named <A HREF=\"tables/significant2D_wt_vs_sh.csv\">significant2D_wt_vs_sh.csv</A>\n",
    "\n",
    "Because for this test sleuth does not return a fold change or equivalent metric, I have calculated 3 additional columns: tpm_wt and tpm_sh that are the sum of the average tpm (in replicas) for all transcripts that belong to a gene. After that we have   tpm_wt_by_sh, that simply divides log2(tpm_wt +0.1/ tpm_sh+0.1) with a pseudocount of 0.1. Therefore we have a relative measure of the change between wt and sh.\n",
    "\n",
    "<img src=\"figures/volcano_2D.png\">\n",
    "\n",
    "## An example of a DE gene: NUDT5\n",
    "<img src=\"figures/NUDT5_boxplot.png\">\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3D_wt vs 3D_sh\n",
    "\n",
    "<P>Summary table with mean counts (per condition) per gene is named: <A HREF=\"tables/sleuth_table_wTPM_3D_wt_vs_sh.csv\">sleuth_table_wTPM_3D_wt_vs_sh.csv</A>\n",
    "\n",
    "We checked the patterns of ALDH1, NUDT5, OCT4 (POU5F1), CD44 that are supposed to be upregulated in *wt*. NUDT5 gene has a profile in the samples compatible with the sample tags shown in the PCA. The other genes can't be found (ALDH1) or have inconsistent patterns across transcripts. Therefore we perform the DE analysis using the original tags although the clustering in the PCA is not very good.\n",
    "\n",
    "We get 3 DE genes using lrt test.\n",
    "\n",
    "Summary table for significant genes is named <A HREF=\"tables/significant3D_wt_vs_sh.csv\">significant3D_wt_vs_sh.csv</A>\n",
    "\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}