A useful skill to have under your belt is to identify RNA-Seq datasets in published papers that you can re-analyze on your own to ask new questions. RNA-seq datasets are abundant, and publicaly accessible. When a group generates RNA-Seq data and publishes a paper with this data, there are always many unexplored analysis that can be done. For example, in the Miura lab, we are interested in Alternative Polyadenylation and Alternative Splicing. Existing RNA-Seq datasets can be re-analyzed using tools such as QAPA
and rMATS
. We will perform these analysis as part of this tutorial on a published RNA-Seq dataset.
For more on the usefuless of re-analyzing RNA-Seq data as a tool for learning, see this article by our department Brent Graveley.
We will be using a short read RNA-Seq dataset from Kiltschewskij et al., NAR, 2023.
The dataset is a neural differentiation of SHSY5Y cells performed on a NextSeq500 Illumina sequencer. The dataset, deposited in GEO, can be found here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE155432.
SRR | SRX | Condition |
---|---|---|
SRR12352385 | SRX8851832 | Undiff_1 |
SRR12352386 | SRX8851833 | Undiff_2 |
SRR12352387 | SRX8851834 | Undiff_3 |
SRR12352388 | SRX8851835 | Diff_1 |
SRR12352389 | SRX8851836 | Diff_2 |
SRR12352390 | SRX8851837 | Diff_3 |
These are very large files, and thus downloading from a web browser is not an option. You’ll have to use a tool called prefetch
This will be performed on the Xanadu HPC and executed with a slurm script prefetch.sl
Create a directory called something like /labs/miura/your_name/SHSY5Y
Enter into that directory
cd /labs/miura/your_name/SHSY5Y
create a new file called prefetch.sl
and edit it by entering
nano prefetch.sl
Here is the code for prefetch.sl
that you can then paste in
#!/bin/bash
#SBATCH --job-name=prefetch
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=END
#SBATCH --mem=16G
#SBATCH --mail-user=miura@uchc.edu
#SBATCH --output=/projects/Karlsruhe/SHSY/eofiles/%x.%j.out #standard output
#SBATCH --error=/projects/Karlsruhe/SHSY/eofiles/%x.%j.err #standard error log
module load sratoolkit
prefetch SRX8851832 SRX8851833 SRX8851834 SRX8851835 SRX8851836 SRX8851837
miura@uchc.edu
to your email.--output
and -error
flags to whatever you like. I suggest /labs/miura/your_name/SHSY5Y/eofiles
remember to change permissions of prefetch.sl so that you can execute the command:
chmod +x prefetch.sl
then run the job by entering:
sbatch prefetch.sl
You can monitor progress of the job by entering:
squeue -u pmiura
Now you will use a tool called fasterqdump
which is also part of the sratoolkit
I’m going to skip the SBATCH fields and just tell you the commands you need to have in your script. Maybe call it fasterdump.sh
module load sratoolkit
fasterq-dump /labs/miura/your_name/SHSY5Y/SRR12352385/SRR12352385.sra
fasterq-dump /labs/miura/your_name/SHSY5Y/SRR12352386/SRR12352386.sra
fasterq-dump /labs/miura/your_name/SHSY5Y/SRR12352387/SRR12352387.sra
fasterq-dump /labs/miura/your_name/SHSY5Y/SRR12352388/SRR12352388.sra
fasterq-dump /labs/miura/your_name/SHSY5Y/SRR12352389/SRR12352389.sra
fasterq-dump /labs/miura/your_name/SHSY5Y/SRR12352390/SRR12352390.sra
Alternatively you could use a for
loop to accomplish the same thing
cd /labs/miura/your_name/SHSY5Y/
SAMPLES="SRR12352385 SRR12352386 SRR12352387 SRR12352388 SRR12352389 SRR12352390"
for SAMPLE in $SAMPLES; do
fasterq-dump /labs/miura/your_name/SHSY5Y/${SAMPLE}/${SAMPLE}.sra
done
Using the for loop is a more versatile method (less typing involved!)
There are different methods to align RNA-Seq reads to the genome in a splicing aware manner. Here we use STAR
because it is compatible with downstream analysis of alternative splicing with rMATS
. If you are looking just for gene expression changes, then a much faster approach is to use pseudo-alignment methods such as Salmon or Kallisto.
This is a resource intensive job. So we’ve increased processors to 6 and mem to 100G.
To run STAR you need an index file for homo sapiens. I’ve provided one here:
/labs/Miura/JD_Ngn/STAR/index/homsap_genome_all_cdna_all_ensembl104_grch38_sjdboverhang75
#!/bin/bash
#SBATCH --job-name=run_STAR
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 6
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=END
#SBATCH --mem=100G
#SBATCH --mail-user=miura@uchc.edu
#SBATCH --output=/labs/miura/your_name/SHSY5Y/eofiles/%x.%j.out
#SBATCH --error=/labs/miura/your_name/SHSY5Y/eofiles/%x.%j.err
module load STAR/2.7.1a
cd /projects/Karlsruhe/SHSY/
SAMPLES="SRR12352385 SRR12352386 SRR12352387 SRR12352388 SRR12352389 SRR12352390"
for SAMPLE in $SAMPLES; do
STAR --runThreadN 6 \
--genomeDir /labs/Miura/JD_Ngn/STAR/index/homsap_genome_all_cdna_all_ensembl104_grch38_sjdboverhang75 \
--readFilesIn /labs/miura/your_name/SHSY5Y/${SAMPLE}.fastq \
--outFileNamePrefix /labs/miura/your_name/SHSY5Y/outs/${SAMPLE} \
--outSAMtype BAM SortedByCoordinate \
--outBAMsortingThreadN 1 \
--outFilterType BySJout \
--outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverReadLmax 0.04 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000
done
There are, of course, lots of options when running STAR. To learn more, see the documentation for STAR
Many files are generated by STAR for each sample. One important file is the .bam file. We would like to visualize the RNA-Seq reads on a local computer using Integrated Genomics Viewer IGV. In order to do so, we need to generate an index file (.bai file)
We will run this script using SBATCH as in the previous commands.
module load IGVtools/2.9.1
module load java/11.0.11
module load samtools/1.9
cd /labs/miura/your_name/SHSY5Y/STAR/outs
SAMPLES="SRR12352385 SRR12352386 SRR12352387 SRR12352388 SRR12352389 SRR12352390"
for SAMPLE in $SAMPLES; do
igvtools index ${SAMPLE}Aligned.sortedByCoord.out.bam
printf "\n=====================================\n"
echo \
"done now..."
done
Now you have .bam and .bai files for each RNA-Seq sample! The .bam files are quite big, and depending on your local computer, they might be too big to conveniently handle. Let’s try to visualize them on IGV.
Download and Install IGV.
Open IGV and then in the top left corner, select “Human (GRCh38/hg38)”
File > load From File -select your .bam file. Just pick one for now.
Navigate to a gene of interest. For example “CALM1”. Do you see the coverage? If so, great!
Next, try to load all 6 tracks. Adjust the window sizes and other parameters to become comfortable. Navigate to these genes:
APBA2, SUGT1, DCUN1D5
Zoom in on their 3’UTR regions. Do you notice any interesting trends in read coverage over the 3’UTRs?
The bam files are sometimes tough to work with because of their size. We can do a lot with simpler files that do not show individual reads. Let’s try to generate bigwig
files using deeptools
In order to install deeptools
on the HPC, we will use conda. In order to learn how to use conda for installation purposes, go here [TO BE PROVIDED]. You’ll need to have deeptools installed, and a conda environment, “MyDeeptoolsEnv” for the below script to work.
Here’s a slurm script for using deeptools
to generate .bw files for the .bams. Note that forward and reverse strand specific tracks are generated for each file. This can be useful for visualizing regions where there is convergent overlapping gene expresssion.
#!/bin/bash
#SBATCH --job-name=bigwig
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 10
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=END
#SBATCH --mem=100G
#SBATCH --mail-user=miura@uchc.edu
#SBATCH --output=/labs/miura/your_name/SHSY5Y/eofiles/%x.%j.out
#SBATCH --error=/labs/miura/your_name/SHSY5Y/eofiles/%x.%j.err
source ~/.bashrc
conda activate MyDeeptoolsEnv
cd /labs/miura/yourfolder/SHSY/STAR/outs
SAMPLES="SRR12352385 SRR12352386 SRR12352387 SRR12352388 SRR12352389 SRR12352390"
for SAMPLE in $SAMPLES; do
bamCoverage -b ${SAMPLE}Aligned.sortedByCoord.out.bam -o ${SAMPLE}_fwd.bw --filterRNAstrand forward
bamCoverage -b ${SAMPLE}Aligned.sortedByCoord.out.bam -o ${SAMPLE}_rev.bw --filterRNAstrand reverse
done
You can now transfer these files to your local computer (which are much smaller), and try visualizing them on IGV.