1 Getting started

A useful skill to have under your belt is to identify RNA-Seq datasets in published papers that you can re-analyze on your own to ask new questions. RNA-seq datasets are abundant, and publicaly accessible. When a group generates RNA-Seq data and publishes a paper with this data, there are always many unexplored analysis that can be done. For example, in the Miura lab, we are interested in Alternative Polyadenylation and Alternative Splicing. Existing RNA-Seq datasets can be re-analyzed using tools such as QAPA and rMATS. We will perform these analysis as part of this tutorial on a published RNA-Seq dataset.

For more on the usefuless of re-analyzing RNA-Seq data as a tool for learning, see this article by our department Brent Graveley.

1.1 Finding Existing RNA-Seq datasets

We will be using a short read RNA-Seq dataset from Kiltschewskij et al., NAR, 2023.

The dataset is a neural differentiation of SHSY5Y cells performed on a NextSeq500 Illumina sequencer. The dataset, deposited in GEO, can be found here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE155432.

1.1.1 The dataset

SRR	SRX	Condition
SRR12352385	SRX8851832	Undiff_1
SRR12352386	SRX8851833	Undiff_2
SRR12352387	SRX8851834	Undiff_3
SRR12352388	SRX8851835	Diff_1
SRR12352389	SRX8851836	Diff_2
SRR12352390	SRX8851837	Diff_3

1.2 Obtaining the .sra files and converting to .fastq

These are very large files, and thus downloading from a web browser is not an option. You’ll have to use a tool called prefetch

This will be performed on the Xanadu HPC and executed with a slurm script prefetch.sl

For a primer on using Xanadu see here

Create a directory called something like /labs/miura/your_name/SHSY5Y

Enter into that directory cd /labs/miura/your_name/SHSY5Y

create a new file called prefetch.sl and edit it by entering nano prefetch.sl

Here is the code for prefetch.sl that you can then paste in

#!/bin/bash

#SBATCH --job-name=prefetch
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=END
#SBATCH --mem=16G
#SBATCH --mail-user=miura@uchc.edu
#SBATCH --output=/projects/Karlsruhe/SHSY/eofiles/%x.%j.out         #standard output
#SBATCH --error=/projects/Karlsruhe/SHSY/eofiles/%x.%j.err          #standard error log

module load sratoolkit
prefetch SRX8851832 SRX8851833 SRX8851834 SRX8851835 SRX8851836 SRX8851837

You will change miura@uchc.edu to your email.
Change the --output and -error flags to whatever you like. I suggest /labs/miura/your_name/SHSY5Y/eofiles

remember to change permissions of prefetch.sl so that you can execute the command:

chmod +x prefetch.sl

then run the job by entering:

sbatch prefetch.sl

You can monitor progress of the job by entering:

squeue -u pmiura

substitute your own username for pmiura

1.3 converting to fastq files

Now you will use a tool called fasterqdump which is also part of the sratoolkit

I’m going to skip the SBATCH fields and just tell you the commands you need to have in your script. Maybe call it fasterdump.sh

module load sratoolkit
fasterq-dump /labs/miura/your_name/SHSY5Y/SRR12352385/SRR12352385.sra
fasterq-dump /labs/miura/your_name/SHSY5Y/SRR12352386/SRR12352386.sra
fasterq-dump /labs/miura/your_name/SHSY5Y/SRR12352387/SRR12352387.sra
fasterq-dump /labs/miura/your_name/SHSY5Y/SRR12352388/SRR12352388.sra
fasterq-dump /labs/miura/your_name/SHSY5Y/SRR12352389/SRR12352389.sra
fasterq-dump /labs/miura/your_name/SHSY5Y/SRR12352390/SRR12352390.sra

Alternatively you could use a for loop to accomplish the same thing

cd /labs/miura/your_name/SHSY5Y/

SAMPLES="SRR12352385 SRR12352386 SRR12352387 SRR12352388 SRR12352389 SRR12352390"
for SAMPLE in $SAMPLES; do
    fasterq-dump /labs/miura/your_name/SHSY5Y/${SAMPLE}/${SAMPLE}.sra
done

Using the for loop is a more versatile method (less typing involved!)

1.4 Running the STAR aligner

There are different methods to align RNA-Seq reads to the genome in a splicing aware manner. Here we use STAR because it is compatible with downstream analysis of alternative splicing with rMATS. If you are looking just for gene expression changes, then a much faster approach is to use pseudo-alignment methods such as Salmon or Kallisto.

This is a resource intensive job. So we’ve increased processors to 6 and mem to 100G.

To run STAR you need an index file for homo sapiens. I’ve provided one here:

/labs/Miura/JD_Ngn/STAR/index/homsap_genome_all_cdna_all_ensembl104_grch38_sjdboverhang75

Here is the slurm script for running STAR.

#!/bin/bash
#SBATCH --job-name=run_STAR
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 6
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=END
#SBATCH --mem=100G
#SBATCH --mail-user=miura@uchc.edu
#SBATCH --output=/labs/miura/your_name/SHSY5Y/eofiles/%x.%j.out
#SBATCH --error=/labs/miura/your_name/SHSY5Y/eofiles/%x.%j.err

module load STAR/2.7.1a

cd /projects/Karlsruhe/SHSY/
SAMPLES="SRR12352385 SRR12352386 SRR12352387 SRR12352388 SRR12352389 SRR12352390"

for SAMPLE in $SAMPLES; do
    STAR --runThreadN 6 \
    --genomeDir /labs/Miura/JD_Ngn/STAR/index/homsap_genome_all_cdna_all_ensembl104_grch38_sjdboverhang75 \
    --readFilesIn /labs/miura/your_name/SHSY5Y/${SAMPLE}.fastq \
    --outFileNamePrefix /labs/miura/your_name/SHSY5Y/outs/${SAMPLE} \
    --outSAMtype BAM SortedByCoordinate \
    --outBAMsortingThreadN 1 \
    --outFilterType BySJout \
    --outFilterMultimapNmax 20 \
    --alignSJoverhangMin 8 \
    --alignSJDBoverhangMin 1 \
    --outFilterMismatchNmax 999 \
    --outFilterMismatchNoverReadLmax 0.04 \
    --alignIntronMin 20 \
    --alignIntronMax 1000000 \
    --alignMatesGapMax 1000000
done

There are, of course, lots of options when running STAR. To learn more, see the documentation for STAR

1.5 Generating an index file (.bai)

Many files are generated by STAR for each sample. One important file is the .bam file. We would like to visualize the RNA-Seq reads on a local computer using Integrated Genomics Viewer IGV. In order to do so, we need to generate an index file (.bai file)

We will run this script using SBATCH as in the previous commands.

module load IGVtools/2.9.1
module load java/11.0.11
module load samtools/1.9

cd /labs/miura/your_name/SHSY5Y/STAR/outs
SAMPLES="SRR12352385 SRR12352386 SRR12352387 SRR12352388 SRR12352389 SRR12352390"

for SAMPLE in $SAMPLES; do
    igvtools index ${SAMPLE}Aligned.sortedByCoord.out.bam
    printf "\n=====================================\n"
    echo \
    "done now..."
done

1.6 Visualizing bam files using IGV

Now you have .bam and .bai files for each RNA-Seq sample! The .bam files are quite big, and depending on your local computer, they might be too big to conveniently handle. Let’s try to visualize them on IGV.

Download and Install IGV.

Open IGV and then in the top left corner, select “Human (GRCh38/hg38)”

File > load From File -select your .bam file. Just pick one for now.

Navigate to a gene of interest. For example “CALM1”. Do you see the coverage? If so, great!

Next, try to load all 6 tracks. Adjust the window sizes and other parameters to become comfortable. Navigate to these genes:

APBA2, SUGT1, DCUN1D5

Zoom in on their 3’UTR regions. Do you notice any interesting trends in read coverage over the 3’UTRs?

1.7 Generating smaller “bigwig” files for visualization

The bam files are sometimes tough to work with because of their size. We can do a lot with simpler files that do not show individual reads. Let’s try to generate bigwig files using deeptools

1.7.1 Conda

In order to install deeptools on the HPC, we will use conda. In order to learn how to use conda for installation purposes, go here [TO BE PROVIDED]. You’ll need to have deeptools installed, and a conda environment, “MyDeeptoolsEnv” for the below script to work.

Here’s a slurm script for using deeptools to generate .bw files for the .bams. Note that forward and reverse strand specific tracks are generated for each file. This can be useful for visualizing regions where there is convergent overlapping gene expresssion.

#!/bin/bash

#SBATCH --job-name=bigwig
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 10 
#SBATCH --partition=general
#SBATCH --qos=general
#SBATCH --mail-type=END
#SBATCH --mem=100G
#SBATCH --mail-user=miura@uchc.edu
#SBATCH --output=/labs/miura/your_name/SHSY5Y/eofiles/%x.%j.out
#SBATCH --error=/labs/miura/your_name/SHSY5Y/eofiles/%x.%j.err

source ~/.bashrc
conda activate MyDeeptoolsEnv

cd /labs/miura/yourfolder/SHSY/STAR/outs
SAMPLES="SRR12352385 SRR12352386 SRR12352387 SRR12352388 SRR12352389 SRR12352390"

for SAMPLE in $SAMPLES; do
    bamCoverage -b ${SAMPLE}Aligned.sortedByCoord.out.bam -o ${SAMPLE}_fwd.bw --filterRNAstrand forward
    bamCoverage -b ${SAMPLE}Aligned.sortedByCoord.out.bam -o ${SAMPLE}_rev.bw --filterRNAstrand reverse
done

You can now transfer these files to your local computer (which are much smaller), and try visualizing them on IGV.

RNA-Seq analysis tutorial (short read)

Sept 8, 2023