RNA-seq analysis pipeline¶
Description¶
This pipeline processes mRNA-seq fastq files and delivers both raw and
normalised/scaled count tables. It also outputs a QC report
per fastq file and a .bam mapping file that can be used, for instance,
with a genome browser.
The pipeline can process single- or paired-end data and is best
suited to Illumina sequencing data.
This pipeline analyses the raw RNA-seq data and produces two files containing the raw and normalized counts.
1. The raw fastq files will be trimmed for adaptors and quality checked with fastp.
2. The genome sequence FASTA file will be used for the mapping step of the trimmed reads using STAR.
3. A GTF annotation file will be used to obtain the raw counts using subread featureCounts.
4. The raw counts will be scaled by a custom R function that implements the DESeq2 median of ratios method to generate the scaled (“normalized”) counts.
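
To make the scaling step more concrete, here is a minimal Python sketch of the median of ratios idea. It is not the pipeline's R function, and the function name `median_of_ratios` and the toy gene names and counts are made up for illustration. It assumes the usual formulation: per-gene geometric means are computed over genes with non-zero counts in every sample, and each sample's size factor is the median of its count-to-geometric-mean ratios.

```python
import math
from statistics import median


def median_of_ratios(counts):
    """Scale raw counts with a median of ratios approach (illustrative sketch).

    counts: dict mapping a gene name to a list of raw counts, one per sample.
    Returns (size_factors, scaled_counts).
    """
    n_samples = len(next(iter(counts.values())))

    # Log geometric mean per gene, restricted to genes with non-zero counts
    # in every sample (the logarithm of zero is undefined).
    log_geo_means = {
        gene: sum(math.log(c) for c in row) / n_samples
        for gene, row in counts.items()
        if all(c > 0 for c in row)
    }

    # Size factor of a sample = median over genes of count / geometric mean.
    size_factors = []
    for j in range(n_samples):
        log_ratios = [
            math.log(counts[gene][j]) - lgm for gene, lgm in log_geo_means.items()
        ]
        size_factors.append(math.exp(median(log_ratios)))

    # Divide every raw count by the size factor of its sample.
    scaled = {
        gene: [c / s for c, s in zip(row, size_factors)]
        for gene, row in counts.items()
    }
    return size_factors, scaled


# Toy data (hypothetical): the second sample is sequenced twice as deeply.
toy_counts = {"geneA": [10, 20], "geneB": [100, 200], "geneC": [5, 10], "geneD": [0, 3]}
toy_size_factors, toy_scaled = median_of_ratios(toy_counts)
print(toy_size_factors)
print(toy_scaled["geneA"])
```

In this toy case the two size factors differ by a factor of two, so the scaled counts of a gene become comparable between the two samples. For real analyses, rely on the scaled_counts.tsv file produced by the pipeline or on DESeq2 itself.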
Input files¶
- RNA-seq fastq files as listed in the config/samples.tsv file. Specify a sample name (e.g. “Sample_A”) in the sample column and the paths to the forward read (fq1) and to the reverse read (fq2). If you have single-end reads, leave the fq2 column empty.
- A genomic reference in FASTA format. For instance, a fasta file containing the 12 chromosomes of tomato (Solanum lycopersicum).
- A genome annotation file in the `GTF format <https://useast.ensembl.org/info/website/upload/gff.html>`__. You can convert a GFF annotation file into GTF with the gffread program from Cufflinks: gffread my.gff3 -T -o my.gtf. :warning: for featureCounts to work, the feature in the GTF file should be exon while the meta-feature has to be transcript_id.
Below is an example of a GTF file format. :warning: a real GTF file does not have column names (seqname, source, etc.). Remove all non-data rows.
| seqname | source | feature | start | end | score | strand | frame | attributes |
|---|---|---|---|---|---|---|---|---|
| SL4.0ch01 | maker_ITAG | CDS | 279 | 743 | . |  | 0 | transcript_id "Solyc01g004000.1.1"; gene_id "gene:Solyc01g004000.1"; gene_name "Solyc01g004000.1"; |
| SL4.0ch01 | maker_ITAG | exon | 1173 | 1616 | . |  | . | transcript_id "Solyc01g004002.1.1"; gene_id "gene:Solyc01g004002.1"; gene_name "Solyc01g004002.1"; |
| SL4.0ch01 | maker_ITAG | exon | 3793 | 3971 | . |  | . | transcript_id "Solyc01g004002.1.1"; gene_id "gene:Solyc01g004002.1"; gene_name "Solyc01g004002.1"; |
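
The GTF requirements above (nine tab-separated columns, no header row, exon features carrying a transcript_id attribute) can be checked before launching the pipeline. The following Python sketch is one possible sanity check, not part of the pipeline; the function name `check_gtf_lines` and the example rows are hypothetical.

```python
def check_gtf_lines(lines):
    """Return (number_of_exon_rows, problems) for an iterable of GTF lines."""
    n_exon = 0
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines

        fields = line.split("\t")
        if len(fields) != 9:
            problems.append(
                f"line {lineno}: expected 9 tab-separated fields, found {len(fields)}"
            )
            continue

        feature, start, end, attributes = fields[2], fields[3], fields[4], fields[8]

        # A column-name row (seqname, source, ...) fails this test.
        if not (start.isdigit() and end.isdigit()):
            problems.append(f"line {lineno}: start/end are not integers (header row?)")
            continue

        if feature == "exon":
            n_exon += 1
            if "transcript_id" not in attributes:
                problems.append(f"line {lineno}: exon row without transcript_id")

    if n_exon == 0:
        problems.append("no row with feature 'exon' found")
    return n_exon, problems


# Hypothetical two-line example: a header row followed by one valid exon row.
example = [
    "seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe\tattributes",
    'chr1\tsome_source\texon\t100\t200\t.\t+\t.\ttranscript_id "tx1"; gene_id "g1";',
]
n_exon_rows, found_problems = check_gtf_lines(example)
print(n_exon_rows, found_problems)
```

To check a real file, pass an open file handle, for example `check_gtf_lines(open("my.gtf"))`.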
Output files¶
- A table of raw counts called raw_counts.txt: this table can be used to perform a differential gene expression analysis with DESeq2.
- A table of DESeq2-normalised counts called scaled_counts.tsv: this table can be used to perform an Exploratory Data Analysis with a PCA, heatmaps, sample clustering, etc.
- fastp QC reports: one per fastq file.
- bam files: one per fastq file (or pair of fastq files).
Prerequisites: what you should know before using this pipeline¶
- Some command of the Unix Shell to connect to a remote server where you will execute the pipeline. You can find a good tutorial from the Software Carpentry Foundation here and another one from Berlin Bioinformatics here.
- Some command of the Unix Shell to transfer datasets to and from a remote server (to transfer sequencing files and retrieve the results/ folder). The Berlin Bioinformatics Unix beginner guide (available here) should be sufficient for that (check the wget and scp commands).
- An understanding of the steps of a canonical RNA-Seq analysis (trimming, alignment, etc.). You can find some info here.
Content of this GitHub repository¶
- Snakefile: a master file that contains the desired outputs and the rules to generate them from the input files.
- config/samples.tsv: a file containing sample names and the paths to the forward and, for paired-end data, reverse reads. This file has to be adapted to your sample names before running the pipeline.
- config/config.yaml: the configuration file that makes the Snakefile adaptable to any input files, genome and rule parameters.
- config/refs/: a folder containing:
  - a genomic reference in fasta format. The S_lycopersicum_chromosomes.4.00.chrom1.fa file is provided for testing purposes.
  - a GTF annotation file. The ITAG4.0_gene_models.sub.gtf file is provided for testing purposes.
- .fastq/: a (hidden) folder containing subsetted paired-end fastq files used to test the pipeline locally. They were generated using Seqtk: seqtk sample -s100 <inputfile> 250000 > <output file>. This folder should contain the fastq files of the paired-end RNA-seq data you want to run.
- envs/: a folder containing the environments needed for the pipeline:
  - The environment.yaml file is used by the conda package manager to create a working environment (see below).
  - The Dockerfile is used to build the Docker image by referring to the environment.yaml file (see below).
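
As an illustration of the config/samples.tsv layout (a sample column plus fq1 and fq2 paths, with fq2 left empty for single-end reads), the Python sketch below parses such a table and flags each sample as paired-end or single-end. This is not how the Snakefile itself reads the file; the function name `read_samples` and the fastq paths are hypothetical.

```python
import csv
import io


def read_samples(handle):
    """Parse a tab-separated sample sheet with the columns sample, fq1 and fq2."""
    reader = csv.DictReader(handle, delimiter="\t")
    samples = {}
    for row in reader:
        # An empty or missing fq2 value means single-end reads.
        fq2 = (row.get("fq2") or "").strip()
        samples[row["sample"]] = {
            "fq1": row["fq1"].strip(),
            "fq2": fq2 or None,
            "paired": bool(fq2),
        }
    return samples


# Hypothetical sample sheet: one paired-end and one single-end sample.
example_tsv = (
    "sample\tfq1\tfq2\n"
    "Sample_A\tfastq/Sample_A_R1.fastq.gz\tfastq/Sample_A_R2.fastq.gz\n"
    "Sample_B\tfastq/Sample_B_R1.fastq.gz\t\n"
)
for name, info in read_samples(io.StringIO(example_tsv)).items():
    print(name, "paired-end" if info["paired"] else "single-end")
```

With a real file, open it with `open("config/samples.tsv", newline="")` and pass the handle to the function.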
Installation and usage (local machine)¶
Installation¶
You will need a local copy of the GitHub snakemake_rnaseq repository
on your machine. You can either:
- use git in the shell: git clone git@github.com:BleekerLab/snakemake_rnaseq.git
- click on “Clone or download” and select download.

Then navigate inside the snakemake_rnaseq folder using Shell commands.
Usage¶
Configuration :pencil2:¶
Make sure you have changed the parameters in the
config/config.yaml file to suit your needs. This file specifies where to
find the sample data file, the genomic and transcriptomic reference fasta
files to use, and the parameters for certain rules.
It is used so that the Snakefile does not need to be changed
when locations or parameters need to be changed.
Option 1: conda (easiest)
Create a conda environment where core software such as Snakemake will be installed.
1. Install the Miniconda3 distribution (>= Python 3.7 version) for your OS (Windows, Linux or Mac OS X).
2. Inside a Shell window (command line interface), create a virtual environment named rnaseq using the envs/environment.yaml file with the following command: conda env create --name rnaseq --file envs/environment.yaml
3. Then, before you run the Snakemake pipeline, activate this virtual environment with source activate rnaseq (or conda activate rnaseq with conda 4.4 and later).
While a conda environment will in most cases work just fine, Docker
is the recommended solution as it increases pipeline execution
reproducibility.
Option 2: Docker (recommended)
1. Install Docker.
2. Open a Shell window and type: docker pull bleekerlab/snakemake_rnaseq:4.7.12 to retrieve a Docker image that includes the software required by the pipeline (Snakemake, conda and many others).
3. Run the pipeline on your system with: docker run --rm -v $PWD:/home/snakemake/ bleekerlab/snakemake_rnaseq:4.7.12 and add any options for snakemake (-n, --cores 10, etc.).

The image was built using a Dockerfile based on the 4.7.12 Miniconda3 official Docker image.
Option 3: Singularity
1. Install Singularity.
2. Open a Shell window and type: singularity pull docker://bleekerlab/snakemake_rnaseq:4.7.12 to retrieve a Docker image that includes the software required by the pipeline (Snakemake, conda and many others). The image is saved as snakemake_rnaseq_4.7.12.sif.
3. Run the pipeline on your system with singularity run snakemake_rnaseq_4.7.12.sif and add any options for snakemake (-n, --cores 10, etc.). The directory where the sif file is stored will automatically be mapped to /home/snakemake. Results will be written to a folder named $PWD/results/ (you can change results to something you like in the result_dir parameter of the config.yaml).
Dry run¶
- With conda: use snakemake -np to perform a dry run that prints out the rules and commands.
- With Docker: use the docker run command given under Option 2 above and add the -n option for snakemake.
Real run¶
With conda: snakemake --cores 10
Installation and usage (HPC cluster)¶
Installation¶
You will need a local copy of the GitHub snakemake_rnaseq repository
on your machine. On an HPC system, you will have to clone it using the
Shell command line:
git clone git@github.com:BleekerLab/snakemake_rnaseq.git
Then navigate inside the snakemake_rnaseq folder using Shell commands.
References :green_book:¶
Authors¶
Marc Galland, m.galland@uva.nl
Tijs Bliek, m.bliek@uva.nl
Frans van der Kloet, f.m.vanderkloet@uva.nl
Acknowledgments :clap:¶
Johannes Köster; creator of Snakemake.