The Snakemake workflow management system is a tool to create reproducible and scalable data analyses.
The Snakemake language extends the Python language, adding syntactic structures for rule definition and additional controls. All added syntactic structures begin with a keyword followed by a code block that is either on the same line or indented across multiple lines. The resulting syntax resembles that of original Python constructs.
A Snakemake workflow is defined by specifying rules in a Snakefile. The rules decompose the workflow into small steps by specifying how to create sets of output files from sets of input files. Snakemake automatically determines the dependencies between the rules by matching file names.
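For illustration, a minimal rule might look like the sketch below (the file names and the samtools command are hypothetical and not part of this tutorial's pipeline): the rule keyword opens a named block, and the input, output and shell keywords each introduce their own indented values.

rule sort_bam:
    input:
        'data/{sample}.bam'
    output:
        'data/{sample}.sorted.bam'
    shell:
        'samtools sort -o {output} {input}'

Any other rule whose input matches data/{sample}.sorted.bam would implicitly depend on this rule.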
This tutorial presents a bioinformatics pipeline using Snakemake
and :ref:`the Reproducibility Framework (RF)
<Reproducibility-Framework>`.
It uses the Salk Institute for Biological Studies dataset on the two classes of L1-associated somatic variants in the human brain.
Note
Bioproject: PRJEB10849
SRA Study: ERP012147
Next, let's create the pipeline directory structure for this tutorial.
[userid@login03 ~]$ mkdir -p pipeline/_h
[userid@login03 ~]$ mkdir -p pipeline/cutadapt/_h
[userid@login03 ~]$ mkdir -p pipeline/cutadapt/bwamem/_h
[userid@login03 ~]$ mkdir -p pipeline/cutadapt/genome/_h
[userid@login03 ~]$ mkdir -p pipeline/cutadapt/bwamem/rmdup/_h
[userid@login03 ~]$ mkdir -p pipeline/cutadapt/bwamem/rmdup/tags/_h
[userid@login03 ~]$ mkdir -p pipeline/cutadapt/bwamem/rmdup/tags/tabix/_h
Our pipeline expects a directory structure like this, as shown by rf status:
[userid@login03 pipeline]$ rf status
. done (step done)
└── cutadapt done
    ├── bwamem ready to run (step with the run file, but not performed)
    │   └── rmdup no run script
    │       └── tags no run script (step without the run file)
    │           └── tabix no run script
    └── genome no _h
        └── hg19 ready to run
            ├── bwa ready to run
            └── chromsizes ready to run
Once the scripts are ready to run, as is the case with the genome step, just run the rf command recursively using the -r flag.
[userid@login03 pipeline]$ cd genome/
[userid@login03 genome]$ rf sbatch -r .
Start /home/userid/pipeline/genome/hg19: 2022-05-04 17:30:53-04:00
End /home/userid/pipeline/genome/hg19: 2022-05-04 17:30:53-04:00
Start /home/userid/pipeline/genome/hg19/bwa: 2022-05-04 17:30:53-04:00
End /home/userid/pipeline/genome/hg19/bwa: 2022-05-04 17:30:53-04:00
Start /home/userid/pipeline/genome/hg19/chromsizes: 2022-05-04 17:30:53-04:00
End /home/userid/pipeline/genome/hg19/chromsizes: 2022-05-04 17:30:53-04:00
[userid@login03 genome]$ rf status
. no _h
└── hg19 done
├── bwa done
└── chromsizes done
To download the sequence data files using the SRA Toolkit, you need to create a run file in the pipeline/_h folder.
#!/bin/bash
#SBATCH -J sra_tools
#SBATCH -p defq
#SBATCH -N 1
#SBATCH --time=2:00:00
#SBATCH --cpus-per-task=4
#SBATCH --output=Array_test.%A_%a.out
#SBATCH --array=1-101
ml sra-tools/3.0.0
# Bioproject PRJEB10849 samples
sra_numbers=($(echo {1016570..1016671}))
sra_id='ERR'${sra_numbers[ $SLURM_ARRAY_TASK_ID - 1 ]}
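# SLURM_ARRAY_TASK_ID runs from 1 to 101, so task 1 fetches ERR1016570, task 2 fetches ERR1016571, and so on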
prefetch --max-size 100G $sra_id --force yes --verify no
fastq-dump --outdir . --gzip --skip-technical --readids --read-filter pass --dumpbase --split-3 --clip ${sra_id}/${sra_id}.sra
rm $sra_id -Rf
The rf command will call the run script to retrieve SRA Normalized Format files with full base quality scores, and store the resulting fastq files in the _m folder.
[userid@login03 ~]$ cd pipeline/
[userid@login03 ~]$ chmod +x _h/run
[userid@login03 pipeline]$ rf sbatch -v .
all: /home/userid/pipeline/_m/SUCCESS
.ONESHELL:
/home/userid/pipeline/_m/SUCCESS:
echo -n "Start /home/userid/pipeline: "; date --rfc-3339=seconds
mkdir /home/userid/pipeline/_m
cd /home/userid/pipeline/_m
sbatch ../_h/run > nohup.out 2>&1
touch SUCCESS
echo -n "End /home/userid/pipeline: "; date --rfc-3339=seconds
Start /home/userid/pipeline: 2022-04-27 16:14:52-04:00
End /home/userid/pipeline: 2022-04-27 16:14:52-04:00
Note
- Writing Workflows : "In Snakemake, workflows are specified as Snakefiles. Inspired by GNU Make, a Snakefile contains rules that denote how to create output files from input files. Dependencies between rules are handled implicitly, by matching filenames of input files against output files. Thereby wildcards can be used to write general rules."
- Snakefiles and Rules : "A Snakemake workflow defines a data analysis in terms of rules that are specified in the Snakefile."
We will create a hypothetical scenario with a chain of dependent steps, where, for example, Level 5 (tabix) depends on Level 4 (tags), and so on.
Note
Level 1 (cutadapt) -> Level 2 (bwamem) -> Level 3 (rmdup) -> Level 4 (tags) -> Level 5 (tabix)
Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads. It helps with these trimming tasks by finding the adapter or primer sequences in an error-tolerant way.
[userid@login03 pipeline]$ cd cutadapt/
[userid@login03 cutadapt]$ vi _h/run
#!/bin/bash
#SBATCH -J cutadapt
#SBATCH -p defq
#SBATCH --time=2:00:00
#SBATCH --cpus-per-task=4
#SBATCH --output=cutadapt.job.%j.out
module load snakemake/7.6.0
# Syntax to run it on Rockfish cluster
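# --jobs 101 allows up to 101 cluster jobs to be submitted at once;
# --latency-wait 240 waits up to 240 s for output files to appear on the shared filesystem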
snakemake --jobs 101 --latency-wait 240 --cluster 'sbatch --parsable --distribution=arbitrary' --snakefile ../_h/snakemake.slurm.script
So, we need to create a helper script, rc, to perform the reverse complement: given a DNA sequence as a string, it returns the complemented sequence in reverse order.
[userid@login03 cutadapt]$ vi ~/.local/bin/rc
#!/bin/bash
if [ ! -z "$1" ]; then
echo "$1" | tr "[ATGCatgc]" "[TACGtacg]" | rev
else
echo ""
echo "usage: rc DNASEQUENCE"
echo ""
fi
[userid@login03 cutadapt]$ chmod +x ~/.local/bin/rc
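For clarity, the reverse-complement operation can also be sketched in Python (an illustrative snippet only; the pipeline itself calls the rc shell helper above):

def rev_comp(seq):
    """Return the reverse complement of a DNA sequence, as the rc helper does."""
    complement = str.maketrans('ATGCatgc', 'TACGtacg')
    return seq.translate(complement)[::-1]

For example, rev_comp('AATG') returns 'CATT'.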
[userid@login03 cutadapt]$ vi _h/snakemake.slurm.script
import glob
import os.path
import itertools
SOURCE_DIR = '../../_m'
EXT = '_pass_1.fastq.gz'
def sample_dict_iter(path, ext):
for filename in glob.iglob(path+'/*'+ext):
sample = os.path.basename(filename)[:-len(ext)]
yield sample, {'r1_in': SOURCE_DIR + '/' + sample + '_pass_1.fastq.gz',
'r2_in': SOURCE_DIR + '/' + sample + '_pass_2.fastq.gz'
}
SAMPLE_DICT = {k:v for k,v in sample_dict_iter(SOURCE_DIR, EXT)}
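# Illustrative example: an input file ../../_m/ERR1016570_pass_1.fastq.gz would add the entry
# {'ERR1016570': {'r1_in': '../../_m/ERR1016570_pass_1.fastq.gz', 'r2_in': '../../_m/ERR1016570_pass_2.fastq.gz'}}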
# ensure errors propagate along piped shell commands
shell.prefix("set -o pipefail; ")
rule all:
input:
expand('../_m/{sample}_{suffix}.fastq.gz',
sample=SAMPLE_DICT.keys(),
suffix=['R1','R2'])
rule cutadapt:
input:
r1 = lambda x: SAMPLE_DICT[x.sample]['r1_in'],
r2 = lambda x: SAMPLE_DICT[x.sample]['r2_in']
output:
r1 = '../_m/{sample}_R1.fastq.gz',
r2 = '../_m/{sample}_R2.fastq.gz'
params:
sample = '{sample}'
shell:
'''
module load cutadapt/3.2
export PATH=$HOME'/.local/bin:'$PATH
R1_ADAPTER='AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
R2_ADAPTER='CAAGCAGAAGACGGCATACGAGANNNNNNNGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
NESTED_PRIMER='TAACTAACCTGCACAATGTGCAC'
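# The 3' adapter trimmed from the end of each read is the reverse complement of the
# sequence expected at the front of the opposite read (computed with the rc helper below)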
R1_FRONT=${{R1_ADAPTER}}
R2_FRONT=${{R2_ADAPTER}}${{NESTED_PRIMER}}
R1_END=`rc ${{R2_FRONT}}`
R2_END=`rc ${{R1_FRONT}}`
QUALITY_BASE=33
QUALITY_CUTOFF=28
MINIMUM_LENGTH=36
ADAPTOR_OVERLAP=5
ADAPTOR_TIMES=4
cutadapt -j 0 --quality-base=${{QUALITY_BASE}} --quality-cutoff=${{QUALITY_CUTOFF}} --minimum-length=${{MINIMUM_LENGTH}} --overlap=${{ADAPTOR_OVERLAP}} --times=${{ADAPTOR_TIMES}} --front=${{R1_FRONT}} --adapter=${{R1_END}} --paired-output tmp.2.{params.sample}.fastq -o tmp.1.{params.sample}.fastq {input.r1} {input.r2} > {params.sample}_R1.cutadapt.out
cutadapt -j 0 --quality-base=${{QUALITY_BASE}} --quality-cutoff=${{QUALITY_CUTOFF}} --minimum-length=${{MINIMUM_LENGTH}} --overlap=${{ADAPTOR_OVERLAP}} --times=${{ADAPTOR_TIMES}} --front=${{R2_FRONT}} --adapter=${{R2_END}} --paired-output {output.r1} -o {output.r2} tmp.2.{params.sample}.fastq tmp.1.{params.sample}.fastq > {params.sample}_R2.cutadapt.out
rm -f tmp.2.{params.sample}.fastq tmp.1.{params.sample}.fastq
'''
[userid@login03 cutadapt]$ chmod +x _h/run
[userid@login03 cutadapt]$ rf sbatch .
Start /home/userid/pipeline/cutadapt: 2022-05-04 14:35:06-04:00
End /home/userid/pipeline/cutadapt: 2022-05-04 14:35:06-04:00
[userid@login02 _m]$ sqme
USER ACCOUNT JOBID PARTITION NAME NODES CPUS TIME_LIMIT TIME NODELIST ST REASON
userid rfadmin 4157118 defq snakejob.c 1 1 1:00:00 21:15 c221 R None
userid rfadmin 4157146 defq snakejob.c 1 1 1:00:00 21:15 c301 R None
userid rfadmin 4157061 defq snakejob.c 1 1 1:00:00 21:26 c157 R None
userid rfadmin 4157072 defq snakejob.c 1 1 1:00:00 21:26 c132 R None
userid rfadmin 4157102 defq snakejob.c 1 1 1:00:00 21:26 c303 R None
userid rfadmin 4157046 defq cutadapt 1 1 2:00:00 21:28 c124 R None
To monitor all submitted jobs, run tail -f on the file called cutadapt.job.<JOBID>.out.
[userid@login03 cutadapt]$ cat _m/cutadapt.job.4157046.out
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 200
Job stats:
job count min threads max threads
-------- ------- ------------- -------------
all 1 1 1
cutadapt 101 1 1
total 102 1 1
Select jobs to execute...
[Wed May 4 14:48:20 2022]
rule cutadapt:
input: ../../_m/ERR1016599_pass_1.fastq.gz, ../../_m/ERR1016599_pass_2.fastq.gz
output: ../_m/ERR1016599_R1.fastq.gz, ../_m/ERR1016599_R2.fastq.gz
jobid: 26
wildcards: sample=ERR1016599
resources: mem_mb=1709, disk_mb=1709, tmpdir=/tmp
Submitted job 26 with external jobid '4157048'.
[Wed May 4 14:48:20 2022]
rule cutadapt:
input: ../../_m/ERR1016661_pass_1.fastq.gz, ../../_m/ERR1016661_pass_2.fastq.gz
output: ../_m/ERR1016661_R1.fastq.gz, ../_m/ERR1016661_R2.fastq.gz
jobid: 86
wildcards: sample=ERR1016661
resources: mem_mb=3245, disk_mb=3245, tmpdir=/tmp
........
........
........
........
[Wed May 4 14:48:30 2022]
rule cutadapt:
input: ../../_m/ERR1016581_pass_1.fastq.gz, ../../_m/ERR1016581_pass_2.fastq.gz
output: ../_m/ERR1016581_R1.fastq.gz, ../_m/ERR1016581_R2.fastq.gz
jobid: 85
wildcards: sample=ERR1016581
resources: mem_mb=1891, disk_mb=1891, tmpdir=/tmp
Submitted job 85 with external jobid '4157148'.
[Wed May 4 14:49:33 2022]
Finished job 37.
1 of 102 steps (1%) done
[Wed May 4 14:50:31 2022]
Finished job 30.
2 of 102 steps (2%) done
[Wed May 4 14:51:35 2022]
Finished job 16.
3 of 102 steps (3%) done
[Wed May 4 14:51:48 2022]
Finished job 25.
4 of 102 steps (4%) done
[Wed May 4 14:51:49 2022]
Finished job 87.
5 of 102 steps (5%) done
It is also possible to see the output for each processed sample by monitoring the file called slurm-<snakejobid>.out.
[userid@login02 _m]$ cat slurm-4157147.out
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Select jobs to execute...
[Wed May 4 14:48:37 2022]
rule cutadapt:
input: ../../_m/ERR1016667_pass_1.fastq.gz, ../../_m/ERR1016667_pass_2.fastq.gz
output: ../_m/ERR1016667_R1.fastq.gz, ../_m/ERR1016667_R2.fastq.gz
jobid: 0
wildcards: sample=ERR1016667
resources: mem_mb=1000, disk_mb=1000, tmpdir=/tmp
[Wed May 4 14:51:25 2022]
Finished job 0.
1 of 1 steps (100%) done
BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM.
[userid@login03 cutadapt]$ cd bwamem
[userid@login03 bwamem]$ vi _h/run
#!/bin/bash
#SBATCH -J bwamem
#SBATCH -p defq
#SBATCH --time=2:00:00
#SBATCH --cpus-per-task=4
#SBATCH --output=bwamem.job.%j.out
module load snakemake/7.6.0
# Syntax to run it on Rockfish cluster
snakemake --jobs 101 --latency-wait 240 --cluster 'sbatch --parsable --distribution=arbitrary' --snakefile ../_h/snakemake.slurm.script
[userid@login03 bwamem]$ vi _h/snakemake.slurm.script
import glob
import os.path
import itertools
SOURCE_DIR = '../../_m'
EXT = '_R1.fastq.gz'
def sample_dict_iter(path, ext):
for filename in glob.iglob(path+'/*'+ext):
sample = os.path.basename(filename)[:-len(ext)]
yield sample, {'r1_in': SOURCE_DIR + '/' + sample + '_R1.fastq.gz',
'r2_in': SOURCE_DIR + '/' + sample + '_R2.fastq.gz'
}
SAMPLE_DICT = {k:v for k,v in sample_dict_iter(SOURCE_DIR, EXT)}
# ensure errors propagate along piped shell commands
shell.prefix("set -o pipefail; ")
rule all:
input:
expand('../_m/{sample}.bam',
sample=SAMPLE_DICT.keys())
rule bwamem:
input:
r1 = lambda x: SAMPLE_DICT[x.sample]['r1_in'],
r2 = lambda x: SAMPLE_DICT[x.sample]['r2_in']
output:
'../_m/{sample}.bam'
params:
sample = '{sample}'
shell:
'''
module load bwa-mem/0.7.17 samtools/1.15.1
export PATH=$HOME'/.local/bin:'$PATH
GENOME='../../../../genome/bwa/_m/hg19.fa'
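# bwa mem: -T 19 outputs only alignments with score >= 19, -t 4 uses 4 threads;
# samtools view -S -b converts the SAM stream to BAM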
bwa mem -T 19 -t 4 ${{GENOME}} {input.r1} {input.r2} 2> {params.sample}.stderr | samtools view -S -b - > {output}
'''
[userid@login03 bwamem]$ chmod +x _h/run
[userid@login03 bwamem]$ rf sbatch .
rmdup is a script that is part of the SLAV-Seq protocol, written in Perl by Apuã Paquola, that reads .bam input files and uses samtools to handle paired-end and single-end reads.
[userid@login03 cutadapt]$ cd rmdup
[userid@login03 rmdup]$ vi _h/run
#!/bin/bash
#SBATCH -J rmdup
#SBATCH -p defq
#SBATCH --time=2:00:00
#SBATCH --cpus-per-task=4
#SBATCH --output=rmdup.job.%j.out
module load snakemake/7.6.0
# Syntax to run it on Rockfish cluster
snakemake --jobs 101 --latency-wait 240 --cluster 'sbatch --parsable --distribution=arbitrary' --snakefile ../_h/snakemake.slurm.script
[userid@login03 rmdup]$ vi _h/snakemake.slurm.script
import glob
import os.path
import itertools
SOURCE_DIR = '../../_m'
EXT = '.bam'
def sample_dict_iter(path, ext):
for filename in glob.iglob(path+'/*'+ext):
sample = os.path.basename(filename)[:-len(ext)]
yield sample, {'filename': filename}
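# Illustrative example: an input file ../../_m/ERR1016570.bam yields ('ERR1016570', {'filename': '../../_m/ERR1016570.bam'})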
SAMPLE_DICT = {k:v for k,v in sample_dict_iter(SOURCE_DIR, EXT)}
# ensure errors propagate along piped shell commands
shell.prefix("set -o pipefail; ")
rule all:
input:
expand('../_m/{sample}.bam', sample=SAMPLE_DICT.keys())
rule process_one_sample:
input:
lambda x: SAMPLE_DICT[x.sample]['filename']
output:
'../_m/{sample}.bam'
log:
stderr = '{sample}.stderr',
stdout = '{sample}.stdout'
shell:
'../_h/slavseq_rmdup.pl {input} {output}'
[userid@login03 rmdup]$ chmod +x _h/run
[userid@login03 rmdup]$ rf sbatch .
tags is a script that is part of the SLAV-Seq protocol, written in Perl by Apuã Paquola, that adds custom flags to BAM files.
[userid@login03 rmdup]$ cd tags
[userid@login03 tags]$ vi _h/run
#!/bin/bash
#SBATCH -J tags
#SBATCH -p defq
#SBATCH --time=2:00:00
#SBATCH --cpus-per-task=4
#SBATCH --output=tags.job.%j.out
module load snakemake/7.6.0
# Syntax to run it on Rockfish cluster
snakemake --jobs 101 --latency-wait 240 --cluster 'sbatch --parsable --distribution=arbitrary' --snakefile ../_h/snakemake.slurm.script
[userid@login03 tags]$ vi _h/snakemake.slurm.script
import glob
import os.path
import itertools
SOURCE_DIR = '../../_m'
EXT = '.bam'
def sample_dict_iter(path, ext):
for filename in glob.iglob(path+'/*'+ext):
sample = os.path.basename(filename)[:-len(ext)]
yield sample, {'filename': SOURCE_DIR + '/' + sample + '.bam'}
SAMPLE_DICT = {k:v for k,v in sample_dict_iter(SOURCE_DIR, EXT)}
# ensure errors propagate along piped shell commands
shell.prefix("set -o pipefail; ")
rule all:
input:
expand('../_m/{sample}.bam',
sample=SAMPLE_DICT.keys())
rule tags:
input:
'../../_m/{sample}.bam'
output:
'../_m/{sample}.bam'
params:
sample = '{sample}'
shell:
'''
module load samtools/1.15.1
export PERL5LIB=$HOME'/perl5/lib/perl5/'
export CONSENSUS='ATGTACCCTAAAACTTAGAGTATAATAAA'
export PATH=$HOME'/.local/bin:'$PATH
GENOME='../../../../../../genome/bwa/_m/hg19.fa'
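# PREFIX_LENGTH is the length of the exported CONSENSUS sequence plus 2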
PREFIX_LENGTH=`perl -e 'print length($ENV{{CONSENSUS}})+2'`
R1_FLANK_LENGTH=750
R2_FLANK_LENGTH=${{PREFIX_LENGTH}}
SOFT_CLIP_LENGTH_THRESHOLD=5
(samtools view -h {input} | ../_h/add_tags_hts.pl --genome_fasta_file ${{GENOME}} --prefix_length ${{PREFIX_LENGTH}} --consensus ${{CONSENSUS}} --r1_flank_length ${{R1_FLANK_LENGTH}} --r2_flank_length ${{R2_FLANK_LENGTH}} --soft_clip_length_threshold ${{SOFT_CLIP_LENGTH_THRESHOLD}} | samtools view -S -b - > {output}) 2> {params.sample}.stderr
'''
[userid@login03 tags]$ chmod +x _h/run
[userid@login03 tags]$ rf sbatch .
Tabix indexes a TAB-delimited genome position file (in.tab.bgz) and creates an index file (in.tab.bgz.tbi or in.tab.bgz.csi) when the region is absent from the command line.
[userid@login03 tags]$ cd tabix
[userid@login03 tabix]$ vi _h/run
#!/bin/bash
#SBATCH -J tabix
#SBATCH -p defq
#SBATCH --time=2:00:00
#SBATCH --cpus-per-task=4
#SBATCH --output=tabix.job.%j.out
module load snakemake/7.6.0
# Syntax to run it on Rockfish cluster
snakemake --jobs 101 --latency-wait 240 --cluster 'sbatch --parsable --distribution=arbitrary' --snakefile ../_h/snakemake.slurm.script
[userid@login03 tabix]$ vi _h/snakemake.slurm.script
import glob
import os.path
import itertools
import os
import sys
import warnings
import subprocess
SOURCE_DIR = '../../_m'
EXT = '.bam'
def sample_dict_iter(path, ext):
for filename in glob.iglob(path+'/*'+ext):
sample = os.path.basename(filename)[:-len(ext)]
yield sample, {'filename': SOURCE_DIR + '/' + sample + '.bam'}
SAMPLE_DICT = {k:v for k,v in sample_dict_iter(SOURCE_DIR, EXT)}
# ensure errors propagate along piped shell commands
shell.prefix("set -o pipefail; ")
rule all:
input:
expand('../_m/{sample}.{ext}',
sample=SAMPLE_DICT.keys(),
ext=['bgz', 'bgz.tbi'])
rule tabix:
input:
'../../_m/{sample}.bam'
output:
bgz = '../_m/{sample}.bgz',
tbi = '../_m/{sample}.bgz.tbi'
params:
sample = '{sample}'
shell:
'''
module load tabix/1.13 samtools/1.15.1 bzip2/1.0.8
export PATH=$HOME'/.local/bin:'$PATH
TMP_DIR='tmp.{params.sample}'
mkdir ${{TMP_DIR}}
export LC_ALL=C
( samtools view {input} | ../_h/sam_to_tabix.py 2>{params.sample}.stderr | sort --temporary-directory=${{TMP_DIR}} --buffer-size=10G -k1,1 -k2,2n -k3,3n | bgzip -c > {output.bgz} )
rmdir ${{TMP_DIR}}
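# tabix: -s 1, -b 2 and -e 3 take the sequence name, start and end from columns 1-3;
# -0 indicates zero-based coordinates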
tabix -s 1 -b 2 -e 3 -0 {output.bgz}
'''
[userid@login03 tabix]$ chmod +x _h/run
[userid@login03 tabix]$ rf sbatch .
Once you have coded the pipeline, just run :ref:`the Reproducibility Framework (RF) <Reproducibility-Framework>`.
├── pipeline no _h
│   ├── cutadapt ready to run
│   │   └── bwamem no run script
│   │       └── rmdup no run script
│   │           └── tags no run script
│   │               └── tabix no run script
│   └── genome no _h
│       └── hg19 ready to run
│           ├── bwa ready to run
│           └── chromsizes ready to run
You can run one level at a time, or use the -r option to run recursively. The rf command will then run each level in order: once level 1 finishes, it runs the next level, and so on consecutively.
[userid@login03 ~]$ interact -c 2 -t 120
[userid@c010 ~]$ cd pipeline
[userid@c010 ~]$ rf run -r .
Warning
So far, the rf command has only been validated to run in interactive mode.