26 add support for illumina #39

Open: wants to merge 7 commits into base `main` (showing changes from 5 commits)
4 changes: 4 additions & 0 deletions .gitignore
@@ -0,0 +1,4 @@
assets/databases/emu_database/species_taxid.fasta
assets/databases/emu_database/taxonomy.tsv
assets/databases/krona/taxonomy/images.dmp
assets/databases/krona/taxonomy/taxonomy.tab
4 changes: 4 additions & 0 deletions CITATIONS.md
@@ -20,6 +20,10 @@ This pipeline uses code and infrastructure developed and maintained by the [nf-c
- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- [Cutadapt](https://journal.embnet.org/index.php/embnetjournal/article/view/200/479)

> Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17.1 (2011): pp. 10-12. doi: 10.14806/ej.17.1.200.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)
45 changes: 40 additions & 5 deletions README.md
@@ -18,7 +18,7 @@ Longfilt, EMU, and Krona. EMU is the tool that does the taxonomic profiling of
ensures portability and reproducibility across different computational
infrastructures. It has been tested on Linux and on mac M1 (not recommended,
quite slow). FastQC and Nanoplot performs quality control, Porechop_ABI trims
adapters (optional)), Longfilt filters the fastq-files such that only reads
adapters (optional), Longfilt filters the fastq-files such that only reads
that are close to 1500 bp are used (optional), EMU assigns taxonomic
classifications, and Krona visualises the result table from EMU. The pipeline
enables microbial community analysis, offering insights into the diversity in
@@ -35,9 +35,9 @@ and update software dependencies.

![Pipeline overview image](docs/images/gms_16s_20240415.png)

Roadmap/workflow. Only the NanoPore flow is available. Minor testing has been
done for PacBio and it seems to work. short read has no support yet. MultiQC
collects only info from FastQC and some information about software versions and
The Nanopore and short-read workflows are available.
Minor testing has been done for PacBio and it seems to work.
MultiQC collects only info from FastQC and some information about software versions and
pipeline info.

![Krona plot](https://github.com/genomic-medicine-sweden/gms_16S/assets/115690981/dcdd5da4-135c-48c4-b64f-82f0452b5520)
@@ -111,12 +111,47 @@ nextflow run main.nf \
--barcodes_samplesheet /[absolute path to barcode sample sheet]/sample_sheet_merge.csv
```

## Runs with short reads

When running gms_16s with short reads, primer sequences are trimmed with cutadapt by default.
The primer sequences can be provided in the samplesheet or passed as arguments (`--FW_primer`, `--RV_primer`). Primer trimming with cutadapt can be skipped with `--skip_cutadapt`.

```csv
sample,fastq_1,fastq_2,FW_primer,RV_primer
SAMPLE,/absolute_path/gms_16s/Sample_R1_001.fastq.gz,/absolute_path/gms_16s/Sample_R2_001.fastq.gz,GTGCCAGCMGCCGCGGTAA,GGACTACNVGGGTWTCTAAT
```
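
A short-read sample sheet like the one above can be sanity-checked before a run is launched. The sketch below is not part of gms_16S: `check_samplesheet` is a hypothetical helper, and it assumes the five-column header shown above and that primers are written with IUPAC nucleotide codes (the example primers use degenerate bases such as `M`, `N`, `V` and `W`).

```python
import csv
import io

# Header taken from the short-read example above.
EXPECTED_HEADER = ["sample", "fastq_1", "fastq_2", "FW_primer", "RV_primer"]
# IUPAC nucleotide codes: A, C, G, T plus the degenerate bases.
IUPAC = set("ACGTRYSWKMBDHVN")


def check_samplesheet(text):
    """Return a list of problems found in a short-read sample sheet (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_HEADER:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):
        for column in ("FW_primer", "RV_primer"):
            # Empty primers are tolerated: they may be passed on the command line instead.
            primer = (row[column] or "").upper()
            bad = sorted(set(primer) - IUPAC)
            if bad:
                problems.append(f"line {line_no}: {column} has non-IUPAC characters {bad}")
    return problems


sheet = (
    "sample,fastq_1,fastq_2,FW_primer,RV_primer\n"
    "SAMPLE,/absolute_path/gms_16s/Sample_R1_001.fastq.gz,"
    "/absolute_path/gms_16s/Sample_R2_001.fastq.gz,"
    "GTGCCAGCMGCCGCGGTAA,GGACTACNVGGGTWTCTAAT\n"
)
print(check_samplesheet(sheet))
```

An empty list means the header and primer alphabets look plausible; it says nothing about whether the FASTQ paths exist, which the pipeline checks itself in `INPUT_CHECK`.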


```bash
nextflow run main.nf \
--input sample_sheet.csv \
--outdir [absolute path]/gms_16S/results \
--db /[absolute path]/gms_16S/assets/databases/emu_database \
--seqtype sr \
-profile singularity \
--quality_filtering
```

```bash
nextflow run main.nf \
--input sample_sheet.csv \
--outdir [absolute path]/gms_16S/results \
--db /[absolute path]/gms_16S/assets/databases/emu_database \
--seqtype sr \
-profile singularity \
--quality_filtering \
--FW_primer AGCTGNCCTG \
--RV_primer TGCATNCTGA
```



## Sample sheets

There are two types of sample sheets that can be used: 1) If the fastq files
are already concatenated/merged i.e., the fastq-files in Nanopore barcode
directories have been concatenated already, the `--input` can be used.
`--input` expects a `.csv` sample sheet with 3 columns (note the header
`--input` expects a `.csv` sample sheet with 4 columns (note the header
names). It looks like this (See also the `examples` directory):

```csv
31 changes: 30 additions & 1 deletion conf/modules.config
@@ -26,7 +26,35 @@ process {
]
}

withName: CUTADAPT {
ext.args = { [
"--minimum-length 1",
"-O ${params.cutadapt_min_overlap}",
"-e ${params.cutadapt_max_error_rate}",
// Use primers from the samplesheet if available, otherwise fall back to params
meta.fw_primer ? "-g ${meta.fw_primer}" : (params.FW_primer ? "-g ${params.FW_primer}" : ''),
meta.rv_primer ? "-G ${meta.rv_primer}" : (params.RV_primer ? "-G ${params.RV_primer}" : ''),
params.retain_untrimmed ? '' : "--discard-untrimmed"
].findAll { it }.join(' ').trim() } // Remove empty strings and join arguments

ext.prefix = { "${meta.id}.trimmed" }

publishDir = [
[ path: { "${params.outdir}/cutadapt" },
mode: params.publish_dir_mode,
pattern: "*.log"
],
[ path: { "${params.outdir}/cutadapt/trimmed_reads" },
mode: params.publish_dir_mode,
pattern: "*.trim.fastq.gz",
enabled: params.save_intermediates
]
]
}



//
withName: MERGE_BARCODES_SAMPLESHEET {
publishDir = [
path: { "${params.outdir}/fastq_pass_merged" },
@@ -176,7 +204,8 @@ process {
]
]
}



}
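
The argument assembly in the `CUTADAPT` `ext.args` closure above can be traced with a small stand-alone sketch. This is a Python mirror of the Groovy logic for review purposes only, not code from the pipeline; `build_cutadapt_args` and the example dictionaries are hypothetical, with the parameter values taken from the defaults and README examples in this PR.

```python
def build_cutadapt_args(meta, params):
    """Mirror of the ext.args closure: sample-sheet primers win over pipeline params."""
    fw = meta.get("fw_primer") or params.get("FW_primer")
    rv = meta.get("rv_primer") or params.get("RV_primer")
    parts = [
        "--minimum-length 1",
        f"-O {params['cutadapt_min_overlap']}",
        f"-e {params['cutadapt_max_error_rate']}",
        f"-g {fw}" if fw else "",
        f"-G {rv}" if rv else "",
        "" if params.get("retain_untrimmed") else "--discard-untrimmed",
    ]
    # Drop empty strings, then join: same effect as findAll { it }.join(' ').
    return " ".join(p for p in parts if p)


params = {
    "cutadapt_min_overlap": 3,
    "cutadapt_max_error_rate": 0.1,
    "FW_primer": "AGCTGNCCTG",
    "RV_primer": "TGCATNCTGA",
    "retain_untrimmed": False,
}
# Forward primer comes from the sample sheet, reverse primer falls back to params.
meta = {"id": "SAMPLE", "fw_primer": "GTGCCAGCMGCCGCGGTAA", "rv_primer": None}
print(build_cutadapt_args(meta, params))
```

Two things follow from the trace: a primer given in the sample sheet silently overrides the command-line value, and `-G` is emitted whenever a reverse primer is available, regardless of whether the sample is single-end.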

51 changes: 51 additions & 0 deletions modules/nf-core/cutadapt/main.nf
@@ -0,0 +1,51 @@
process CUTADAPT {
tag "$meta.id"
label 'process_medium'

conda "${moduleDir}/environment.yml"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/cutadapt:4.6--py39hf95cd2a_1' :
'biocontainers/cutadapt:4.6--py39hf95cd2a_1' }"

input:
tuple val(meta), path(reads)

output:
tuple val(meta), path('*.trim.fastq.gz'), emit: reads
tuple val(meta), path('*.log') , emit: log
path "versions.yml" , emit: versions

when:
task.ext.when == null || task.ext.when

script:
def args = task.ext.args ?: ''
def prefix = task.ext.prefix ?: "${meta.id}"
def trimmed = meta.single_end ? "-o ${prefix}.trim.fastq.gz" : "-o ${prefix}_1.trim.fastq.gz -p ${prefix}_2.trim.fastq.gz"
"""
cutadapt \\
-Z \\
--cores $task.cpus \\
$args \\
$trimmed \\
$reads \\
> ${prefix}.cutadapt.log
cat <<-END_VERSIONS > versions.yml
"${task.process}":
cutadapt: \$(cutadapt --version)
END_VERSIONS
"""

stub:
def prefix = task.ext.prefix ?: "${meta.id}"
def trimmed = meta.single_end ? "${prefix}.trim.fastq.gz" : "${prefix}_1.trim.fastq.gz ${prefix}_2.trim.fastq.gz"
"""
touch ${prefix}.cutadapt.log
touch ${trimmed}

cat <<-END_VERSIONS > versions.yml
"${task.process}":
cutadapt: \$(cutadapt --version)
END_VERSIONS
"""
}
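
The output names produced by the `CUTADAPT` script block, combined with the `ext.prefix` of `${meta.id}.trimmed` set in `conf/modules.config`, can be checked against the `publishDir` glob with a short sketch. The function `trimmed_names` is a hypothetical stand-in for the Groovy string interpolation, and `fnmatch` is used here only as an approximation of the glob matching Nextflow performs.

```python
from fnmatch import fnmatch


def trimmed_names(sample_id, single_end):
    """Reproduce the read file names built in the CUTADAPT script block."""
    prefix = f"{sample_id}.trimmed"  # ext.prefix from conf/modules.config
    if single_end:
        return [f"{prefix}.trim.fastq.gz"]
    return [f"{prefix}_1.trim.fastq.gz", f"{prefix}_2.trim.fastq.gz"]


names = trimmed_names("SAMPLE", single_end=False)
log_name = "SAMPLE.trimmed.cutadapt.log"

print(names)
# Both read files should match the pattern used for output and publishDir.
print(all(fnmatch(n, "*.trim.fastq.gz") for n in names))
# The log should match the "*.log" pattern.
print(fnmatch(log_name, "*.log"))
```

With this prefix the published files carry both `.trimmed` and `.trim` in their names (for example `SAMPLE.trimmed_1.trim.fastq.gz`), which works but may be worth simplifying.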
58 changes: 58 additions & 0 deletions modules/nf-core/cutadapt/meta.yml
@@ -0,0 +1,58 @@
name: cutadapt
description: Perform adapter/quality trimming on sequencing reads
keywords:
- trimming
- adapter trimming
- adapters
- quality trimming
tools:
- cutadapt:
description: |
Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.
documentation: https://cutadapt.readthedocs.io/en/stable/index.html
doi: 10.14806/ej.17.1.200
licence: ["MIT"]
identifier: biotools:cutadapt
input:
- - meta:
type: map
description: |
Groovy Map containing sample information
e.g. [ id:'test', single_end:false ]
- reads:
type: file
description: |
List of input FastQ files of size 1 and 2 for single-end and paired-end data,
respectively.
output:
- reads:
- meta:
type: map
description: |
Groovy Map containing sample information
e.g. [ id:'test', single_end:false ]
- "*.trim.fastq.gz":
type: file
description: The trimmed/modified fastq reads
pattern: "*fastq.gz"
- log:
- meta:
type: map
description: |
Groovy Map containing sample information
e.g. [ id:'test', single_end:false ]
- "*.log":
type: file
description: cutadapt log file
pattern: "*cutadapt.log"
- versions:
- versions.yml:
type: file
description: File containing software versions
pattern: "versions.yml"
authors:
- "@drpatelh"
- "@kevinmenden"
maintainers:
- "@drpatelh"
- "@kevinmenden"
9 changes: 8 additions & 1 deletion nextflow.config
@@ -22,8 +22,15 @@ params {
keep_files = false
output_unclassified = true

//cutadapt
FW_primer = null
RV_primer = null
cutadapt_min_overlap = 3
cutadapt_max_error_rate = 0.1
retain_untrimmed = false
skip_cutadapt = false
save_intermediates = false

//
// porechop_abi
adapter_trimming = false

44 changes: 44 additions & 0 deletions nextflow_schema.json
@@ -134,6 +134,50 @@
"description": "minimum mean quality threshold"
}
}
},
"cutadapt_options": {
"title": "Cutadapt options",
"type": "object",
"description": "Options for cutadapt, which is used for removing primer sequences from short reads",
"default": "",
"properties": {
"FW_primer": {
"type": "string",
"description": "Forward primer"
},
"RV_primer": {
"type": "string",
"description": "Reverse primer"
},
"cutadapt_max_error_rate": {
"type": "number",
"default": 0.1,
"description": "Sets the maximum error rate for valid matches of primer sequences with reads for cutadapt (-e)."
},

"cutadapt_min_overlap": {
"type": "integer",
"default": 3,
"description": "Minimum overlap for valid matches of primer sequences with reads for cutadapt (-O)."
},

"retain_untrimmed": {
"type": "boolean",
"description": "Retain reads in which cutadapt found no primer; choose only if input reads are not expected to contain primer sequences.",
"default": false
},
"save_intermediates": {
"type": "boolean",
"default": false,
"description": "Save trimmed files from cutadapt"
},

"skip_cutadapt": {
"type": "boolean",
"default": false,
"description": "Skip primer trimming with cutadapt"
}
}
},
"krona_options": {
"title": "krona_options",
10 changes: 6 additions & 4 deletions subworkflows/local/input_check.nf
@@ -22,12 +22,14 @@ workflow INPUT_CHECK {

// Function to get list of [ meta, [ fastq_1, fastq_2 ] ]
def create_fastq_channel(LinkedHashMap row) {
// create meta map
// Create meta map
def meta = [:]
meta.id = row.sample
meta.single_end = row.single_end.toBoolean()
meta.id = row.sample
meta.single_end = row.single_end.toBoolean()
meta.fw_primer = row.FW_primer
meta.rv_primer = row.RV_primer

// add path(s) of the fastq file(s) to the meta map
// Add path(s) of the fastq file(s) to the meta map
def fastq_meta = []
if (!file(row.fastq_1).exists()) {
exit 1, "ERROR: Please check input samplesheet -> Read 1 FastQ file does not exist!\n${row.fastq_1}"