genomic-medicine-sweden · AnderssonOlivia · Nov 18, 2024 · Nov 20, 2024 · Dec 2, 2024 · Jan 10, 2025
@@ -0,0 +1,4 @@
+assets/databases/emu_database/species_taxid.fasta
+assets/databases/emu_database/taxonomy.tsv
+assets/databases/krona/taxonomy/images.dmp
+assets/databases/krona/taxonomy/taxonomy.tab
@@ -20,6 +20,10 @@ This pipeline uses code and infrastructure developed and maintained by the [nf-c
 - [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
   > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
 
+- [Cutadapt](https://journal.embnet.org/index.php/embnetjournal/article/view/200/479)
+
+  > Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17.1 (2011): 10-12. doi: 10.14806/ej.17.1.200.
+
 ## Software packaging/containerisation tools
 
 - [Anaconda](https://anaconda.com)

@@ -18,7 +18,7 @@ Longfilt, EMU, and Krona. EMU is the tool that does the taxonomic profiling of
 ensures portability and reproducibility across different computational
 infrastructures. It has been tested on Linux and on mac M1 (not recommended,
 quite slow). FastQC and Nanoplot perform quality control, Porechop_ABI trims
-adapters (optional)), Longfilt filters the fastq-files such that only reads
+adapters (optional), Longfilt filters the fastq-files such that only reads
 that are close to 1500 bp are used (optional), EMU assigns taxonomic
 classifications, and Krona visualises the result table from EMU. The pipeline
 enables microbial community analysis, offering insights into the diversity in
@@ -35,9 +35,9 @@ and update software dependencies.
 
 ![Pipeline overview image](docs/images/gms_16s_20240415.png)
 
-Roadmap/workflow. Only the NanoPore flow is available. Minor testing has been
-done for PacBio and it seems to work. short read has no support yet. MultiQC
-collects only info from FastQC and some information about software versions and
+The Nanopore and short-read workflows are available.
+Minor testing has been done for PacBio, and it seems to work.
+MultiQC collects only information from FastQC and some information about software versions and
 pipeline info.
 
 ![Krona plot](https://github.com/genomic-medicine-sweden/gms_16S/assets/115690981/dcdd5da4-135c-48c4-b64f-82f0452b5520)
@@ -111,12 +111,47 @@ nextflow run main.nf \
   --barcodes_samplesheet /[absolute path to barcode sample sheet]/sample_sheet_merge.csv
 ```
 
+## Runs with short reads
+
+When running gms_16s with short reads, cutadapt trims the primer sequences by default.
+The primer sequences can be provided in the samplesheet (columns `FW_primer` and `RV_primer`) or passed as arguments (`--FW_primer`, `--RV_primer`); primers given in the samplesheet take precedence. Primer trimming with cutadapt can be skipped with `--skip_cutadapt`.
+
+```csv
+sample,fastq_1,fastq_2,FW_primer,RV_primer
+SAMPLE,/absolute_path/gms_16s/Sample_R1_001.fastq.gz,/absolute_path/gms_16s/Sample_R2_001.fastq.gz,GTGCCAGCMGCCGCGGTAA,GGACTACNVGGGTWTCTAAT
+```
+
+
+```bash
+nextflow run main.nf \
+  --input sample_sheet.csv \
+  --outdir [absolute path]/gms_16S/results \
+  --db /[absolute path]/gms_16S/assets/databases/emu_database \
+  --seqtype sr \
+  -profile singularity \
+  --quality_filtering
+```
+
+```bash
+nextflow run main.nf \
+  --input sample_sheet.csv \
+  --outdir [absolute path]/gms_16S/results \
+  --db /[absolute path]/gms_16S/assets/databases/emu_database \
+  --seqtype sr \
+  -profile singularity \
+  --quality_filtering \
+  --FW_primer AGCTGNCCTG \
+  --RV_primer TGCATNCTGA
+```
+
+
+
 ## Sample sheets
 
 There are two types of sample sheets that can be used: 1) If the fastq files
 are already concatenated/merged i.e., the fastq-files in Nanopore barcode
 directories have been concatenated already, the `--input` can be used.
-`--input` expects a `.csv` sample sheet with 3 columns (note the header
+`--input` expects a `.csv` sample sheet with 4 columns (note the header
 names). It looks like this (See also the `examples` directory):
 
 ```csv

@@ -26,7 +26,35 @@ process {
         ]
     }
 
+    withName: CUTADAPT {
+        ext.args = { [
+            "--minimum-length 1",
+            "-O ${params.cutadapt_min_overlap}",
+            "-e ${params.cutadapt_max_error_rate}",
+            // Use primers from the samplesheet if available, otherwise fall back to params
+            meta.fw_primer ? "-g ${meta.fw_primer}" : (params.FW_primer ? "-g ${params.FW_primer}" : ''),
+            meta.rv_primer ? "-G ${meta.rv_primer}" : (params.RV_primer ? "-G ${params.RV_primer}" : ''),
+            params.retain_untrimmed ? '' : "--discard-untrimmed"
+        ].findAll { it }.join(' ').trim() } // Remove empty strings and join arguments
+
+        ext.prefix = { "${meta.id}.trimmed" }
 
+        publishDir = [
+            [   path: { "${params.outdir}/cutadapt" },
+                mode: params.publish_dir_mode,
+                pattern: "*.log"
+            ],
+            [   path: { "${params.outdir}/cutadapt/trimmed_reads" },
+                mode: params.publish_dir_mode,
+                pattern: "*.trim.fastq.gz",
+                enabled: params.save_intermediates
+            ]
+        ]
+    }
+
     withName: MERGE_BARCODES_SAMPLESHEET {
         publishDir = [
             path: { "${params.outdir}/fastq_pass_merged" },
@@ -176,7 +204,8 @@ process {
             ]
         ]
     }
- 
+
 
 
 }
+
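Review note: the argument assembly in the `CUTADAPT` config block above is easier to check when traced outside Nextflow. The following is a minimal shell sketch of that logic, not pipeline code. `build_cutadapt_args` and its positional arguments are hypothetical names introduced for illustration, and the sketch assumes that an empty samplesheet field behaves like a missing one (as an empty string does under Groovy truthiness).

```shell
# Hypothetical helper mirroring the ext.args closure of the CUTADAPT config block.
# Arguments: SHEET_FW SHEET_RV PARAM_FW PARAM_RV RETAIN_UNTRIMMED [MIN_OVERLAP] [MAX_ERROR_RATE]
build_cutadapt_args() (
    sheet_fw=$1; sheet_rv=$2; param_fw=$3; param_rv=$4; retain=$5
    min_overlap=${6:-3}      # default of params.cutadapt_min_overlap
    max_error=${7:-0.1}      # default of params.cutadapt_max_error_rate

    args="--minimum-length 1 -O $min_overlap -e $max_error"

    # Samplesheet primers take precedence; the params are only a fallback.
    fw=${sheet_fw:-$param_fw}
    rv=${sheet_rv:-$param_rv}
    if [ -n "$fw" ]; then args="$args -g $fw"; fi
    if [ -n "$rv" ]; then args="$args -G $rv"; fi

    # Reads without a primer match are discarded unless retain_untrimmed is true.
    if [ "$retain" != "true" ]; then args="$args --discard-untrimmed"; fi

    printf '%s\n' "$args"
)

# Primers taken from the samplesheet, default settings otherwise.
build_cutadapt_args GTGCCAGCMGCCGCGGTAA GGACTACNVGGGTWTCTAAT "" "" false
# prints: --minimum-length 1 -O 3 -e 0.1 -g GTGCCAGCMGCCGCGGTAA -G GGACTACNVGGGTWTCTAAT --discard-untrimmed
```

When neither the samplesheet nor the params supply a primer, no `-g`/`-G` option is emitted at all, which may be worth guarding against in the workflow.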
@@ -0,0 +1,51 @@
+process CUTADAPT {
+    tag "$meta.id"
+    label 'process_medium'
+
+    conda "${moduleDir}/environment.yml"
+    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+        'https://depot.galaxyproject.org/singularity/cutadapt:4.6--py39hf95cd2a_1' :
+        'biocontainers/cutadapt:4.6--py39hf95cd2a_1' }"
+
+    input:
+    tuple val(meta), path(reads)
+
+    output:
+    tuple val(meta), path('*.trim.fastq.gz'), emit: reads
+    tuple val(meta), path('*.log')          , emit: log
+    path "versions.yml"                     , emit: versions
+
+    when:
+    task.ext.when == null || task.ext.when
+
+    script:
+    def args = task.ext.args ?: ''
+    def prefix = task.ext.prefix ?: "${meta.id}"
+    def trimmed  = meta.single_end ? "-o ${prefix}.trim.fastq.gz" : "-o ${prefix}_1.trim.fastq.gz -p ${prefix}_2.trim.fastq.gz"
+    """
+    cutadapt \\
+        -Z \\
+        --cores $task.cpus \\
+        $args \\
+        $trimmed \\
+        $reads \\
+        > ${prefix}.cutadapt.log
+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        cutadapt: \$(cutadapt --version)
+    END_VERSIONS
+    """
+
+    stub:
+    def prefix  = task.ext.prefix ?: "${meta.id}"
+    def trimmed = meta.single_end ? "${prefix}.trim.fastq.gz" : "${prefix}_1.trim.fastq.gz ${prefix}_2.trim.fastq.gz"
+    """
+    touch ${prefix}.cutadapt.log
+    touch ${trimmed}
+
+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        cutadapt: \$(cutadapt --version)
+    END_VERSIONS
+    """
+}
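Review note: combined with `ext.prefix = { "${meta.id}.trimmed" }` from the config, the `trimmed` variable in the process above yields output names that contain both `.trimmed` and `.trim`. The sketch below traces that naming in shell so the result can be compared with the `*.trim.fastq.gz` output and `publishDir` patterns. `cutadapt_output_opts` is a hypothetical helper, not part of the module.

```shell
# Hypothetical helper mirroring the `trimmed` variable of the CUTADAPT script block.
# Arguments: SAMPLE_ID SINGLE_END (true|false)
cutadapt_output_opts() (
    prefix="$1.trimmed"      # ext.prefix as set in the CUTADAPT config block
    if [ "$2" = "true" ]; then
        printf '%s\n' "-o ${prefix}.trim.fastq.gz"
    else
        printf '%s\n' "-o ${prefix}_1.trim.fastq.gz -p ${prefix}_2.trim.fastq.gz"
    fi
)

# Paired-end sample named SAMPLE, as in the README samplesheet example.
cutadapt_output_opts SAMPLE false
# prints: -o SAMPLE.trimmed_1.trim.fastq.gz -p SAMPLE.trimmed_2.trim.fastq.gz
```

The resulting names still match `*.trim.fastq.gz`, so the outputs are captured; the doubled suffix is cosmetic.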
@@ -0,0 +1,58 @@
+name: cutadapt
+description: Perform adapter/quality trimming on sequencing reads
+keywords:
+  - trimming
+  - adapter trimming
+  - adapters
+  - quality trimming
+tools:
+  - cutadapt:
+      description: |
+        Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.
+      documentation: https://cutadapt.readthedocs.io/en/stable/index.html
+      doi: 10.14806/ej.17.1.200
+      licence: ["MIT"]
+      identifier: biotools:cutadapt
+input:
+  - - meta:
+        type: map
+        description: |
+          Groovy Map containing sample information
+          e.g. [ id:'test', single_end:false ]
+    - reads:
+        type: file
+        description: |
+          List of input FastQ files of size 1 and 2 for single-end and paired-end data,
+          respectively.
+output:
+  - reads:
+      - meta:
+          type: map
+          description: |
+            Groovy Map containing sample information
+            e.g. [ id:'test', single_end:false ]
+      - "*.trim.fastq.gz":
+          type: file
+          description: The trimmed/modified fastq reads
+          pattern: "*fastq.gz"
+  - log:
+      - meta:
+          type: map
+          description: |
+            Groovy Map containing sample information
+            e.g. [ id:'test', single_end:false ]
+      - "*.log":
+          type: file
+          description: cutadapt log file
+          pattern: "*cutadapt.log"
+  - versions:
+      - versions.yml:
+          type: file
+          description: File containing software versions
+          pattern: "versions.yml"
+authors:
+  - "@drpatelh"
+  - "@kevinmenden"
+maintainers:
+  - "@drpatelh"
+  - "@kevinmenden"
@@ -22,8 +22,15 @@ params {
     keep_files                 = false
     output_unclassified        = true
 
+    // cutadapt
+    FW_primer                  = null
+    RV_primer                  = null
+    cutadapt_min_overlap       = 3
+    cutadapt_max_error_rate    = 0.1
+    retain_untrimmed           = false
+    skip_cutadapt              = false
+    save_intermediates         = false
 
-   //
    // porechop_abi
     adapter_trimming            = false
 

@@ -134,6 +134,50 @@
           "description": "minimum mean quality threshold"
         }
       }
+    },
+    "cutadapt_options": {
+      "title": "Cutadapt options",
+      "type": "object",
+      "description": "Options for cutadapt, which is used for removing primer sequences from short reads",
+      "default": "",
+      "properties": {
+        "FW_primer": {
+          "type": "string",
+          "description": "Forward primer"
+        },
+        "RV_primer": {
+          "type": "string",
+          "description": "Reverse primer"
+        },
+        "cutadapt_max_error_rate": {
+          "type": "number",
+          "default": 0.1,
+          "description": "Sets the maximum error rate for valid matches of primer sequences with reads for cutadapt (-e)."
+        },
+        "cutadapt_min_overlap": {
+          "type": "integer",
+          "default": 3,
+          "description": "Minimum overlap for valid matches of primer sequences with reads for cutadapt (-O)."
+        },
+        "retain_untrimmed": {
+          "type": "boolean",
+          "default": false,
+          "description": "Cutadapt will retain untrimmed reads. Choose only if input reads are not expected to contain primer sequences."
+        },
+        "save_intermediates": {
+          "type": "boolean",
+          "default": false,
+          "description": "Save trimmed files from cutadapt"
+        },
+        "skip_cutadapt": {
+          "type": "boolean",
+          "default": false,
+          "description": "Skip primer trimming with cutadapt"
+        }
+      }
     },
     "krona_options": {
       "title": "krona_options",

@@ -22,12 +22,14 @@ workflow INPUT_CHECK {
 
 // Function to get list of [ meta, [ fastq_1, fastq_2 ] ]
 def create_fastq_channel(LinkedHashMap row) {
-    // create meta map
+    // Create meta map
     def meta = [:]
-    meta.id         = row.sample
-    meta.single_end = row.single_end.toBoolean()
+    meta.id                 = row.sample
+    meta.single_end         = row.single_end.toBoolean()
+    meta.fw_primer          = row.FW_primer
+    meta.rv_primer          = row.RV_primer
 
-    // add path(s) of the fastq file(s) to the meta map
+    // Add path(s) of the fastq file(s) to the meta map
     def fastq_meta = []
     if (!file(row.fastq_1).exists()) {
         exit 1, "ERROR: Please check input samplesheet -> Read 1 FastQ file does not exist!\n${row.fastq_1}"