Merge pull request #564 from genomic-medicine-sweden/develop
chore: dev to master
monikaBrandt authored Jan 20, 2025
2 parents 14656e2 + 8dc9563 commit e1b04e9
Showing 5 changed files with 50 additions and 16 deletions.
6 changes: 3 additions & 3 deletions docs/dna_cnvs.md
@@ -170,7 +170,7 @@ CNV regions that overlap with clinically relevant genes for amplifications ([`cn
</table>

## CNV filtering
-Filtering of the CNV amplifications and deletions is performed by the [filtering hydra-genetics module](https://filtering.readthedocs.io/en/latest/).
+Filtering of the CNV amplifications and deletions is performed by the [filtering hydra-genetics module](https://hydra-genetics-filtering.readthedocs.io/en/latest/).

### Amplification filtering
Genes and filtering criteria specified in `config_hard_filter_cnv_amp.yaml` are listed below:
@@ -284,7 +284,7 @@ For more information, see the [hydra-genetics/reports documentation](https://hyd


## Germline vcf
-The germline vcf used by CNVkit, Jumble, and the CNV html report is based on the [VEP annotated vcf](dna_snv_indels.md#vep) file from the SNV and INDEL calling. Annotated vcfs are first hard filtered by removing blacklisted regions with noisy germline VAFs in normal samples, and then filtered by a number of criteria described below. See the [filtering hydra-genetics module](https://filtering.readthedocs.io/en/latest/) for additional information.
+The germline vcf used by CNVkit, Jumble, and the CNV html report is based on the [VEP annotated vcf](dna_snv_indels.md#vep) file from the SNV and INDEL calling. Annotated vcfs are first hard filtered by removing blacklisted regions with noisy germline VAFs in normal samples, and then filtered by a number of criteria described below. See the [filtering hydra-genetics module](https://hydra-genetics-filtering.readthedocs.io/en/latest/) for additional information.

### Exclude exonic regions
Use **[bcftools filter -T](https://samtools.github.io/bcftools/bcftools.html)** v1.15 to exclude variants overlapping blacklisted regions defined in a bed file.
@@ -295,7 +295,7 @@ Use **[bcftools filter -T](https://samtools.github.io/bcftools/bcftools.html)**
* [Bed file](references.md#bcftools_filter_exclude_region) with blacklisted regions
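
For orientation, a minimal command-line sketch of this exclusion step is shown below. It is not the pipeline's actual rule and the file names are placeholders; only the `-T ^regions.bed` exclusion syntax is the point being illustrated.

```bash
# Sketch only: exclude variants falling in blacklisted regions.
# The ^ prefix to -T/--targets-file means "exclude these regions" rather than "keep only these".
bcftools filter \
    -T ^blacklisted_regions.bed \
    -O z \
    -o sample_T.germline.exclude_blacklist.vcf.gz \
    sample_T.vep_annotated.vcf.gz

# Index the filtered vcf so downstream tools can use it.
tabix -p vcf sample_T.germline.exclude_blacklist.vcf.gz
```
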

### Filter vcf
-The germline vcf file is filtered using the **[hydra-genetics filtering](https://filtering.readthedocs.io/en/latest/)** functionality included in v0.15.0.
+The germline vcf file is filtered using the **[hydra-genetics filtering](https://hydra-genetics-filtering.readthedocs.io/en/latest/)** functionality included in v0.15.0.

### Configuration
The filters are specified in the config file `config_hard_filter_germline.yaml` and consist of the following filters:
6 changes: 3 additions & 3 deletions docs/dna_snv_indels.md
@@ -1,5 +1,5 @@
# SNV and INDEL calling, annotation and filtering
-See the [snv_indels hydra-genetics module](https://hydra-genetics-snv-indels.readthedocs.io/en/latest/) documentation for more details on the software used for variant calling, the [annotation hydra-genetics module](https://annotation.readthedocs.io/en/latest/) for annotation, and the [filtering hydra-genetics module](https://filtering.readthedocs.io/en/latest/) for filtering. Default hydra-genetics settings/resources are used if no configuration is specified.
+See the [snv_indels hydra-genetics module](https://hydra-genetics-snv-indels.readthedocs.io/en/latest/) documentation for more details on the software used for variant calling, the [annotation hydra-genetics module](https://hydra-genetics-annotation.readthedocs.io/en/latest/) for annotation, and the [filtering hydra-genetics module](https://hydra-genetics-filtering.readthedocs.io/en/latest/) for filtering. Default hydra-genetics settings/resources are used if no configuration is specified.

<br />
![dag plot](images/snv.png)
@@ -87,7 +87,7 @@ Variant vcf files from the two callers are ensembled into one vcf file using **[
| sort_order | --names vardict, gatk_mutect2 | priority order for retaining variant information |

## Annotation
-The ensembled vcf file is first annotated using VEP, followed by artifact annotation and background annotation. See the [annotation hydra-genetics module](https://annotation.readthedocs.io/en/latest/) for additional information.
+The ensembled vcf file is first annotated using VEP, followed by artifact annotation and background annotation. See the [annotation hydra-genetics module](https://hydra-genetics-annotation.readthedocs.io/en/latest/) for additional information.

### VEP
The ensembled vcf file is annotated using **[VEP](https://www.ensembl.org/info/docs/tools/vep/index.html)** v105. VEP adds a plethora of information for each variant, as specified by the configuration flags listed below. Of note are --pick, which picks only one representative transcript for each variant, --af_gnomad, which adds gnomAD germline allele frequencies, and --cache, which uses a local copy of the databases for better performance. See [VEP options](https://www.ensembl.org/info/docs/tools/vep/script/vep_options.html) for more information.
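
As a rough illustration of the flags discussed above, a hand-written VEP call could look like the sketch below. This is not the exact command generated by the pipeline; the input, output, cache, and reference paths are placeholders, and flags other than --pick, --af_gnomad, and --cache are included only to make the example self-contained.

```bash
# Sketch of a VEP v105 invocation using the flags highlighted above.
vep \
    --input_file sample_T.ensembled.vcf.gz \
    --output_file sample_T.vep_annotated.vcf.gz \
    --vcf \
    --compress_output bgzip \
    --cache \
    --offline \
    --dir_cache /path/to/vep_cache \
    --fasta /path/to/reference.fasta \
    --pick \
    --af_gnomad
```

Running with --cache and --offline keeps all lookups local, which is what gives the better performance mentioned above.
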
@@ -161,7 +161,7 @@ Example annotation for one variant added to a vcf file in the INFO field:
* [Panel of Normal](references.md#background_db) with position specific background information

## Filtering
-Annotated vcfs are first hard filtered by removing regions outside exons, and then filtered by a number of criteria described below. See the [filtering hydra-genetics module](https://filtering.readthedocs.io/en/latest/) for additional information. A soft-filtered version of the exonic regions is also provided for development and other investigations.
+Annotated vcfs are first hard filtered by removing regions outside exons, and then filtered by a number of criteria described below. See the [filtering hydra-genetics module](https://hydra-genetics-filtering.readthedocs.io/en/latest/) for additional information. A soft-filtered version of the exonic regions is also provided for development and other investigations.

### Extract exonic regions
Use **[bcftools filter -R](https://samtools.github.io/bcftools/bcftools.html)** v1.15 to extract variants overlapping exonic regions (including 20 bp padding) defined in a bed file, which is a subset of the general design bed file.
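
A minimal sketch of this extraction is shown below; the bed file and vcf names are placeholders, as the real paths come from the pipeline configuration.

```bash
# Keep only variants overlapping the padded exonic regions listed in the bed file.
# -R/--regions-file requires the input vcf to be bgzipped and indexed (.tbi).
bcftools filter \
    -R design_exons_pad20bp.bed \
    -O z \
    -o sample_T.annotated.exonic.vcf.gz \
    sample_T.annotated.vcf.gz

tabix -p vcf sample_T.annotated.exonic.vcf.gz
```
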
4 changes: 2 additions & 2 deletions docs/running.md
@@ -84,7 +84,7 @@ PROJECT_REF_DATA: "PATH_TO/design_and_ref_files" # parent folder for ref_data, e
```
## Input sample files
-The pipeline uses sample input files (`samples.tsv` and `units.tsv`) containing sample information, sequencing meta information, and the location of the fastq files. Specifications for the input files can be found at [Twist Solid schemas](https://github.com/genomic-medicine-sweden/Twist_Solid/blob/develop/workflow/schemas/). Using the python virtual environment created above, these files can be generated automatically with [hydra-genetics create-input-files](https://hydra-genetics.readthedocs.io/en/latest/create_sample_files/):
+The pipeline uses sample input files (`samples.tsv` and `units.tsv`) containing sample information, sequencing meta information, and the location of the fastq files. Specifications for the input files can be found at [Twist Solid schemas](https://github.com/genomic-medicine-sweden/Twist_Solid/blob/develop/workflow/schemas/). Using the python virtual environment created above, these files can be generated automatically with [hydra-genetics create-input-files](https://hydra-genetics.readthedocs.io/en/latest/run_pipeline/create_sample_files/):
```bash
hydra-genetics create-input-files -d path/to/fastq-files/
```
@@ -95,7 +95,7 @@ Using the activated python virtual environment created above, this is a basic co
snakemake --profile profiles/NAME_OF_PROFILE -s workflow/Snakefile
```
<br />
-There are many additional [snakemake running options](https://snakemake.readthedocs.io/en/stable/executing/cli.html#), some of which are listed below and illustrated in the example after the list. However, options that are always used should be put in the [profile](https://hydra-genetics.readthedocs.io/en/latest/profile/).
+There are many additional [snakemake running options](https://snakemake.readthedocs.io/en/stable/executing/cli.html#), some of which are listed below and illustrated in the example after the list. However, options that are always used should be put in the [profile](https://hydra-genetics.readthedocs.io/en/latest/run_pipeline/profile/).

* --notemp - Saves all intermediate files. Good for development and testing different options.
* --until <rule> - Runs the workflow only up to and including the specified rule.
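
As an example of combining these flags with the profile, the command below keeps intermediate files and stops after the `annotation_vep_wo_pick` rule (one of the rules defined in this pipeline's Snakefile); any other rule name can be substituted.

```bash
# Example invocation: keep intermediates and stop after a chosen rule.
snakemake --profile profiles/NAME_OF_PROFILE -s workflow/Snakefile \
    --notemp \
    --until annotation_vep_wo_pick
```
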
10 changes: 5 additions & 5 deletions docs/setup.md
@@ -8,17 +8,17 @@ There are a number of main files that governs how the pipeline is executed liste
* profile/uppsala/config.yaml
* samples.tsv and units.tsv

-There is more general information about the content of these files in the hydra-genetics documentation under [code standards](https://hydra-genetics.readthedocs.io/en/latest/standards/), [config](https://hydra-genetics.readthedocs.io/en/latest/config/) and [Snakefile](https://hydra-genetics.readthedocs.io/en/latest/import/).
+There is more general information about the content of these files in the hydra-genetics documentation under [code standards](https://hydra-genetics.readthedocs.io/en/latest/development/standards/), [config](https://hydra-genetics.readthedocs.io/en/latest/make_pipeline/config/) and [Snakefile](https://hydra-genetics.readthedocs.io/en/latest/make_pipeline/import/).

## Snakefile
The `Snakefile` is located in workflow/ and imports hydra-genetics modules and rules, modifying these rules when needed. It also imports pipeline-specific rules and defines rule order. Finally, this is where the `all` rule is defined.

## common.smk
-The `common.smk` file is located under workflow/rules/. It is a general rule file taking care of any actions that are not directly connected to running a specific program: version checks, import of config, resources and tsv-files, and validation using schemas. Functions used by pipeline-specific rules are also defined here, as well as the output files, via the function **compile_output_list**, which programmatically generates a list of all necessary output files to be targeted by the `all` rule defined in the `Snakefile`. See further [Result files](https://hydra-genetics.readthedocs.io/en/latest/results/).
+The `common.smk` file is located under workflow/rules/. It is a general rule file taking care of any actions that are not directly connected to running a specific program: version checks, import of config, resources and tsv-files, and validation using schemas. Functions used by pipeline-specific rules are also defined here, as well as the output files, via the function **compile_output_list**, which programmatically generates a list of all necessary output files to be targeted by the `all` rule defined in the `Snakefile`. See further [Result files](https://hydra-genetics.readthedocs.io/en/latest/make_pipeline/results/).

## config.yaml
The `config.yaml` is located under config/. The file ties all file and other dependencies as well as parameters for different rules together.
-See further [pipeline configuration](https://hydra-genetics.readthedocs.io/en/latest/config/).
+See further [pipeline configuration](https://hydra-genetics.readthedocs.io/en/latest/make_pipeline/config/).

<br />

@@ -30,7 +30,7 @@ See further [pipeline configuration](https://hydra-genetics.readthedocs.io/en/la


## resources.yaml
-The `resources.yaml` is located under config/. The file declares default resources used by rules as well as resources for specific rules that need more resources than allocated by default. See further [pipeline configuration](https://hydra-genetics.readthedocs.io/en/latest/config/).
+The `resources.yaml` is located under config/. The file declares default resources used by rules as well as resources for specific rules that need more resources than allocated by default. See further [pipeline configuration](https://hydra-genetics.readthedocs.io/en/latest/make_pipeline/config/).

```yaml
# ex, default resources
@@ -78,7 +78,7 @@ default-resources: [threads=1, time="04:00:00", partition="low", mem_mb="3074",
```
## samples.tsv and units.tsv
-The `samples.tsv` and `units.tsv` are input files that must be generated before running the pipeline and should in general be located in the base of the analysis folder; this location can be changed in the config.yaml. See further [running the pipeline](running.md) and [create input files](https://hydra-genetics.readthedocs.io/en/latest/create_sample_files/).
+The `samples.tsv` and `units.tsv` are input files that must be generated before running the pipeline and should in general be located in the base of the analysis folder; this location can be changed in the config.yaml. See further [running the pipeline](running.md) and [create input files](https://hydra-genetics.readthedocs.io/en/latest/run_pipeline/create_sample_files/).

### Example samples.tsv

40 changes: 37 additions & 3 deletions workflow/Snakefile
@@ -53,7 +53,12 @@ use rule bcftools_id_snps as bcftools_id_snps_dna with:

module prealignment:
snakefile:
get_module_snakefile(config, "hydra-genetics/prealignment", path="workflow/Snakefile", tag="v1.0.0")
get_module_snakefile(
config,
"hydra-genetics/prealignment",
path="workflow/Snakefile",
tag="v1.0.0",
)
config:
config

@@ -266,7 +271,10 @@ use rule vep from annotation as annotation_vep_wo_pick with:
log:
"{file}.vep_annotated_wo_pick.vcf.log",
benchmark:
repeat("{file}.vep_annotated_wo_pick.vcf.benchmark.tsv", config.get("vep_wo_pick", {}).get("benchmark_repeats", 1))
repeat(
"{file}.vep_annotated_wo_pick.vcf.benchmark.tsv",
config.get("vep_wo_pick", {}).get("benchmark_repeats", 1),
)


use rule bcftools_annotate from annotation as annotation_bcftools_annotate_purecn with:
@@ -321,7 +329,7 @@ use rule * from qc exclude all as qc_*
use rule multiqc from qc as qc_multiqc with:
output:
html=temp("qc/multiqc/multiqc_{report}.html"),
data=temp(directory("qc/multiqc/multiqc_{report}_data")),
data=directory("qc/multiqc/multiqc_{report}_data"),
data_json="qc/multiqc/multiqc_{report}_data/multiqc_data.json",


@@ -676,6 +684,32 @@ use rule purecn from cnv_sv as cnv_sv_purecn with:
unpack(cnv_sv.get_purecn_inputs),
vcf="cnv_sv/purecn_modify_vcf/{sample}_{type}.normalized.sorted.vep_annotated.filter.snv_hard_filter_purecn.bcftools_annotated_purecn.mbq.vcf.gz",
tbi="cnv_sv/purecn_modify_vcf/{sample}_{type}.normalized.sorted.vep_annotated.filter.snv_hard_filter_purecn.bcftools_annotated_purecn.mbq.vcf.gz.tbi",
output:
csv="cnv_sv/purecn/temp/{sample}_{type}/{sample}_{type}.csv",
outdir=directory("cnv_sv/purecn/temp/{sample}_{type}/"),


use rule purecn_coverage from cnv_sv as cnv_sv_purecn_coverage with:
output:
purecn=expand(
"cnv_sv/purecn_coverage/{{sample}}_{{type}}{ext}",
ext=[
"_coverage.txt.gz",
"_coverage_loess.txt.gz",
"_coverage_loess.png",
"_coverage_loess_qc.txt",
],
),


use rule purecn_copy_output from cnv_sv as cnv_sv_purecn_copy_output with:
output:
files="cnv_sv/purecn/{sample}_{type}{suffix}",


use rule purecn_purity_file from cnv_sv as cnv_sv_purecn_purity_file with:
output:
purity="cnv_sv/purecn_purity_file/{sample}_{type}.purity.txt",


module reports:
