
Extract 'Alignment' and 'Genotype' to be separate independent pipelines #783

Open · wants to merge 74 commits into main
Conversation

@michael-harper (Contributor) commented Jun 5, 2024

See the context here in the background of the scoping document

Summary
This pull request proposes separating the alignment and genotyping stages into two distinct workflows within the production pipelines. These changes aim to enhance flexibility and control, and to bring future changes into parity with industry best practices, particularly those established by Illumina's DRAGEN hardware. Additionally, once version control is implemented, it will be easier to apply with alignment and genotyping separated from downstream workflows.

Proposed Changes

  1. Separation of Pipelines:
  • Alignment and Genotyping: Split into standalone pipelines to prevent versioning conflicts and ensure consistent inputs.
  • Modularity: Allows independent updates and optimisations for each workflow without mutual disruption.
  2. Pipeline Starting Points:
  • Detection of CRAM and gVCF Files: Adjust pipelines to emit clear and informative error messages for missing data, and allow manual triggering of the genotyping pipeline as needed.
  • Resource Dependencies: Replace stage dependencies with resource dependencies, requiring a CPG-processed CRAM in Metamist before running the genotyping pipeline, and either FASTQ or CRAM files registered in Metamist before running the alignment pipeline.
  3. Repository Structure:
  • Current Limitations: The existing structure is not conducive to navigating and understanding independent pipelines.
  • Proposed Structure: Consider separate repositories for the production-pipelines API and the actual pipelines, with clear folder structures for individual pipelines and shared resources. This PR implements an interim folder structure for the alignment and genotyping pipelines.
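The resource-dependency idea above can be sketched as follows. This is a hedged illustration only: the function name and resource keys are made up for the example and are not the actual production-pipelines or Metamist API.

```python
# Hedged sketch: choose a pipeline entry point from the resources registered
# in Metamist, instead of relying on upstream stage dependencies.
# (Function name and resource keys are illustrative, not the real API.)
def resolve_entry_point(resources: dict) -> str:
    if resources.get('cram'):
        # A CPG-processed CRAM exists, so the genotyping pipeline can start.
        return 'genotype'
    if resources.get('fastq'):
        # Only raw reads exist, so the alignment pipeline must run first.
        return 'align'
    raise ValueError(
        'No CRAM or FASTQ registered in Metamist for this sequencing group; '
        'register inputs before triggering a pipeline.'
    )
```

The key design change is that the error is raised up front, at resolution time, rather than the framework silently scheduling upstream stages.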

Considerations

  • Integration with Custom Cohorts: Ensure that updates to cohorts with additional samples or sequencing groups are managed without disrupting the pipeline.
  • User Responsibilities: Users must ensure that samples in custom cohorts have the required gVCF files before running the pipeline.
  • Further Discussion: Topics such as repository restructuring and shared resource versioning require further discussion.
  • Version control: Yet to be defined, but this separation should support future efforts in this domain.
  • Breaking continuity: Production-pipelines is designed to automatically trigger stages when inputs do not exist. Extracting alignment and genotyping pipelines from downstream workflows breaks this continuity. Users will need to manually trigger alignment and/or genotyping pipelines to ensure that all sequencing groups have the correct input for downstream analysis. This break in continuity, although contrary to the design logic of production-pipelines, ensures the separation of pipelines for future version control efforts and prevents erroneous pipeline runs without the correct inputs for all samples.
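The fail-fast input check implied by the user-responsibility point can be sketched like this. The data shapes below are hypothetical stand-ins for the real SequencingGroup objects, and the function is modelled loosely on the existing `_check_gvcfs` helper:

```python
# Minimal sketch (hypothetical data shapes): raise an actionable error when
# a sequencing group lacks its gVCF, rather than silently triggering
# upstream stages that may not exist in the separated pipelines.
def check_gvcfs(sequencing_groups: list[dict]) -> list[dict]:
    missing = [sg['id'] for sg in sequencing_groups if not sg.get('gvcf')]
    if missing:
        raise ValueError(
            f"Sequencing groups missing gVCFs: {', '.join(missing)}. "
            'Run the genotyping pipeline first, or exclude them via workflow/skip_sgs.'
        )
    return sequencing_groups
```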

…. Also removing the required_stage of align from Genotype stage
…under the cpg_workflows/stages directory. Also moving cram_qc.py and gvcf_qc.py into cpg_workflows/stages/alignment and cpg_workflows/stages/genotype respectively to be more in line with these new pipelines 'owning' these stages
… import conflicts and changing import in gvcf_qc.py to be a relative import so that it can successfully import the class Genotype
…r a sequencing group we can specify in the config file whether we want to realign from cram or fastq
michael-harper and others added 23 commits June 24, 2024 14:38
…ram file by referencing sequencing_group.cram directly and including an assertion
…ignment pipeline. Performing existence check of verifybamID from CramQC stage and logging to the user to run the alignment pipeline to produce output if it doesn't exist
…fusion. Simplification of align.py (job) and removal of Bazam tool is now in a separate branch and PR
…enotype pipelines however can still use the stages at a later date. For example, when we create custom cohorts and do QC, we would want to be able to run MultiQC to generate a summary report across all the individual CRAM QC metrics for the samples in that custom cohort
… cram and gvcf paths to the sequencing group in the test
@michael-harper michael-harper marked this pull request as ready for review June 26, 2024 05:23
@@ -26,7 +26,8 @@ def _check_gvcfs(sequencing_groups: list[SequencingGroup]) -> list[SequencingGro
f'Sequencing group {sequencing_group} is missing GVCF. '
f'Use workflow/skip_sgs = [] or '
Contributor
I think this argument will be disappearing soon following the implementation of custom cohorts, @vivbak ?

Comment on lines +24 to +27
@stage(
analysis_type='cram',
analysis_keys=['cram'],
)
Contributor

Suggested change
@stage(
analysis_type='cram',
analysis_keys=['cram'],
)
@stage(analysis_type='cram', analysis_keys=['cram'])

Comment on lines +6 to +8
from enum import Enum
from logging import config
from pickle import MARK
@MattWellie (Contributor) Jul 3, 2024

lots of unused imports here, not sure how this passed linting checks

stage,
)

from .. import get_batch
from .align import Align
from ... import get_batch
Contributor
not a fan of this at all. We're installing cpg_workflows as a package in the image we intend to run. Relative imports are garbage.

This is also the wrong import, we're using cpg_utils.hail_batch.get_batch

Suggested change
from ... import get_batch
from cpg_utils.hail_batch import get_batch

@@ -24,6 +23,8 @@
stage,
)

from .genotype import Genotype
Contributor
BLEH RELATIVE IMPORTS

@MattWellie (Contributor) commented Jul 4, 2024

I think it's worth having a conversation about how to segregate this properly - i.e. you've sub-divided these stages into cpg_workflows.stages.XXX, which feels like a very partial solution. That puts us in a weird position where we're trying to separately version a few files inside cpg_workflows, which is itself versioned at the top level.

There's a broader issue of how this pipeline is run - we start the driver image by doing a git clone of production-pipelines into the VM, then we run main.py. That's fine. But when we import cpg_workflows.XXX it imports from the relative path cpg_workflows.XXX that we just cloned into, not the installed version of the codebase (because AFAIK local filepaths are traversed first when Python locates modules). That leads to 'fun' issues where the version of the code being run is based on the current commit, up until we run a python job or a query_command - those are executed in a new VM without the git clone, so they read from the installed version of the codebase (potentially a different version of the codebase). I don't think that should really be addressed by this change, but we should resolve this ambiguity.
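The path-shadowing behaviour described here can be demonstrated in isolation. The package name below is invented for the demo; it stands in for the cloned checkout shadowing the installed cpg_workflows:

```python
import pathlib
import sys
import tempfile

# Demonstrate that a package directory earlier on sys.path shadows an
# installed package of the same name — analogous to the git-cloned
# production-pipelines checkout shadowing the installed cpg_workflows.
tmp = pathlib.Path(tempfile.mkdtemp())
pkg = tmp / 'cpg_workflows_demo'  # stand-in name, not the real package
pkg.mkdir()
(pkg / '__init__.py').write_text("SOURCE = 'local clone'\n")

sys.path.insert(0, str(tmp))  # like running main.py from the repo root
import cpg_workflows_demo

print(cpg_workflows_demo.SOURCE)  # prints 'local clone': the local copy wins
```

Python resolves imports by walking sys.path in order, so whichever copy appears first is the one that runs; a VM without the clone falls through to site-packages instead.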

A blue sky thinking approach to this division:

- src
  - alignment_workflow
    - stages
      - align
      - cram_qc
    - version.py
  - genotype_workflow
    - stages
      - genotype
      - gvcf_qc
    - version.py
  - gatk_sv_workflow
    - stages
    - version.py
  - large_cohort_workflow
    - stages
    - version.py
  - rare_disease_workflow
    - stages
    - version.py
  - ...? other pipelines we want to version separately
  - cpg_workflows (the library which represents the abstract pipeline - interactions with metamist, setting up a workflow graph, etc.)
- main.py (imports from all the relevant INSTALLED pipelines and builds workflows from them)
- tests
- pyproject.toml
  • This would enable us to ensure we're only using the installed version of the code (from cpg_workflows.stages.XXX import YYY is guaranteed to use the installed code, as there's no local path ./cpg_workflows)
  • This would enable us to split out the pipeline framework from the stages that rely on it
  • This would enable us to version each coherent pipeline separately
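A toy sketch of the per-pipeline versioning this layout would allow. Every name here is hypothetical, not the real production-pipelines API; the point is only that the driver could record exactly which installed pipeline versions a run used:

```python
from dataclasses import dataclass, field

# Toy sketch: each separately installed workflow package exposes its own
# version, and the driver records what is actually running.
# All names are hypothetical, not the real production-pipelines API.
@dataclass
class WorkflowPackage:
    name: str
    version: str
    stages: list[str] = field(default_factory=list)

def build_run_manifest(workflows: list[WorkflowPackage]) -> dict:
    # An unambiguous record of which pipeline versions this run used,
    # resolving the "what code are we running" ambiguity.
    return {w.name: {'version': w.version, 'stages': w.stages} for w in workflows}

alignment = WorkflowPackage('alignment_workflow', '1.2.0', ['Align', 'CramQC'])
genotype = WorkflowPackage('genotype_workflow', '0.9.1', ['Genotype', 'GvcfQC'])
manifest = build_run_manifest([alignment, genotype])
```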

Undoubtedly this is more effort: it would involve reorganising the production-pipelines codebase and building a new version of the cpg_workflows image for each test run in GCP.

However, it would resolve a fundamental ambiguity in our production pipelines workflows - when we run our pipelines, we don't truly know which version of the code we're running. That is a very stupid problem to have, when this whole effort is around versioning our pipelines more specifically. We ARE using cloud compute, so we ARE using containerised pipelines. With our current structure, we're using some of the code in the container, and some pulled in from github at runtime.

This is kinda twinned with #647, a pet peeve of mine - we're aware, and kind of.. OK? with not really knowing exactly what code is going to run in our main pipeline.

n.b. this isn't a proposal so much as it is a kind of... rant. What we have is not good, I don't think this PR solves that (it covers physical separation of the stages, but doesn't address things we should probably address first, like "what code are we running")

^^ I actually don't know if any of this is feasible

@MattWellie MattWellie mentioned this pull request Jul 4, 2024
@MattWellie (Contributor)
As a side note, this is a huge leap considering that AFAIK the RFC/scoping doc here has not been reviewed by anyone in software, and there are a number of discussion points identifying that @vivbak and other members of data/software should be consulted before going ahead with implementation.
