Extract 'Alignment' and 'Genotype' to be separate independent pipelines #783
base: main
…. Also removing the required_stage of align from Genotype stage
…under the cpg_workflows/stages directory. Also moving cram_qc.py and gvcf_qc.py into cpg_workflows/stages/alignment and cpg_workflows/stages/genotype respectively to be more in line with these new pipelines 'owning' these stages
… import conflicts and changing import in gvcf_qc.py to be a relative import so that it can successfully import the class Genotype
…e mark duplicates tool (Picard)
…h job submission in Hail Batch
…fault genotype.toml file
…r a sequencing group we can specify in the config file whether we want to realign from cram or fastq
…ram file by referencing sequencing_group.cram directly and including an assertion
…ignment pipeline. Performing existence check of verifybamID from CramQC stage and logging to the user to run the alignment pipeline to produce output if it doesn't exist
…fusion. Simplification of align.py (job) and removal of Bazam tool is now in a separate branch and PR
…enotype pipelines however can still use the stages at a later date. For example, when we create custom cohorts and do QC, we would want to be able to run MultiQC to generate a summary report across all the individual CRAM QC metrics for the samples in that custom cohort
… cram and gvcf paths to the sequencing group in the test
…ongenomics/production-pipelines into separate-align-genotype
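The commits above mention choosing per sequencing group whether to realign from CRAM or FASTQ, asserting that the CRAM is present, and logging a hint when a CramQC output is missing. A minimal sketch of that logic follows. Everything in it is an assumption for illustration: `SequencingGroupSketch`, `pick_realign_source`, `qc_output_exists`, and the `realign_from` value are hypothetical names, not the actual cpg_workflows classes or config keys, and local filesystem paths stand in for whatever storage the pipeline really uses.

```python
import logging
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional, Union

logger = logging.getLogger(__name__)


@dataclass
class SequencingGroupSketch:
    """Hypothetical stand-in for a sequencing group; not the real class."""

    id: str
    cram: Optional[str] = None
    fastq: Optional[List[str]] = None


def pick_realign_source(
    sg: SequencingGroupSketch,
    realign_from: str,
) -> Union[str, List[str]]:
    """Return the input to realign from, based on a config value.

    `realign_from` is assumed to be read from the config file and to be
    either 'cram' or 'fastq'.
    """
    if realign_from == 'cram':
        # Reference the CRAM directly and fail early if it is absent.
        assert sg.cram, f'{sg.id}: realign_from=cram but no CRAM is registered'
        return sg.cram
    if realign_from == 'fastq':
        assert sg.fastq, f'{sg.id}: realign_from=fastq but no FASTQ is registered'
        return sg.fastq
    raise ValueError(f'Unknown realign_from value: {realign_from!r}')


def qc_output_exists(path: str) -> bool:
    """Check that an upstream QC output exists; log a hint if it does not."""
    if Path(path).exists():
        return True
    logger.warning(
        '%s not found; run the alignment pipeline first to produce it', path
    )
    return False
```

A caller in the genotype pipeline could use `qc_output_exists` to skip a sequencing group with a clear message instead of failing later inside a job.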
@@ -26,7 +26,8 @@ def _check_gvcfs(sequencing_groups: list[SequencingGroup]) -> list[SequencingGro
f'Sequencing group {sequencing_group} is missing GVCF. '
f'Use workflow/skip_sgs = [] or '
I think this argument will be disappearing soon following the implementation of custom cohorts, @vivbak ?
@stage(
analysis_type='cram',
analysis_keys=['cram'],
)
@stage(
analysis_type='cram',
analysis_keys=['cram'],
)
@stage(analysis_type='cram', analysis_keys=['cram'])
from enum import Enum
from logging import config
from pickle import MARK
lots of unused imports here, not sure how this passed linting checks
stage,
)

from .. import get_batch
from .align import Align
from ... import get_batch
not a fan of this at all. We're installing cpg_workflows as a package in the image we intend to run. Relative imports are garbage.
This is also the wrong import, we're using cpg_utils.hail_batch.get_batch
from ... import get_batch
from cpg_utils.hail_batch import get_batch
@@ -24,6 +23,8 @@
stage,
)

from .genotype import Genotype
BLEH RELATIVE IMPORTS
I think it's worth having a conversation about how to segregate this properly - i.e. you've sub-divided these stages into There's a broader issue of how this pipeline is run - we start the driver image by doing a git clone of A blue sky thinking approach to this division:
Undoubtedly this is more effort, it would involve reorganising the production pipelines codebase and having to build a new version of the cpg_workflows image for each test run in GCP. However, it would resolve a fundamental ambiguity in our production pipelines workflows - This is kinda twinned with #647, a pet peeve of mine - we're aware, and kind of.. OK? with not really knowing exactly what code is going to run in our main pipeline. n.b. this isn't a proposal so much as it is a kind of... rant. What we have is not good, I don't think this PR solves that (it covers physical separation of the stages, but doesn't address things we should probably address first, like "what code are we running") ^^ I actually don't know if any of this is feasible
See the context here in the background of the scoping document
Summary
This pull request proposes separating the alignment and genotyping stages into two distinct workflows within the production pipelines. These changes aim to improve flexibility and control, and to let future changes keep parity with industry best practices, particularly those established by Illumina's DRAGEN hardware. Additionally, version control, once introduced, will be easier to implement with alignment and genotyping separated from downstream workflows.
Proposed Changes
Considerations