
Extract 'Alignment' and 'Genotype' to be separate independent pipelines #783

Open · wants to merge 74 commits into main
Conversation

@michael-harper (Contributor) commented Jun 5, 2024

See the context here in the background of the scoping document

Summary
This pull request proposes separating the alignment and genotyping stages into two distinct workflows within the production pipelines. These changes aim to enhance flexibility and control, and to bring future changes into parity with industry best practices, particularly those established by Illumina's DRAGEN hardware. Additionally, once version control is implemented, it will be easier to apply with alignment and genotyping separated from downstream workflows.

Proposed Changes

  1. Separation of Pipelines:
  • Alignment and Genotyping: Split into standalone pipelines to prevent versioning conflicts and ensure consistent inputs.
  • Modularity: Allows independent updates and optimisations for each workflow without mutual disruption.
  2. Pipeline Starting Points:
  • Detection of CRAM and gVCF Files: Adjust pipelines to emit clear and informative error messages for missing data, and allow manual triggering of the genotyping pipeline as needed.
  • Resource Dependencies: Replace stage dependencies with resource dependencies, requiring a CPG-processed CRAM in Metamist before running the genotyping pipeline, and either FASTQ or CRAM files registered in Metamist before running the alignment pipeline.
  3. Repository Structure:
  • Current Limitations: The existing structure is not conducive to navigating and understanding independent pipelines.
  • Proposed Structure: Consider separate repositories for the production-pipelines API and the actual pipelines, with clear folder structures for individual pipelines and shared resources. This PR implements an interim folder structure for the alignment and genotyping pipelines.
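The resource-dependency idea above can be sketched as follows. This is a hedged illustration only: the function name and resource keys are made up for the example and are not the actual production-pipelines or Metamist API.

```python
# Hedged sketch: choose a pipeline entry point from the resources registered
# in Metamist, instead of relying on upstream stage dependencies.
# (Function name and resource keys are illustrative, not the real API.)
def resolve_entry_point(resources: dict) -> str:
    if resources.get('cram'):
        # A CPG-processed CRAM exists, so the genotyping pipeline can start.
        return 'genotype'
    if resources.get('fastq'):
        # Only raw reads exist, so the alignment pipeline must run first.
        return 'align'
    raise ValueError(
        'No CRAM or FASTQ registered in Metamist for this sequencing group; '
        'register inputs before triggering a pipeline.'
    )
```

The key design change is that the error is raised up front, at resolution time, rather than the framework silently scheduling upstream stages.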

Considerations

  • Integration with Custom Cohorts: Ensure that updates to cohorts with additional samples or sequencing groups are managed without disrupting the pipeline.
  • User Responsibilities: Users must ensure that samples in custom cohorts have the required gVCF files before running the pipeline.
  • Further Discussion: Topics such as repository restructuring and shared resource versioning require further discussion.
  • Version control: Yet to be defined, but this separation should support future efforts in this domain.
  • Breaking continuity: Production-pipelines is designed to automatically trigger stages when inputs do not exist. Extracting alignment and genotyping pipelines from downstream workflows breaks this continuity. Users will need to manually trigger alignment and/or genotyping pipelines to ensure that all sequencing groups have the correct input for downstream analysis. This break in continuity, although contrary to the design logic of production-pipelines, ensures the separation of pipelines for future version control efforts and prevents erroneous pipeline runs without the correct inputs for all samples.
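The fail-fast input check implied by the user-responsibility point can be sketched like this. The data shapes below are hypothetical stand-ins for the real SequencingGroup objects, and the function is modelled loosely on the existing `_check_gvcfs` helper:

```python
# Minimal sketch (hypothetical data shapes): raise an actionable error when
# a sequencing group lacks its gVCF, rather than silently triggering
# upstream stages that may not exist in the separated pipelines.
def check_gvcfs(sequencing_groups: list[dict]) -> list[dict]:
    missing = [sg['id'] for sg in sequencing_groups if not sg.get('gvcf')]
    if missing:
        raise ValueError(
            f"Sequencing groups missing gVCFs: {', '.join(missing)}. "
            'Run the genotyping pipeline first, or exclude them via workflow/skip_sgs.'
        )
    return sequencing_groups
```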

…. Also removing the required_stage of align from Genotype stage
…under the cpg_workflows/stages directory. Also moving cram_qc.py and gvcf_qc.py into cpg_workflows/stages/alignment and cpg_workflows/stages/genotype respectively to be more in line with these new pipelines 'owning' these stages
… import conflicts and changing import in gvcf_qc.py to be a relative import so that it can successfully import the class Genotype
…r a sequencing group we can specify in the config file whether we want to realign from cram or fastq
michael-harper and others added 23 commits June 24, 2024 14:38
…ram file by referencing sequencing_group.cram directly and including an assertion
…ignment pipeline. Performing existence check of verifybamID from CramQC stage and logging to the user to run the alignment pipeline to produce output if it doesn't exist
…fusion. Simplification of align.py (job) and removal of Bazam tool is now in a separate branch and PR
…enotype pipelines however can still use the stages at a later date. For example, when we create custom cohorts and do QC, we would want to be able to run MultiQC to generate a summary report across all the individual CRAM QC metrics for the samples in that custom cohort
… cram and gvcf paths to the sequencing group in the test
@michael-harper michael-harper marked this pull request as ready for review June 26, 2024 05:23
@@ -26,7 +26,8 @@ def _check_gvcfs(sequencing_groups: list[SequencingGroup]) -> list[SequencingGro
f'Sequencing group {sequencing_group} is missing GVCF. '
f'Use workflow/skip_sgs = [] or '
Contributor
I think this argument will be disappearing soon following the implementation of custom cohorts, @vivbak ?

Comment on lines +24 to +27
@stage(
analysis_type='cram',
analysis_keys=['cram'],
)
Contributor

Suggested change
@stage(
analysis_type='cram',
analysis_keys=['cram'],
)
@stage(analysis_type='cram', analysis_keys=['cram'])

Comment on lines +6 to +8
from enum import Enum
from logging import config
from pickle import MARK
@MattWellie (Contributor) Jul 3, 2024

lots of unused imports here, not sure how this passed linting checks

stage,
)

from .. import get_batch
from .align import Align
from ... import get_batch
Contributor
not a fan of this at all. We're installing cpg_workflows as a package in the image we intend to run. Relative imports are garbage.

This is also the wrong import, we're using cpg_utils.hail_batch.get_batch

Suggested change
from ... import get_batch
from cpg_utils.hail_batch import get_batch

@@ -24,6 +23,8 @@
stage,
)

from .genotype import Genotype
Contributor
BLEH RELATIVE IMPORTS

@MattWellie (Contributor) commented Jul 4, 2024

I think it's worth having a conversation about how to segregate this properly - i.e. you've sub-divided these stages into cpg_workflows.stages.XXX, which feels like a very partial solution. That puts us in a weird position where we're trying to separately version a few files inside cpg_workflows, which is itself versioned at the top level.

There's a broader issue of how this pipeline is run - we start the driver image by doing a git clone of production-pipelines into the VM, then we run main.py. That's fine. But when we import cpg_workflows.XXX it imports from the relative path cpg_workflows.XXX that we just cloned into, not the installed version of the codebase (because AFAIK local filepaths are traversed first when Python locates modules). That leads to 'fun' issues where the version of the code being run is based on the current commit, up until we run a python job or a query_command - those are executed in a new VM without the git clone, so they read from the installed version of the codebase (potentially a different version of the codebase). I don't think that should really be addressed by this change, but we should resolve this ambiguity.
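The path-shadowing behaviour described here can be demonstrated in isolation. The package name below is invented for the demo; it stands in for the cloned checkout shadowing the installed cpg_workflows:

```python
import pathlib
import sys
import tempfile

# Demonstrate that a package directory earlier on sys.path shadows an
# installed package of the same name — analogous to the git-cloned
# production-pipelines checkout shadowing the installed cpg_workflows.
tmp = pathlib.Path(tempfile.mkdtemp())
pkg = tmp / 'cpg_workflows_demo'  # stand-in name, not the real package
pkg.mkdir()
(pkg / '__init__.py').write_text("SOURCE = 'local clone'\n")

sys.path.insert(0, str(tmp))  # like running main.py from the repo root
import cpg_workflows_demo

print(cpg_workflows_demo.SOURCE)  # prints 'local clone': the local copy wins
```

Python resolves imports by walking sys.path in order, so whichever copy appears first is the one that runs; a VM without the clone falls through to site-packages instead.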

A blue sky thinking approach to this division:

- src
  - alignment_workflow
    - stages
      - align
      - cram_qc
    - version.py
  - genotype_workflow
    - stages
      - genotype
      - gvcf_qc
    - version.py
  - gatk_sv_workflow
    - stages
    - version.py
  - large_cohort_workflow
    - stages
    - version.py
  - rare_disease_workflow
    - stages
    - version.py
  - ...? other pipelines we want to version separately
  - cpg_workflows (the library which represents the abstract pipeline - interactions with metamist, setting up a workflow graph, etc.)
- main.py (imports from all the relevant INSTALLED pipelines and builds workflows from them)
- tests
- pyproject.toml
  • This would enable us to ensure we're only using the installed version of the code (from cpg_workflows.stages.XXX import YYY is guaranteed to use the installed code, as there's no local path ./cpg_workflows)
  • This would enable us to split out the pipeline framework from the stages that rely on it
  • This would enable us to version each coherent pipeline separately
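A toy sketch of the per-pipeline versioning this layout would allow. Every name here is hypothetical, not the real production-pipelines API; the point is only that the driver could record exactly which installed pipeline versions a run used:

```python
from dataclasses import dataclass, field

# Toy sketch: each separately installed workflow package exposes its own
# version, and the driver records what is actually running.
# All names are hypothetical, not the real production-pipelines API.
@dataclass
class WorkflowPackage:
    name: str
    version: str
    stages: list[str] = field(default_factory=list)

def build_run_manifest(workflows: list[WorkflowPackage]) -> dict:
    # An unambiguous record of which pipeline versions this run used,
    # resolving the "what code are we running" ambiguity.
    return {w.name: {'version': w.version, 'stages': w.stages} for w in workflows}

alignment = WorkflowPackage('alignment_workflow', '1.2.0', ['Align', 'CramQC'])
genotype = WorkflowPackage('genotype_workflow', '0.9.1', ['Genotype', 'GvcfQC'])
manifest = build_run_manifest([alignment, genotype])
```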

Undoubtedly this is more effort: it would involve reorganising the production-pipelines codebase and building a new version of the cpg_workflows image for each test run in GCP.

However, it would resolve a fundamental ambiguity in our production pipelines workflows - when we run our pipelines, we don't truly know which version of the code we're running. That is a very stupid problem to have, when this whole effort is around versioning our pipelines more specifically. We ARE using cloud compute, so we ARE using containerised pipelines. With our current structure, we're using some of the code in the container, and some pulled in from github at runtime.

This is kinda twinned with #647, a pet peeve of mine - we're aware, and kind of.. OK? with not really knowing exactly what code is going to run in our main pipeline.

n.b. this isn't a proposal so much as it is a kind of... rant. What we have is not good, I don't think this PR solves that (it covers physical separation of the stages, but doesn't address things we should probably address first, like "what code are we running")

^^ I actually don't know if any of this is feasible

@MattWellie MattWellie mentioned this pull request Jul 4, 2024
@MattWellie (Contributor)
As a side note, this is a huge leap considering that AFAIK the RFC/scoping doc here has not been reviewed by anyone in software, and there are a number of discussion points identifying that @vivbak and other members of data/software should be consulted before going ahead with implementation.
