Simplify Alignment Job and Resolve SAM Flag Issue in CRAM to FASTQ Conversion #802

Draft: wants to merge 90 commits into base: main


michael-harper (Contributor) commented Jun 25, 2024

Context and Motivation

The primary motivation for these changes is a bug in bazam, the tool we previously used to convert CRAM files to FASTQ format for re-alignment. For correctly oriented read pairs on the reverse strand, the bug causes the SAM flag relating to read pair orientation to be overwritten. This matters because read pair orientations are then misrepresented in the re-aligned CRAM files. The flag for the strand of each individual read is not affected and remains correctly preserved.
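
To make the distinction between pair-level and read-level flags concrete, here is a minimal sketch that decodes a SAM FLAG value and reports which bits differ between two values. The bit constants come from the SAM specification; the helper names `describe_flag` and `changed_bits` and the example values 147 and 145 are illustrative assumptions only. The description above does not state which bit bazam rewrites, so the example is not a reproduction of the bug.

```python
# Bit values of the SAM FLAG field, as defined in the SAM specification.
FLAG_BITS = {
    "paired": 0x1,
    "proper_pair": 0x2,
    "read_reverse": 0x10,
    "mate_reverse": 0x20,
    "first_in_pair": 0x40,
    "second_in_pair": 0x80,
}


def describe_flag(flag: int) -> dict[str, bool]:
    """Return which of the bits in FLAG_BITS are set in a FLAG value."""
    return {name: bool(flag & bit) for name, bit in FLAG_BITS.items()}


def changed_bits(before: int, after: int) -> list[str]:
    """List the named bits that differ between two FLAG values."""
    diff = before ^ after
    return [name for name, bit in FLAG_BITS.items() if diff & bit]


# Hypothetical values: 147 = 0x1 + 0x2 + 0x10 + 0x80, 145 = 0x1 + 0x10 + 0x80.
# The read-level strand bit (0x10) is identical in both; only a pair-level bit differs.
print(changed_bits(147, 145))
```

A comparison like this, run over the flags of the original and re-aligned CRAM, is one way to check that pair-level bits survive the round trip while the per-read strand bit is unchanged.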

Changes Introduced

  • Removal of bazam for CRAM to FASTQ conversion in favour of samtools fastq:
    • Replaced bazam with direct extraction of FASTQ files using samtools fastq, which keeps the SAM flags intact.
    • The validation of samtools fastq is discussed in this Slack post.
  • Removal of unused aligners (BWA and BWA-MEM) in favour of just Dragmap.
  • Removal of unused duplicate marking tools; only Picard is now used to mark duplicates.
  • Simplified Alignment Job Workflow:
    • Refactored _get_alignment_input. It could realign from a specified 'cram version' but was seemingly unable to produce a new CRAM version when we wanted to realign. I have added the ability to realign from an input CRAM and then save the result to the newly created 'new_version' directory.
    • This brings it in line with existing code in the align job that already referred to ['workflow']['realign_from_cram_version'].
    • Adjusted the _align_one function to handle direct FASTQ extraction and alignment using the selected aligner.
  • Enhanced Logging and Error Handling:
    • Added informative logging for the CRAM realignment process, and to track the progress and actions taken during alignment.
    • Improved error handling to provide clearer messages when alignment inputs are missing or incorrect.
  • The ability to subset a CRAM was valuable during testing because it improved turnaround time. I have left the subset_cram function in for future use and testing (it is not currently used within the stage).
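
As a rough illustration of the samtools-based extraction described in the list above, the sketch below assembles a shell pipeline that groups reads by name with `samtools collate` and then writes paired FASTQ files with `samtools fastq`. The function name `build_cram_to_fastq_cmd`, the file paths, and the default thread count are hypothetical and are not taken from this PR. The actual command used in the align job may differ.

```python
import shlex


def build_cram_to_fastq_cmd(
    cram_path: str,
    reference_path: str,
    fq1_path: str,
    fq2_path: str,
    threads: int = 4,  # assumed default, not from the PR
) -> str:
    """Build a 'samtools collate | samtools fastq' pipeline as a shell string."""
    # collate groups mates together; -u = uncompressed output, -O = write to stdout
    collate = [
        "samtools", "collate",
        "--reference", reference_path,
        "-@", str(threads),
        "-u", "-O", cram_path,
    ]
    # -1/-2 receive the paired reads; singletons and unflagged reads are discarded;
    # -n keeps read names as they are; the trailing '-' reads from stdin
    fastq = [
        "samtools", "fastq",
        "-@", str(threads),
        "-1", fq1_path,
        "-2", fq2_path,
        "-0", "/dev/null",
        "-s", "/dev/null",
        "-n", "-",
    ]
    # shlex.join quotes each argument so paths with spaces stay intact
    return " | ".join(shlex.join(part) for part in (collate, fastq))


print(build_cram_to_fastq_cmd("sample.cram", "ref.fa", "R1.fq.gz", "R2.fq.gz"))
```

Building the command from argument lists and quoting with `shlex.join` avoids breakage when a path contains spaces or shell metacharacters, which matters if the string is later embedded in a batch job script.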

…emoval of Bazam when realigning from cram file
…ferentiate between picard extracted and samtools extracted. Also adding config parameter 'extract_picard' to trigger instead of hardcoding into align job
…eaved fastq file as an input to dragen-os aligner. This is to attempt to reproduce the same cram as the input
…The previous approach only allowed for specific versions of crams to be realigned, and we were essentially unable to create a new version of the cram. The new approach allows for specifying if we want to realign from the 'base' cram and creates a new path to the realigned cram as well as the ability to specify a reference to use during realigning
…m. Adding documentation to large_cohort.toml to accurately reflect new usecase
…options to the align stage level to avoid multiple config retrievals. Making code more readable and easier to follow. Also reconfiguring how we parameterise realignment options, providing full explanation in large_cohort.toml file.
…at can be passed to the align job so that if required we can retrieve necessary cram file specified in the newly implemented RealignOptions object