Automated evaluation of current end-to-end pipeline. #1

Closed
1 task done
samuelklee opened this issue Jul 24, 2024 · 6 comments
Labels
good first issue Good for newcomers

@samuelklee
Collaborator

samuelklee commented Jul 24, 2024

Meta-issue. Please spin out issues and self-assign.

For the first cut, let's organize WDLs and resources for a pipeline that takes

Inputs

  • the joint short-variant callset
  • the integrated SV callset
  • any required resources

and goes through

Methods

  • physical phasing with HiPhase
  • short + SV concatenation, variant deduplication, and allele-frequency filtering
  • statistical phasing with Shapeit4
  • preprocessing and bubble creation for PanGenie

covered by per-stage evaluations (as sensible)

Evaluations

  • vcfdist vs. HPRC dipcall truth
  • bipartite-graph checks
  • inconsistency checks for phased haplotypes
  • missingness metrics

For now, freely open PRs and merge without review, but please do use descriptive commit messages and PR titles. Furthermore, please commit fresh copies of all relevant WDLs, indicate versioned provenance, and provide a link in a corresponding PR comment, if appropriate/possible.

I will organize the end-to-end evaluation in a megaWDL and then do a round of cleanup of the subworkflows after an initial manual run (with the goal being to show that cleanup does not affect performance). I expect running this evaluation to be manual for the near future, but we can think about CI testing later if it makes sense.

We'll continue to work with hg38 chr1:100-110Mbp to start.

Once this settles (hopefully within a week or two), we'll be better able to see where @rlorigro can slot in Hapestry methods and demonstrate improvement. If it makes sense, we can expand coverage of the pipelines upstream to intra/intersample integration and downstream to SR genotyping/phasing/imputation.

  • @rlorigro can run an end-to-end evaluation on his own, understands the inputs/outputs, and feels that he can either use the evaluation to inform Hapestry development or suggest improvements to the evaluation itself.
@samuelklee
Collaborator Author

samuelklee commented Aug 1, 2024

#20 was just merged and gets us most of the way there. Some remaining TODOs:

But even before adding these, note that the drop in recall in the last VCF produced by PanGeniePanelCreation (e.g., for SVs, from 82% in VcfdistEvaluationShapeit4 to ~60% in VcfdistEvaluationPanel) is a good enough indicator of remaining overlap issues to guide development for now. Again, this drop results solely from the PanGenie script removing an entire allele if it is found to overlap in any one sample. In this case, the 6 additional FNs (note one locus with multiple removed alleles) are at:

Two overlapping variants at same haplotype at chr1:100528654, set allele to missing.
Two overlapping variants at same haplotype at chr1:104688481, set allele to missing.
Two overlapping variants at same haplotype at chr1:105788297, set allele to missing.
Two overlapping variants at same haplotype at chr1:107555333, set allele to missing.
Two overlapping variants at same haplotype at chr1:108689608, set allele to missing.
Two overlapping variants at same haplotype at chr1:108689608, set allele to missing.
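
As a quick sanity check on the messages above, here is a minimal Python sketch that tallies them per locus. The regular expression and the `tally_overlaps` helper are illustrative assumptions based only on the six lines quoted here, not part of the PanGenie script. The region bounds assume that "chr1:100-110Mbp" (the window named earlier in this issue) means positions 100,000,000 to 110,000,000.

```python
import re
from collections import Counter

# Log lines copied verbatim from the comment above.
LOG = """\
Two overlapping variants at same haplotype at chr1:100528654, set allele to missing.
Two overlapping variants at same haplotype at chr1:104688481, set allele to missing.
Two overlapping variants at same haplotype at chr1:105788297, set allele to missing.
Two overlapping variants at same haplotype at chr1:107555333, set allele to missing.
Two overlapping variants at same haplotype at chr1:108689608, set allele to missing.
Two overlapping variants at same haplotype at chr1:108689608, set allele to missing.
"""

# Assumed message format, inferred only from the lines quoted above.
PATTERN = re.compile(
    r"Two overlapping variants at same haplotype at (\w+):(\d+), set allele to missing\."
)

def tally_overlaps(log_text):
    """Count overlap messages per (contig, position) locus."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts

counts = tally_overlaps(LOG)
total_messages = sum(counts.values())
repeated = {locus: n for locus, n in counts.items() if n > 1}

# Assumed interpretation of the evaluation window: hg38 chr1:100-110 Mbp.
REGION = ("chr1", 100_000_000, 110_000_000)
all_in_region = all(
    contig == REGION[0] and REGION[1] <= pos <= REGION[2] for contig, pos in counts
)

print(f"{total_messages} messages at {len(counts)} distinct loci")
for (contig, pos), n in sorted(repeated.items()):
    print(f"repeated: {contig}:{pos} x{n}")
print(f"all loci inside evaluation window: {all_in_region}")
```

This reports 6 messages at 5 distinct loci, with chr1:108689608 appearing twice, which matches the note about one locus with multiple removed alleles.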

It is easy to see how this problem is exacerbated when we have many more samples.
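
To make the scaling concrete, here is a rough sketch under strong assumptions: each sample independently has the same probability of carrying an overlap at a given allele, and the allele is dropped if any one sample does. The per-sample rate of 1% is purely hypothetical and is not measured from this callset, and `removal_probability` is an illustrative helper rather than anything in the pipeline.

```python
def removal_probability(p_overlap_per_sample, n_samples):
    """Probability that an allele is removed from the panel.

    Assumes overlaps occur independently in each sample with the same
    probability, and that an overlap in any one sample removes the allele:
    P(removed) = 1 - (1 - p)^n.
    """
    if not 0.0 <= p_overlap_per_sample <= 1.0:
        raise ValueError("p_overlap_per_sample must be in [0, 1]")
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    return 1.0 - (1.0 - p_overlap_per_sample) ** n_samples

# Hypothetical per-sample overlap rate, for illustration only.
HYPOTHETICAL_RATE = 0.01

for n in (1, 10, 100, 1000):
    print(f"{n:>5} samples: P(removed) = {removal_probability(HYPOTHETICAL_RATE, n):.3f}")
```

Under these assumptions, even a small per-sample overlap rate pushes the removal probability toward 1 as the cohort grows. In a real callset overlaps are unlikely to be independent across samples, so treat this as intuition only.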

In any case, we should already have what we need for some simple experiments, e.g.:

@samuelklee
Collaborator Author

@kvg probably a good point for you to take a look and get caught up. Sorry, took just over a week 😆

@samuelklee
Collaborator Author

I think we can consider this first push complete!

@rlorigro
Collaborator

vcfdist still broken :(

@rlorigro
Collaborator

but yes pipeline runs end to end 🎉

@samuelklee
Collaborator Author

File an issue 😜 I did enable those arguments in a test run; didn’t seem to move the numbers too much, but maybe the story is different for tandems.

3 participants