Automated evaluation of current end-to-end pipeline. #1

samuelklee · 2024-07-24T20:31:45Z

Meta-issue. Please spin out issues and self-assign.

For the first cut, let's organize WDLs and resources for a pipeline that takes

Inputs

the joint short-variant callset
the integrated SV callset
any required resources

and goes through

Methods

physical phasing with HiPhase
short + SV concatenation, variant deduplication, and allele-frequency filtering
statistical phasing with Shapeit4
preprocessing and bubble creation for PanGenie

covered by per-stage evaluations (as sensible)

Evaluations

vcfdist vs. HPRC dipcall truth
bipartite-graph checks
inconsistency checks for phased haplotypes
missingness metrics
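
As a rough illustration of what a missingness metric could look like, here is a minimal sketch that computes the fraction of missing haplotype-level alleles from a list of GT strings. The function name `haplotype_missingness` and the toy input are assumptions for illustration only, not the evaluation actually implemented in the WDLs.

```python
from typing import Iterable


def haplotype_missingness(genotypes: Iterable[str]) -> float:
    """Return the fraction of haplotype-level alleles that are missing ('.').

    Hypothetical helper: each genotype is a VCF-style GT string such as
    '0|1', './.', or '.|1'.
    """
    missing = 0
    total = 0
    for gt in genotypes:
        # Treat phased ('|') and unphased ('/') separators the same way.
        alleles = gt.replace("/", "|").split("|")
        total += len(alleles)
        missing += sum(1 for allele in alleles if allele == ".")
    return missing / total if total else 0.0


# Toy example: 3 of the 8 alleles are missing.
example_gts = ["0|1", ".|1", "1|1", ".|."]
print(haplotype_missingness(example_gts))
```

Computing this per sample and per VCF stage would make it easy to see at which stage alleles start going missing.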

For now, feel free to open PRs and merge without review, but please use descriptive commit messages and PR titles. Furthermore, please commit fresh copies of all relevant WDLs, indicating versioned provenance and linking to the source in a corresponding PR comment where appropriate/possible.

I will organize the end-to-end evaluation in a megaWDL and then do a round of cleanup of the subworkflows after an initial manual run (with the goal being to show that cleanup does not affect performance). I expect running this evaluation to be manual for the near future, but we can think about CI testing later if it makes sense.

We'll continue to work with hg38 chr1:100-110Mbp to start.

Once this settles (hopefully within a week or two), we'll be better able to see where @rlorigro can slot in Hapestry methods and demonstrate improvement. If it makes sense, we can expand coverage of the pipelines upstream to intra/intersample integration and downstream to SR genotyping/phasing/imputation.

@rlorigro can run an end-to-end evaluation on his own, understands the inputs/outputs, and feels that he can either use the evaluation to inform Hapestry development or suggest improvements to the evaluation itself.

samuelklee · 2024-08-01T17:26:05Z

#20 was just merged and gets us most of the way there. Some remaining TODOs:

Add other non-vcfdist evaluations. Perhaps do these all in one subworkflow per sample and VCF stage? EDIT: Added my naive overlap check in #22 ("Refactored Vcfdist WDL and added calculation of cohort-level overlap metrics").
Add a step to summarize all evaluations over all samples for each VCF stage. EDIT: Added in #35 ("Added SummarizeEvaluations task to PhasedPanelEvaluation"). The evaluation summary can perhaps be expanded later.

But even before adding these, note that the drop in recall in the last VCF produced by PanGeniePanelCreation (e.g., for SVs, from 82% in VcfdistEvaluationShapeit4 to ~60% in VcfdistEvaluationPanel) is a good enough indicator of remaining overlap issues to guide development for now. Again, this drop results solely from the PanGenie script removing an entire allele if it is found to overlap in any one sample. In this case, the 6 additional FNs (note one locus with multiple removed alleles) are at:

Two overlapping variants at same haplotype at chr1:100528654, set allele to missing.
Two overlapping variants at same haplotype at chr1:104688481, set allele to missing.
Two overlapping variants at same haplotype at chr1:105788297, set allele to missing.
Two overlapping variants at same haplotype at chr1:107555333, set allele to missing.
Two overlapping variants at same haplotype at chr1:108689608, set allele to missing.
Two overlapping variants at same haplotype at chr1:108689608, set allele to missing.
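
A quick way to tally messages like these per locus is sketched below. The regular expression is written against the message format shown above; the helper name `tally_overlap_loci` is hypothetical.

```python
import re
from collections import Counter

# Pattern matching the overlap messages quoted above.
LOG_PATTERN = re.compile(
    r"Two overlapping variants at same haplotype at ([^:\s]+):(\d+), "
    r"set allele to missing\."
)


def tally_overlap_loci(log_text: str) -> Counter:
    """Count overlap messages per (contig, position) locus."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        match = LOG_PATTERN.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


log_text = """\
Two overlapping variants at same haplotype at chr1:100528654, set allele to missing.
Two overlapping variants at same haplotype at chr1:104688481, set allele to missing.
Two overlapping variants at same haplotype at chr1:105788297, set allele to missing.
Two overlapping variants at same haplotype at chr1:107555333, set allele to missing.
Two overlapping variants at same haplotype at chr1:108689608, set allele to missing.
Two overlapping variants at same haplotype at chr1:108689608, set allele to missing.
"""

counts = tally_overlap_loci(log_text)
for (contig, pos), n in sorted(counts.items()):
    print(f"{contig}:{pos}\t{n}")
```

On the six messages above this gives five distinct loci, with chr1:108689608 reported twice.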

It is easy to see how this problem gets worse as we add samples: each additional sample is another chance for a given allele to overlap and be removed.
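
To put a rough number on the scaling, here is a back-of-the-envelope sketch. It assumes (purely for illustration) that each sample independently has the same probability `p_overlap` of showing an overlap at a given allele, so the chance that the allele is removed from the panel is 1 - (1 - p)^N for N samples. Both the independence assumption and the example value of 1% are hypothetical, not measured.

```python
def prob_allele_dropped(p_overlap: float, n_samples: int) -> float:
    """Probability that at least one of n_samples shows an overlap.

    Assumes independent samples with identical per-sample overlap
    probability p_overlap (an illustrative simplification).
    """
    return 1.0 - (1.0 - p_overlap) ** n_samples


# Hypothetical per-sample overlap probability of 1%.
for n in (10, 100, 1000):
    print(n, round(prob_allele_dropped(0.01, n), 3))
```

Even a small per-sample overlap rate makes whole-allele removal close to certain once the cohort is large, which is why setting individual genotypes to missing scales better.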

In any case, we should already have what we need for some simple experiments, e.g.:

@samuelklee can modify PanGenie panel creation so that genotypes are set to missing, rather than whole alleles being dropped. (Hopefully this gets us most of the way there, but we should still try to get a self-consistent output from the phasing pipeline.) EDIT: Obviated by below.
@fabio-cunial can insert a more sophisticated cleanup step after HiPhase and/or Shapeit4.
@hangsuUNC can experiment with phase blocks (#14 "Test using Shapeit4 with --use-PS option" and #15 "Finish developing/evaluating hapestry merge").
@hangsuUNC can insert a cleanup step before HiPhase (#11 "Clean up phasing info before Hiphase", although it might be better to wait until inconsistency metrics have been added for this one).
@rlorigro can sub in a Hapestry chr1:100-110Mbp VCF for the current kanpig intra + truvari inter input.
@fabio-cunial can likewise sub in a kanpig intra + kanpig inter input.
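
For the first experiment in the list above, the difference between dropping a whole allele and setting individual genotypes to missing can be sketched on a toy panel. This is not the PanGenie script. The data model (a dict from variant ID to one allele per haplotype column, plus a set of flagged `(variant, column)` pairs) and both function names are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Panel = Dict[str, List[str]]   # variant ID -> allele per haplotype column
Flags = Set[Tuple[str, int]]   # (variant ID, haplotype column) with an overlap


def drop_flagged_variants(panel: Panel, flags: Flags) -> Panel:
    """Remove a variant for all samples if it is flagged on any haplotype."""
    flagged = {variant for variant, _ in flags}
    return {v: list(alleles) for v, alleles in panel.items() if v not in flagged}


def mask_flagged_genotypes(panel: Panel, flags: Flags) -> Panel:
    """Keep every variant; set only the flagged haplotype alleles to missing."""
    return {
        v: ["." if (v, i) in flags else allele for i, allele in enumerate(alleles)]
        for v, alleles in panel.items()
    }


# Toy panel: two variants across four haplotype columns (two samples).
panel = {
    "sv1": ["1", "0", "1", "0"],
    "sv2": ["0", "1", "1", "0"],
}
flags = {("sv1", 2)}  # sv1 overlaps another variant on haplotype column 2

dropped = drop_flagged_variants(panel, flags)
masked = mask_flagged_genotypes(panel, flags)
print(dropped)
print(masked)
```

With dropping, `sv1` disappears for every sample, so any sample that truly carries it becomes a false negative. With masking, only the one conflicting haplotype loses its call.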

samuelklee · 2024-08-01T19:46:28Z

@kvg probably a good point for you to take a look and get caught up. Sorry, took just over a week 😆

samuelklee · 2024-09-20T20:47:57Z

I think we can consider this first push complete!

rlorigro · 2024-09-20T20:49:20Z

vcfdist still broken :(

rlorigro · 2024-09-20T20:49:40Z

but yes pipeline runs end to end 🎉

samuelklee · 2024-09-20T20:50:53Z

File an issue 😜 I did enable those arguments in a test run; didn’t seem to move the numbers too much, but maybe the story is different for tandems.

samuelklee assigned samuelklee, rlorigro and hangsuUNC Jul 24, 2024

samuelklee mentioned this issue Aug 1, 2024

Added first iteration of PhasedPanelEvaluation with PanGeniePanelCreation and refactored phasing pipeline. #20

Merged

rlorigro added the "good first issue" label Aug 1, 2024

samuelklee mentioned this issue Aug 26, 2024

Finish developing/evaluating hapestry merge #15

Open

36 tasks

samuelklee closed this as completed Sep 20, 2024
