Explore and document conventions for variant representation and implications for phasing, etc. #24

samuelklee · 2024-08-09T18:51:05Z

Touched upon in the internal meeting just now, as well as at the end of #11 (comment).

The goal is to understand representation conventions at each stage so that we can 1) ensure optimal results from each method and 2) easily iterate on methods/modules. (For example, if kanpig/HiPhase emit representations with certain conventions, then we'd want to understand whether hapestry should respect those conventions and what the implications are for evaluations.) This includes understanding the implications of hom-ref, missing, and similar genotypes when reconstructing haplotypes from a given set of records (which may or may not be self-consistent).
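To make the hom-ref/missing question concrete, here is a minimal sketch of haplotype reconstruction from a set of records. Everything in it is an assumption for illustration rather than the behavior of any tool named above: the `Record` type, the `build_haplotype` function, the `missing_as_ref` switch, and the strict rule that two non-ref alleles on the same haplotype may not overlap (anchor bases included).

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Record(NamedTuple):
    """Hypothetical, simplified view of one single-sample VCF record."""
    pos: int                                  # 0-based reference offset (VCF POS minus 1)
    ref: str
    alts: Tuple[str, ...]
    gt: Tuple[Optional[int], Optional[int]]   # None stands for a missing allele (".")
    phased: bool                              # True for "|", False for "/"


def build_haplotype(reference: str, records: Sequence[Record], hap: int,
                    missing_as_ref: bool = True) -> str:
    """Spell out haplotype `hap` (0 or 1), or raise if the records are inconsistent."""
    pieces = []
    cursor = 0  # first reference base not yet copied or replaced
    for rec in sorted(records, key=lambda r: r.pos):
        first, second = rec.gt
        if not rec.phased and first != second:
            raise ValueError(f"unphased het at {rec.pos}: haplotype is ambiguous")
        allele = rec.gt[hap]
        if allele is None:
            if missing_as_ref:
                continue  # convention A (assumed): missing behaves like ref
            raise ValueError(f"missing allele at {rec.pos}")
        if allele == 0:
            continue  # ref allele on this haplotype: nothing to apply
        if rec.pos < cursor:
            # Strict convention (assumed): overlapping non-ref alleles collide.
            raise ValueError(f"overlapping non-ref alleles at {rec.pos}")
        if reference[rec.pos:rec.pos + len(rec.ref)] != rec.ref:
            raise ValueError(f"REF mismatch at {rec.pos}")
        pieces.append(reference[cursor:rec.pos])
        pieces.append(rec.alts[allele - 1])
        cursor = rec.pos + len(rec.ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)
```

Under this sketch, a hom-ref record never changes either haplotype, while a missing genotype either behaves like hom-ref or invalidates the haplotype depending on `missing_as_ref`. That switch is exactly the kind of convention that would need to be documented per tool.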

I think we should ultimately work towards a scenario in which the output of any read-backed tool contains enough information to unambiguously yield the corresponding single-sample diploid haplotype bubbles for both short variants and SVs before Shapeit4, given documented conventions; any variants that are unphased within a bubble are dropped and extraneous alleles are trimmed (rather than being represented as hom-ref or missing). These bubbles would essentially be disjoint read-backed phase sets, whose relative phasing with each other would be refined via statistical phasing alone.
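A rough sketch of that filtering step, assuming records are plain dicts carrying a phase-set ID under `ps` (mirroring the VCF `PS` FORMAT field). The `to_bubbles` name, the dict layout, and the choice to drop partially missing genotypes are all assumptions made for illustration, not a description of what HiPhase or hapestry emit:

```python
from collections import defaultdict


def to_bubbles(records):
    """Group one sample's records into phase-set bubbles (hypothetical helper).

    Each record is a dict with keys: pos, ref, alts, gt, phased, ps.
    """
    bubbles = defaultdict(list)
    for rec in records:
        gt = rec["gt"]
        # Drop anything that is not read-backed phased within a phase set.
        if rec["ps"] is None or not rec["phased"]:
            continue
        # Drop hom-ref and (assumed) any genotype with a missing allele,
        # rather than carrying them along as 0|0 or .|. records.
        if any(a is None for a in gt) or all(a == 0 for a in gt):
            continue
        # Trim ALT alleles this sample does not carry and renumber the GT.
        used = sorted({a for a in gt if a > 0})
        remap = {old: new for new, old in enumerate(used, start=1)}
        remap[0] = 0
        trimmed = dict(
            rec,
            alts=tuple(rec["alts"][a - 1] for a in used),
            gt=tuple(remap[a] for a in gt),
        )
        bubbles[rec["ps"]].append(trimmed)
    return dict(bubbles)
```

Each value in the returned dict is then one candidate bubble: a set of phased, non-ref records whose relative phase is fixed by reads, leaving only the phase between bubbles for statistical phasing.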

There still remains the issue of making things nice at the multisample level. One option would then be to convert these single-sample bubbles to explicit alleles and multisample bubbles before Shapeit4 (essentially the PanGenie bubble creation we do before KAGE+GLIMPSE). This has the limitation that we will not be able to switch haplotypes mid-bubble when doing haplotype copying, so we may need to toss out or collapse overly large alleles (but again, this is the same assumption we make in KAGE+GLIMPSE). But I think we would at least be guaranteed that Shapeit4 would not introduce new inconsistent paths; at the very worst, it might assign more than two consistent paths in a given bubble to a single sample, in which case we would just pick the best two according to some criteria.
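As a sketch of the multisample step, one simple way to form multisample bubbles is to merge overlapping single-sample bubble intervals and set aside any merged bubble that grows too long. This is an illustrative interval merge under assumed half-open `(start, end)` coordinates, not PanGenie's actual procedure; `merge_bubbles` and `max_len` are hypothetical names:

```python
def merge_bubbles(intervals, max_len):
    """Merge overlapping half-open (start, end) bubble intervals from all samples.

    Returns (kept, oversized): merged bubbles no longer than max_len, and
    those that exceed it and may need to be tossed or collapsed.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            # Overlaps the previous bubble: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    kept = [iv for iv in merged if iv[1] - iv[0] <= max_len]
    oversized = [iv for iv in merged if iv[1] - iv[0] > max_len]
    return kept, oversized
```

For example, `merge_bubbles([(10, 20), (15, 30), (40, 45), (100, 500)], max_len=100)` merges the first two intervals into `(10, 30)` and flags `(100, 500)` as oversized. Each sample's explicit allele for a kept bubble would then be its haplotype sequence spelled out over the merged interval.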

samuelklee · 2024-08-09T19:34:14Z

Paraphrasing some discussion with @fabio-cunial just now, we can consider the following iterations of the problem:

1) I give you an inconsistent Shapeit4 VCF, you give me back consistent haplotypes and tell me exactly how to reconstruct them from the VCF you give me. (Already solved by Fabio.)
2) I give you an inconsistent and not fully phased short+SV concatenated single-sample HiPhase VCF, you give me back the thing that is best for Shapeit4. (Clearly needs some experimentation and better understanding of Shapeit4 behavior.)
3) Finally, the last stage is we give Ryan the integrated SV callset, the short variant callset, and the reads, and he gives us back the thing that is best for Shapeit4.

And clearly there is some nontrivial interplay between "optimally" solving the inconsistency problem at the single-sample level and what is best for Shapeit4 at the multisample level.

samuelklee mentioned this issue Aug 13, 2024

Added --extra-args to Shapeit4 and bubbled up more PhysicalAndStatisticalPhasing arguments in PhasedPanelEvaluation. #25

Merged

samuelklee mentioned this issue Aug 26, 2024

Added FixVariantCollisions to PhasedPanelEvaluation WDL. #37

Merged
