Explore and document conventions for variant representation and implications for phasing, etc. #24

Open
samuelklee opened this issue Aug 9, 2024 · 1 comment


samuelklee commented Aug 9, 2024

Touched upon in the internal meeting just now, as well as at the end of #11 (comment).

The goal is to understand representation conventions at each stage so that we can 1) ensure optimal results from each method and 2) easily iterate on methods/modules. (For example, if kanpig/HiPhase emit representations with certain conventions, then we'd want to understand whether hapestry should respect those conventions, and what the implications are for evaluations.) This includes understanding the implications of hom-ref, missing, etc. genotypes when reconstructing haplotypes from a given set of records (which may or may not be self-consistent).

I think we should ultimately work towards a scenario in which, given documented conventions, the output of any read-backed tool contains enough information to unambiguously yield the corresponding single-sample diploid haplotype bubbles for both short variants and SVs before Shapeit4. Any variants that are unphased within a bubble are dropped, and extraneous alleles are trimmed (rather than being represented as hom-ref or missing). These bubbles would essentially be disjoint read-backed phase sets, whose phasing relative to each other would be refined via statistical phasing alone.
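To make the idea concrete, here is a minimal sketch of turning one sample's records into per-phase-set haplotype bubbles. It is only an illustration under stated assumptions, not a description of what any existing tool does: the `Record` tuple and the `phase_set_bubbles` function are hypothetical names, records are assumed to carry a VCF-style `GT` string and `PS` value, and unphased hom-alt calls are dropped along with every other unphased record (a convention we would still need to decide on).

```python
from collections import defaultdict
from typing import NamedTuple, Optional, Tuple


class Record(NamedTuple):
    """Hypothetical, simplified view of one single-sample VCF record."""
    pos: int                 # 1-based POS
    ref: str                 # REF allele
    alts: Tuple[str, ...]    # ALT alleles
    gt: str                  # GT string, e.g. "0|1", "0/1", "./."
    ps: Optional[int]        # PS (phase set) value, or None if absent


def phase_set_bubbles(records):
    """Group phased non-ref alleles by phase set.

    Returns {ps: (hap0, hap1)}, where each haplotype is a list of
    (pos, ref, alt) tuples. Unphased records are dropped, and ref or
    missing alleles contribute nothing, so no hom-ref or missing
    placeholders survive in the output.
    """
    bubbles = defaultdict(lambda: ([], []))
    for rec in records:
        # Assumption: anything without "|" or without a PS is unphased.
        if "|" not in rec.gt or rec.ps is None:
            continue
        alleles = rec.gt.split("|")
        if len(alleles) != 2:
            continue  # this sketch handles diploid calls only
        for hap, allele in enumerate(alleles):
            if allele in (".", "0"):
                continue  # missing or ref: nothing to place on this haplotype
            # Only the ALT actually carried is kept, which trims unused alleles.
            bubbles[rec.ps][hap].append((rec.pos, rec.ref, rec.alts[int(allele) - 1]))
    return dict(bubbles)
```

For example, a phase set containing `0|1`, `1|0`, and `0|0` records yields one bubble with a single allele on each haplotype, and the `0|0` record leaves no trace.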

There still remains the issue of making things nice at the multisample level. One option would then be to convert these single-sample bubbles to explicit alleles and multisample bubbles before Shapeit4 (essentially the PanGenie bubble creation we do before KAGE+GLIMPSE). This has the limitation that we will not be able to switch haplotypes mid-bubble when doing haplotype copying, so we may need to toss out or collapse overly large alleles (but again, this is the same thing we are assuming in KAGE+GLIMPSE). But I think we would at least be guaranteed that Shapeit4 would not introduce new inconsistent paths; at the very worst, it might assign more than two consistent paths in a given bubble to a single sample, in which case we would just pick the best two according to some criterion.
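As a rough illustration of the "explicit alleles" step, the sketch below assumes the hard part is already done: a multisample bubble interval has been fixed, and each sample's two haplotypes have been spelled out as full sequences over that interval. The function name `explicit_alleles` and its input layout are hypothetical, and how the resulting multiallelic site would then be presented to Shapeit4 is not addressed here.

```python
def explicit_alleles(ref_seq, sample_haps):
    """Collapse per-sample haplotype sequences over one bubble into alleles.

    ref_seq:     reference sequence of the bubble interval
    sample_haps: {sample: (hap0_seq, hap1_seq)} over the same interval

    Returns (alleles, genotypes), where alleles[0] is the reference and
    genotypes maps each sample to a phased GT string of allele indices.
    """
    alleles = [ref_seq]
    genotypes = {}
    for sample, haps in sample_haps.items():
        indices = []
        for seq in haps:
            if seq not in alleles:
                alleles.append(seq)  # each distinct path becomes one explicit allele
            indices.append(alleles.index(seq))
        genotypes[sample] = "|".join(str(i) for i in indices)
    return alleles, genotypes
```

Because every allele here is a complete path through the bubble, any haplotype copied at this site is consistent by construction, which is the guarantee described above; the cost is that no switch can happen inside the bubble.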

@samuelklee
Collaborator Author

Paraphrasing some discussion with @fabio-cunial just now, we can consider the following iterations of the problem:

  1. I give you an inconsistent Shapeit4 VCF, you give me back consistent haplotypes and tell me exactly how to reconstruct them from the VCF you give me. (Already solved by Fabio.)
  2. I give you an inconsistent and not fully phased short+SV concatenated single-sample HiPhase VCF, you give me back the thing that is best for Shapeit4. (Clearly needs some experimentation and better understanding of Shapeit4 behavior.)
  3. Finally, in the last stage, we give Ryan the integrated SV callset, the short variant callset, and the reads, and he gives us back the thing that is best for Shapeit4.

And clearly there is some nontrivial interplay between "optimally" solving the inconsistency problem at the single-sample level and influencing what is best for Shapeit4 at the multisample level.
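For reference, one simple way to flag the kind of inconsistency in question is sketched below. It assumes that "inconsistent" means two non-ref alleles placed on the same haplotype have overlapping reference spans; that definition, and the function name `overlapping_pairs`, are assumptions for illustration and not a description of Fabio's solution. Whether records that merely share a padding base should count as overlapping is exactly the sort of convention this issue is meant to pin down.

```python
def overlapping_pairs(hap_records):
    """Find conflicting records on ONE haplotype.

    hap_records: iterable of (pos, ref, alt) with 1-based pos.
    Two records conflict if their reference spans [pos, pos + len(ref))
    intersect. Returns a list of (earlier_record, later_record) pairs.
    """
    recs = sorted(hap_records)
    conflicts = []
    for i, rec in enumerate(recs):
        end = rec[0] + len(rec[1])  # exclusive end of this record's REF span
        for other in recs[i + 1:]:
            if other[0] >= end:
                break  # sorted by start, so no later record can overlap either
            conflicts.append((rec, other))
    return conflicts
```

Run on each haplotype of a single-sample bubble, an empty result means the haplotype can be applied to the reference without ambiguity under this definition.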
