
Finish developing/evaluating hapestry merge #15

Open
20 of 36 tasks
rlorigro opened this issue Jul 25, 2024 · 12 comments
@rlorigro
Collaborator

rlorigro commented Jul 25, 2024

  • Move the WDL to appropriate location?
  • Check compatibility with existing pipeline (what are required inputs/outputs)
  • Fix exact duplicates issue (using bcftools or within hapestry)
  • Fix memory error with SCIP solver
  • Find a way to set time limit on solutions (ideally using existing MathOpt API)
  • Address infeasibility-by-construction issue
  • See why alignment identity is seemingly capped at a low maximum
  • See why haplotype coverage is seemingly capped at a low maximum
  • See why some sites have <94 alignments
  • Find Dipcall VCF that includes events down to 10bp for eval
  • Address non-ref path preference by giving ref path an n_cost of 0
  • Check that VcfReader is compatible with TRGT and try using it
  • Finish Gurobi license application
    • Integrate into C++
    • Integrate into WDL/Terra, adding a private bucket to hide license from non-grantees
    • Find a way to avoid re-checking the license every optimization
  • Evaluate whole genome on HPRC 47 benchmark dataset
    • Compare quadratic and simplified linear objectives w.r.t. time/accuracy
    • Compare SCIP and Gurobi solver time (once license is available)
  • Evaluate on the 1074 AoU + HPRC by extracting just the 47 HPRC samples to see how the population scale changed outcomes
  • Add SNPs and small indels to hapestry (@fabio-cunial)
  • Find a way to remove variants that do not contribute any unique path sequences to the graph before aligning reads (@fabio-cunial maybe interested?)
  • Test/evaluate the path deduplication hack
  • Annotate variants that are skipped when optimization fails
  • Fix logging for optimizer steps and harmonize with the GraphAligner log
  • Tune parameters with Vcfdist or GraphEval as objective fn (Attempt to tune Hapestry parameters (maybe with Optuna WDL)? #51)
  • Profile hapestry memory usage and reduce it #55
  • Print to VCF every window that fails in hapestry_merge and find_windows
  • Generate optional substitution VCF and test with VCFdist after Hiphase
  • Make chunk_vcf also chunk the reference FASTA so every worker doesn't load 3GB unnecessarily
  • Devise a smarter window finding strategy that takes into account some metric of sample and population divergence rate
  • Infer a quality for output variants using the read-hap edge weights
  • Find a way to plot the solution objective value w.r.t. time in the solver
  • Fix ERROR: multiple paths are ref-only nodes: <3<2<1 != >1>2>3
  • Add WFA all-vs-all step to the log CSV
  • Cache d_min solution and use it as fallback when joint solver fails
@samuelklee
Collaborator

Great to see some boxes getting checked! 👍 Might be nice to record some notes about the actions taken here, if they were meaty enough and you feel like it would be useful for yourself or others to see at some point.

@rlorigro
Collaborator Author

rlorigro commented Aug 6, 2024

Thanks for checking in. I can point to my commits with a bit of explanation:

  1. @fabio-cunial fixed an issue with BCFtools being cowardly and not merging identical SVs that belonged to different samples (?)
  2. I found the parameters that allow you to add a timeout using the built-in methods of the solver (surprisingly non-trivial to find any info about this, because MathOpt is so new it has no documentation); a minimal sketch of the parameter is shown after this list
  3. I added an additional solver to the preprocessing, which fixes the occasional ploidy constraint violation that I observed; this can happen as a result of filtering alignments or mismapped/contaminant reads. The solution I went with was to remove the minimal number of reads that cause infeasibility (one way to phrase this step is sketched after this list). In this diagram you can see that the sample must have 3 haplotypes (paths) assigned to it, which violates the n=2 ploidy constraint:
    [diagram: sample reads forcing 3 haplotype paths]
  4. I finished the Gurobi license application process and we were granted a Web License which allows up to 400 concurrent solvers. It works fairly seamlessly with MathOpt, but it re-checks the license every optimization, which is wasteful, and there is no documentation about how to pass an existing Gurobi environment to the backend (an untested idea is sketched after this list). I will wait to revisit this until after the other higher-priority issues are resolved.
  5. I have started evaluating and doing a parameter grid search on HPRC 47 chr1, and we are getting promising first results (see below), but it is not quite reproducing the prototype results yet, so I need to do some more digging to see why alignment identity is seemingly capped at a low maximum.
    [image: preliminary HPRC 47 chr1 grid search results]
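
For reference, a minimal sketch of the time-limit parameter from item 2, assuming the OR-Tools MathOpt C++ API (the model contents are elided, and `SolveWithTimeout` and the 30 s value are illustrative, not what hapestry uses):

```cpp
// Minimal sketch: cap solve time via SolveParameters::time_limit.
#include "absl/time/time.h"
#include "ortools/math_opt/cpp/math_opt.h"

namespace math_opt = operations_research::math_opt;

absl::StatusOr<math_opt::SolveResult> SolveWithTimeout(
    const math_opt::Model& model) {
  math_opt::SolveArguments args;
  // At the limit the solver stops and reports its best incumbent
  // solution (if any) rather than running to optimality.
  args.parameters.time_limit = absl::Seconds(30);
  return math_opt::Solve(model, math_opt::SolverType::kGscip, args);
}
```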
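
Item 3's read-removal step can be phrased as its own small ILP. This is a hypothetical single-sample sketch of the idea, not hapestry's actual formulation: drop the fewest reads such that every kept read is covered by some selected path while at most two paths are selected. `Read`, `MinimalReadRemoval`, and all variable names are made up for illustration:

```cpp
// Hypothetical sketch: minimal read removal to restore ploidy feasibility.
#include <string>
#include <vector>

#include "absl/status/status.h"
#include "absl/status/statusor.h"
#include "ortools/math_opt/cpp/math_opt.h"

namespace math_opt = operations_research::math_opt;

struct Read {
  std::vector<int> compatible_paths;  // indices of paths this read fits
};

absl::StatusOr<std::vector<size_t>> MinimalReadRemoval(
    const std::vector<Read>& reads, int n_paths) {
  math_opt::Model model("read_removal");

  std::vector<math_opt::Variable> drop;  // drop[r] == 1 -> discard read r
  std::vector<math_opt::Variable> use;   // use[p] == 1 -> path p selected
  for (size_t r = 0; r < reads.size(); ++r)
    drop.push_back(model.AddBinaryVariable("drop_" + std::to_string(r)));
  for (int p = 0; p < n_paths; ++p)
    use.push_back(model.AddBinaryVariable("use_" + std::to_string(p)));

  // Every read is either dropped or covered by >=1 selected path.
  for (size_t r = 0; r < reads.size(); ++r) {
    math_opt::LinearExpression cover = drop[r];
    for (int p : reads[r].compatible_paths) cover += use[p];
    model.AddLinearConstraint(cover >= 1);
  }

  // Ploidy constraint: a diploid sample may use at most two paths.
  math_opt::LinearExpression n_used;
  for (const math_opt::Variable& u : use) n_used += u;
  model.AddLinearConstraint(n_used <= 2);

  // Objective: remove as few reads as possible.
  math_opt::LinearExpression n_dropped;
  for (const math_opt::Variable& d : drop) n_dropped += d;
  model.Minimize(n_dropped);

  const absl::StatusOr<math_opt::SolveResult> result =
      math_opt::Solve(model, math_opt::SolverType::kGscip);
  if (!result.ok()) return result.status();
  if (!result->has_primal_feasible_solution())
    return absl::InternalError("no feasible solution found");

  std::vector<size_t> dropped;
  for (size_t r = 0; r < reads.size(); ++r)
    if (result->variable_values().at(drop[r]) > 0.5) dropped.push_back(r);
  return dropped;
}
```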
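
For item 4's license re-check, one untested possibility (an assumption on my part, since nothing documented covers this) is MathOpt's incremental solver, which keeps a single Gurobi-backed instance alive across solves; the sketch also assumes the model can be updated in place between windows rather than rebuilt:

```cpp
// Hypothetical workaround (untested): reuse one Gurobi-backed solver
// across repeated solves, so solver/environment construction happens
// once per worker instead of once per optimization.
#include <memory>

#include "absl/status/status.h"
#include "absl/status/statusor.h"
#include "ortools/math_opt/cpp/math_opt.h"

namespace math_opt = operations_research::math_opt;

absl::Status SolveManyWindows(math_opt::Model& model, int n_windows) {
  absl::StatusOr<std::unique_ptr<math_opt::IncrementalSolver>> solver =
      math_opt::NewIncrementalSolver(&model, math_opt::SolverType::kGurobi);
  if (!solver.ok()) return solver.status();

  for (int w = 0; w < n_windows; ++w) {
    // ... update `model` in place for the next window here ...
    const absl::StatusOr<math_opt::SolveResult> result = (*solver)->Solve();
    if (!result.ok()) return result.status();
    // ... read off the solution (variable values, objective, etc.) ...
  }
  return absl::OkStatus();
}
```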

@rlorigro
Collaborator Author

After fixing multiple issues, the most egregious of which was forgetting to use the GIAB confident BED uniformly in the experiment, I am getting much better performance. Here are the results for various values of d_weight in the objective, where d_weight is a scalar that multiplies the cost of the edit distance relative to the cost of adding a new haplotype to the solution.
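
For concreteness, the trade-off being tuned is roughly the following (notation illustrative, not the exact implementation):

$$
\min \;\; \sum_{h} n_{\mathrm{cost}}(h)\,x_h \;+\; d_{\mathrm{weight}} \sum_{r,h} d(r,h)\,a_{r,h}
$$

where $x_h$ selects haplotype $h$ for the solution, $a_{r,h}$ assigns read $r$ to haplotype $h$, $d(r,h)$ is their edit distance, and the ref path gets an $n_{\mathrm{cost}}$ of 0.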

d_weight=1

[image: results for d_weight=1]

d_weight=4

[image: results for d_weight=4]

d_weight=32

[image: results for d_weight=32]

I think there is still room for improvement, so I will continue to investigate the missing haplotype_coverage. @samuelklee since this is nearing a viable output, is there some WDL we should run to see how downstream performance looks for Hapestry on the 47 HPRC samples? It might be interesting to see.

The next major milestone for improving imputation results will probably be the inclusion of small vars, which @fabio-cunial is working on.

@samuelklee
Collaborator

This sounds great! And yes, take a look at PhasedPanelEvaluation—ideally you should just be able to slot your VCF into the joint_sv_vcf input (currently the recall = 0.7 kanpig intra + truvari inter VCF). Let me know if you run into any issues!

@rlorigro
Collaborator Author

For the purpose of comparing to kanpig, is there any way to guarantee equal subsetting of the VCF by a “confident” BED? I don’t want to make the same mistake twice and compare with unequal/missing regions.

@samuelklee
Collaborator

The Vcfdist task takes in a BED file; right now we are using GIAB GRCh38_notinAllTandemRepeatsandHomopolymers_slop5.bed (with a trivial header issue fixed up, I believe), but that can be changed. The overlap metric is calculated over the whole genome, but that could also be subset pretty easily.

@rlorigro
Collaborator Author

I see, I didn't realize we were skipping all the tandems. Is that because of some issue caused by multiallelics? They are probably the biggest area of improvement for hapestry vs truvari.

@samuelklee
Collaborator

I think the choice of that BED file was perhaps arbitrary. I would hope that we expand the Vcfdist task to take in an array of BEDs and stratify. We'll get there!

@rlorigro
Collaborator Author

It looks like all of your evaluation is on GRCh38 😐

@samuelklee
Collaborator

The workflow itself should be fairly reference agnostic, although there may be one or two resources in the evaluations that aren’t as readily available for CHM.

@rlorigro
Collaborator Author

Given that there are about 18 fields in the JSON that would need to be changed, I think I will wait until I rerun hapestry on GRCh38.

@samuelklee
Collaborator

(Also note from #1 that this was the plan from the start… I’d rather not spend time on doing HPRC-only for two references, given that we’re supposedly getting AoU1 access back tomorrow!)
