
Finish developing/evaluating hapestry merge #15

Open
20 of 36 tasks
rlorigro opened this issue Jul 25, 2024 · 12 comments
@rlorigro
Collaborator

rlorigro commented Jul 25, 2024

  • Move the WDL to appropriate location?
  • Check compatibility with existing pipeline (what are required inputs/outputs)
  • Fix exact duplicates issue (using bcftools or within hapestry)
  • Fix memory error with SCIP solver
  • Find a way to set time limit on solutions (ideally using existing MathOpt API)
  • Address infeasibility-by-construction issue
  • See why alignment identity is seemingly capped at a low maximum
  • See why haplotype coverage is seemingly capped at a low maximum
  • See why some sites have <94 alignments
  • Find Dipcall VCF that includes events down to 10bp for eval
  • Address non-ref path preference by giving ref path an n_cost of 0
  • Check that VcfReader is compatible with TRGT and try using it
  • Finish Gurobi license application
    • Integrate into C++
    • Integrate into WDL/Terra, adding a private bucket to hide license from non-grantees
    • Find a way to avoid re-checking the license every optimization
  • Evaluate whole genome on HPRC 47 benchmark dataset
    • Compare quadratic and simplified linear objectives w.r.t. time/accuracy
    • Compare SCIP and Gurobi solver time (once license is available)
  • Evaluate on the 1074 AoU + HPRC by extracting just the 47 HPRC samples to see how the population scale changed outcomes
  • Add SNPs and small indels to hapestry (@fabio-cunial)
  • Find a way to remove variants that do not contribute any unique path sequences to the graph before aligning reads (@fabio-cunial maybe interested?)
  • Test/evaluate the path deduplication hack
  • Annotate variants that are skipped when optimization fails
  • Fix logging for optimizer steps and harmonize with the GraphAligner log
  • Tune parameters with Vcfdist or GraphEval as objective fn (Attempt to tune Hapestry parameters (maybe with Optuna WDL)? #51)
  • Profile hapestry memory usage and reduce it #55
  • Print to VCF every window that fails in hapestry_merge and find_windows
  • Generate optional substitution VCF and test with VCFdist after Hiphase
  • Make chunk_vcf also chunk the reference FASTA so every worker doesn't load 3GB unnecessarily
  • Devise a smarter window finding strategy that takes into account some metric of sample and population divergence rate
  • Infer a quality for output variants using the read-hap edge weights
  • Find a way to plot the solution objective value w.r.t. time in the solver
  • Fix ERROR: multiple paths are ref-only nodes: <3<2<1 != >1>2>3
  • Add WFA all-vs-all step to the log CSV
  • Cache d_min solution and use it as fallback when joint solver fails
@samuelklee
Collaborator

Great to see some boxes getting checked! 👍 Might be nice to record some notes about the actions taken here, if they were meaty enough and you feel like it would be useful for yourself or others to see at some point.

@rlorigro
Collaborator Author

rlorigro commented Aug 6, 2024

Thanks for checking in. I can point to my commits with a bit of explanation:

  1. @fabio-cunial fixed an issue with BCFtools being cowardly and not merging identical SVs that belonged to different samples (?)
  2. I found the parameters that allow you to add a timeout using the built-in methods of the solver (surprisingly non-trivial to find any info about this, because MathOpt is so new it has no documentation); a minimal sketch of the parameter is shown after this list
  3. I added an additional solver to the preprocessing, which fixes the occasional ploidy constraint violation that I observed; this can happen as a result of filtering alignments or mismapped/contaminant reads. The solution I went with was to remove the minimal number of reads that cause infeasibility (one way to phrase this step is sketched after this list). In this diagram you can see that the sample must have 3 haplotypes (paths) assigned to it, which violates the n=2 ploidy constraint:
    [diagram: sample reads forcing 3 haplotype paths]
  4. I finished the Gurobi license application process and we were granted a Web License which allows up to 400 concurrent solvers. It works fairly seamlessly with MathOpt, but it re-checks the license every optimization, which is wasteful, and there is no documentation about how to pass an existing Gurobi environment to the backend (an untested idea is sketched after this list). I will wait to revisit this until after the other higher-priority issues are resolved.
  5. I have started evaluating and doing a parameter grid search on HPRC 47 chr1, and we are getting promising first results (see below), but it is not quite reproducing the prototype results yet, so I need to do some more digging to see why alignment identity is seemingly capped at a low maximum.
    [image: preliminary HPRC 47 chr1 grid search results]
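
For reference, a minimal sketch of the time-limit parameter from item 2, assuming the OR-Tools MathOpt C++ API (the model contents are elided, and `SolveWithTimeout` and the 30 s value are illustrative, not what hapestry uses):

```cpp
// Minimal sketch: cap solve time via SolveParameters::time_limit.
#include "absl/time/time.h"
#include "ortools/math_opt/cpp/math_opt.h"

namespace math_opt = operations_research::math_opt;

absl::StatusOr<math_opt::SolveResult> SolveWithTimeout(
    const math_opt::Model& model) {
  math_opt::SolveArguments args;
  // At the limit the solver stops and reports its best incumbent
  // solution (if any) rather than running to optimality.
  args.parameters.time_limit = absl::Seconds(30);
  return math_opt::Solve(model, math_opt::SolverType::kGscip, args);
}
```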
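
Item 3's read-removal step can be phrased as its own small ILP. This is a hypothetical single-sample sketch of the idea, not hapestry's actual formulation: drop the fewest reads such that every kept read is covered by some selected path while at most two paths are selected. `Read`, `MinimalReadRemoval`, and all variable names are made up for illustration:

```cpp
// Hypothetical sketch: minimal read removal to restore ploidy feasibility.
#include <string>
#include <vector>

#include "absl/status/status.h"
#include "absl/status/statusor.h"
#include "ortools/math_opt/cpp/math_opt.h"

namespace math_opt = operations_research::math_opt;

struct Read {
  std::vector<int> compatible_paths;  // indices of paths this read fits
};

absl::StatusOr<std::vector<size_t>> MinimalReadRemoval(
    const std::vector<Read>& reads, int n_paths) {
  math_opt::Model model("read_removal");

  std::vector<math_opt::Variable> drop;  // drop[r] == 1 -> discard read r
  std::vector<math_opt::Variable> use;   // use[p] == 1 -> path p selected
  for (size_t r = 0; r < reads.size(); ++r)
    drop.push_back(model.AddBinaryVariable("drop_" + std::to_string(r)));
  for (int p = 0; p < n_paths; ++p)
    use.push_back(model.AddBinaryVariable("use_" + std::to_string(p)));

  // Every read is either dropped or covered by >=1 selected path.
  for (size_t r = 0; r < reads.size(); ++r) {
    math_opt::LinearExpression cover = drop[r];
    for (int p : reads[r].compatible_paths) cover += use[p];
    model.AddLinearConstraint(cover >= 1);
  }

  // Ploidy constraint: a diploid sample may use at most two paths.
  math_opt::LinearExpression n_used;
  for (const math_opt::Variable& u : use) n_used += u;
  model.AddLinearConstraint(n_used <= 2);

  // Objective: remove as few reads as possible.
  math_opt::LinearExpression n_dropped;
  for (const math_opt::Variable& d : drop) n_dropped += d;
  model.Minimize(n_dropped);

  const absl::StatusOr<math_opt::SolveResult> result =
      math_opt::Solve(model, math_opt::SolverType::kGscip);
  if (!result.ok()) return result.status();
  if (!result->has_primal_feasible_solution())
    return absl::InternalError("no feasible solution found");

  std::vector<size_t> dropped;
  for (size_t r = 0; r < reads.size(); ++r)
    if (result->variable_values().at(drop[r]) > 0.5) dropped.push_back(r);
  return dropped;
}
```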
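
For item 4's license re-check, one untested possibility (an assumption on my part, since nothing documented covers this) is MathOpt's incremental solver, which keeps a single Gurobi-backed instance alive across solves; the sketch also assumes the model can be updated in place between windows rather than rebuilt:

```cpp
// Hypothetical workaround (untested): reuse one Gurobi-backed solver
// across repeated solves, so solver/environment construction happens
// once per worker instead of once per optimization.
#include <memory>

#include "absl/status/status.h"
#include "absl/status/statusor.h"
#include "ortools/math_opt/cpp/math_opt.h"

namespace math_opt = operations_research::math_opt;

absl::Status SolveManyWindows(math_opt::Model& model, int n_windows) {
  absl::StatusOr<std::unique_ptr<math_opt::IncrementalSolver>> solver =
      math_opt::NewIncrementalSolver(&model, math_opt::SolverType::kGurobi);
  if (!solver.ok()) return solver.status();

  for (int w = 0; w < n_windows; ++w) {
    // ... update `model` in place for the next window here ...
    const absl::StatusOr<math_opt::SolveResult> result = (*solver)->Solve();
    if (!result.ok()) return result.status();
    // ... read off the solution (variable values, objective, etc.) ...
  }
  return absl::OkStatus();
}
```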

@rlorigro
Collaborator Author

After fixing multiple issues, the most egregious of which was forgetting to use the GIAB confident BED uniformly in the experiment, I am getting much better performance. Here are the results for various values of d_weight in the objective, where d_weight is a scalar that multiplies the cost of the edit distance relative to the cost of adding a new haplotype to the solution.
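
For concreteness, the trade-off being tuned is roughly the following (notation illustrative, not the exact implementation):

$$
\min \;\; \sum_{h} n_{\mathrm{cost}}(h)\,x_h \;+\; d_{\mathrm{weight}} \sum_{r,h} d(r,h)\,a_{r,h}
$$

where $x_h$ selects haplotype $h$ for the solution, $a_{r,h}$ assigns read $r$ to haplotype $h$, $d(r,h)$ is their edit distance, and the ref path gets an $n_{\mathrm{cost}}$ of 0.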

d_weight=1

[image: results for d_weight=1]

d_weight=4

[image: results for d_weight=4]

d_weight=32

[image: results for d_weight=32]

I think there is still room for improvement, so I will continue to investigate the missing haplotype_coverage. @samuelklee since this is nearing a viable output, is there some WDL we should run to see how downstream performance looks for Hapestry on the 47 HPRC samples? It might be interesting to see.

The next major milestone for improving imputation results will probably be the inclusion of small vars, which @fabio-cunial is working on.

@samuelklee
Collaborator

This sounds great! And yes, take a look at PhasedPanelEvaluation—ideally you should just be able to slot your VCF into the joint_sv_vcf input (currently the recall = 0.7 kanpig intra + truvari inter VCF). Let me know if you run into any issues!

@rlorigro
Collaborator Author

For the purpose of comparing to kanpig, is there any way to guarantee equal subsetting of the VCF by a “confident” BED? I don’t want to make the same mistake twice and compare with unequal/missing regions.

@samuelklee
Collaborator

The Vcfdist task takes in a BED file; right now we are using GIAB GRCh38_notinAllTandemRepeatsandHomopolymers_slop5.bed (with a trivial header issue fixed up, I believe), but that can be changed. The overlap metric is calculated over the whole genome, but that could also be subset pretty easily.

@rlorigro
Collaborator Author

I see, I didn't realize we were skipping all the tandems. Is that because of some issue caused by multiallelics? They are probably the biggest area of improvement for hapestry vs truvari.

@samuelklee
Collaborator

I think the choice of that BED file was perhaps arbitrary. I would hope that we expand the Vcfdist task to take in an array of BEDs and stratify. We'll get there!

@rlorigro
Collaborator Author

It looks like all of your evaluation is on GRCh38 😐

@samuelklee
Collaborator

The workflow itself should be fairly reference agnostic, although there may be one or two resources in the evaluations that aren’t as readily available for CHM.

@rlorigro
Collaborator Author

Given that there are about 18 fields in the JSON that would need to be changed, I think I will wait until I rerun hapestry on GRCh38.

@samuelklee
Collaborator

(Also note from #1 that this was the plan from the start… I’d rather not spend time on doing HPRC-only for two references, given that we’re supposedly getting AoU1 access back tomorrow!)
