TODO for first draft of Phase 1 paper. #52
Comments
Updates:
|
Thanks for the update! Two things:
|
Given the maintained accuracy seen in #38 (comment), we can proceed with a sharded run with AF>=0.01 + 10kb SV windowing over all of chr6. To determine more appropriate shards, we can use a run linked in that comment, for which we ran only 2 manually specified 10Mb Shapeit4 shards: https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/6e2c5ed2-a102-466e-9dbc-8eae7bae1021
This run contains a filtered+windowed+FilterAndConcat VCF (where FilterAndConcat refers to singleton filtering and short+SV concatenation) over all of chr6, not just the 2 10Mb shards, which is the input to Shapeit4:
We can use the GLIMPSE1 chunk tool on this VCF to generate better shards than our naive 10Mb shards:
Cutting the regions from the resulting file yields:
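A minimal sketch of that cutting step, assuming the chunk file is whitespace-delimited with the chunk ID in column 1, the contig in column 2, the buffered region in column 3, and the unbuffered output region in column 4. The column layout is an assumption to check against the actual file, the names `Shard` and `parse_chunks` are hypothetical, and the coordinates in the example are made up for illustration rather than taken from the run above.

```python
from typing import List, NamedTuple


class Shard(NamedTuple):
    chunk_id: str
    contig: str
    buffered_region: str  # assumed column 3: region including buffers
    output_region: str    # assumed column 4: region without buffers


def parse_chunks(text: str) -> List[Shard]:
    """Parse chunk-file text into shards, skipping lines with fewer than 4 fields."""
    shards = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        shards.append(Shard(fields[0], fields[1], fields[2], fields[3]))
    return shards


# Made-up example content; NOT the real chr6 chunks.
example = (
    "0\tchr6\tchr6:1-10200000\tchr6:1-10000000\n"
    "1\tchr6\tchr6:9800000-20200000\tchr6:10000001-20000000\n"
)
regions = [s.buffered_region for s in parse_chunks(example)]
print(regions)
```

Whether the buffered or unbuffered regions should be passed to Shapeit4 depends on how the workflow handles shard overlap, so the sketch keeps both.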
Copied to
Did this on a VM manually; we should put it into the workflow before kicking off WG. It probably doesn't make much of a difference, since our panel is probably plenty dense, but it's the sort of thing you're supposed to do, so we might as well.
Kicked off the PhasedPanelEvaluationFromHiPhase workflow leaving out 40 HPRC samples here: https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/913b51d5-df0b-4919-a6cb-6db485d6c1b9
If the numbers look good, we should run InputPhasedPanelEvaluation using the Shapeit4 result (which is on the full panel) here to generate the full KAGE+GLIMPSE panel, rather than the leave-out. Then we can reimpute MAGE and get feedback on whether this reduced chr6 panel with fewer short variants is acceptable for eQTLs. If so, then we can proceed to WG. Alternatively, if the costs start looking more reasonable without HiPhase, then we can go ahead with the unfiltered/windowed panel.
UPDATE: The second shard consistently fails with this scheme, even going up to a very underutilized 96GB (see https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/080bbe5c-ef87-4fd7-955c-62bd1ec3ba21 and https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/bd097085-f637-4579-a787-5720d27ae440). This shard has the most variants at ~140k, although a couple of others have ~120-130k and succeed. Others have ~60k. Bumping down the min shard size to 5Mb yields:
UPDATE: Well, that didn't help, since the variant count didn't drop in that failing shard; most likely this is HLA. Just went back to the original 13 shards and cranked up to 128GB. It might be a good idea to have a strategy for tuning runtimes to a particular sharding scheme at some point, if OOM retry continues to be so unreliable.
Completed:
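On tuning to a sharding scheme: one possible starting point is to count variants per shard up front and flag the heavy ones before launching, so they can be given more resources or inspected separately. The sketch below assumes variant positions for the contig are already available as a list of integers and that shards are inclusive `(start, end)` intervals. The function names and the threshold are hypothetical, and since the 96GB attempt was reported as underutilized, variant count may not be the only factor behind the failures.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Tuple

Interval = Tuple[int, int]


def count_variants_per_shard(
    positions: List[int], shards: List[Interval]
) -> Dict[Interval, int]:
    """Count positions falling in each inclusive (start, end) shard."""
    positions = sorted(positions)
    counts = {}
    for start, end in shards:
        # Index of the first position >= start and first position > end.
        counts[(start, end)] = bisect_right(positions, end) - bisect_left(positions, start)
    return counts


def flag_heavy_shards(counts: Dict[Interval, int], max_variants: int) -> List[Interval]:
    """Return shards whose variant count exceeds a hypothetical threshold."""
    return [shard for shard, n in counts.items() if n > max_variants]


# Toy data for illustration only.
positions = [5, 10, 15, 20, 25, 30, 31, 32, 33]
shards = [(1, 20), (21, 40)]
counts = count_variants_per_shard(positions, shards)
heavy = flag_heavy_shards(counts, max_variants=4)
print(counts, heavy)
```

In a real workflow the flagged shards could be mapped to a larger memory request, with the threshold calibrated from runs that are known to succeed or fail.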
Once these runs complete, I think we have the following cohorts for comparison on 40 HPRC samples, which should provide a basis for plots of summary statistics and accuracy vs. dipcall:
|
Notes for resolving GATK-SV symbolic alleles:
A preliminary version of the plot revealed that only short-variant alleles are resolved, yielding ~10% SV recall: https://docs.google.com/presentation/d/1z5TFuvydlCBbskWCsGzLnloHJKkiF1U9AIr8RfxM1TQ/edit?usp=sharing
40 HPRC GATK-SV resolved, w/ realign flags: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/30beaa70-385c-4ef8-ba9b-05c70446732f
HPRC + AoU 1074-sample whole-genome HiPhased VCFs have been completed and pasted to the Terra work table. The cost varies from $0.1 to $0.9 per sample. Currently merging them into multi-sample filtered VCFs.
Will come back and write a status update now that physical+statistical phasing is complete, but see status slide from today. Hit a snafu when building the KAGE panel. The KAGE obgraph package didn't like a site at
Surprised this is the only instance in ~20M variants. Not quite sure why it's problematic, but it may have something to do with the fact that it's not atomized; in which case, not quite sure why the upstream
Will manually remove and rerun for now.
@hangsuUNC Run HiPhase on whole genome using current parameters with no filtering for all samples:
@samuelklee Experiment with reducing short-variant count for Shapeit4 while maintaining SV imputation performance (Iris is focusing on AF > 0.05 for eQTL analyses):
@hangsuUNC Statistical phasing:
Goal is for phasing to be complete within the first week or so. Hopefully, HiPhase can be done in the next day or two, at the very least; Shapeit4 could stand some tinkering with cutting variant count, but at some point we should just bite the bullet. Perhaps after running the sharded workflow on a single chromosome and confirming the cost. What do you think, @hangsuUNC?
@samuelklee Main figure (w/ chr6 first, WG when ready):
@samuelklee Generate inputs for eQTL:
@samuelklee Supplementary figures:
@kvg anything missing?
EDIT: Rather than leave-out-all-HPRC, let's do leave-out-40-HPRC; these are the 40 in TGP with readily available SR.