Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TODO for first draft of Phase 1 paper. #52

Open
17 of 18 tasks
samuelklee opened this issue Oct 1, 2024 · 10 comments
Open
17 of 18 tasks

TODO for first draft of Phase 1 paper. #52

samuelklee opened this issue Oct 1, 2024 · 10 comments

Comments

@samuelklee
Copy link
Collaborator

samuelklee commented Oct 1, 2024

@hangsuUNC Run HiPhase on whole genome using current parameters with no filtering for all samples:

  • Break out a per-sample workflow from the current PhysicalAndStatisticalPhasing workflow. This should not scatter over samples in the workflow itself; rather, scatter at the Terra level (seems to scale better with Job Manager, easier to test on a few samples).
  • Run on HPRC+AoU1 and attach per-sample short + SV HiPhase outputs to the data model.
  • Create short + SV HiPhase merged VCFs and place in a documented location in the workspace bucket.
  • Lower priority: refactor PhysicalAndStatisticalPhasing to call into the broken-out per-sample workflow.

@samuelklee Experiment with reducing short-variant count for Shapeit4 while maintaining SV imputation performance (Iris is focusing on AF > 0.05 for eQTL analyses):

  • Finish chr6 evaluation, currently running with 10kb windows and AF >= 0.02. This is still using the old unsharded version of the Shapeit4 workflow.
  • Incorporate windowing and AF filtering into sharded Shapeit4 workflow.
  • Rerun sharded workflow on chr6 to verify performance.

@hangsuUNC Statistical phasing:

  • Break out and run sharded Shapeit4 workflow with windows/filters on whole genome.
  • Place Shapeit4 VCF in a documented location in the workspace bucket.
  • Lower priority: refactor PhysicalAndStatisticalPhasing to call into the broken-out sharded Shapeit4 workflow.

Goal is for phasing to be complete done within the first week or so. Hopefully, HiPhase can be done in the next day or two, at the very least; Shapeit4 could stand some tinkering with cutting variant count, but at some point we should just bite the bullet. Perhaps after running the sharded workflow on a single chromosome and confirming the cost. What do you think, @hangsuUNC?

@samuelklee Main figure (w/ chr6 first, WG when ready):

  • Create leave-out-all-HPRC KAGE+GLIMPSE panel and run evaluation.
  • a) summary statistics of 40 HPRC-in-1kGP samples, comparing numbers of SV insertions/deletions across: 1) HPRC assemblies, 2) 8x LR, 3) imputed SR (vs. leave-out-all-HPRC panel), 4) GATK-SV SR. Stratify by region type as minimally as possible. Perhaps only include common?
  • b) precision and recall metrics vs. HPRC dipcall truth: 1) 8x LR, 2) imputed SR (vs. leave-out-all-HPRC panel), 3) GATK-SV SR. Same region/frequency stratification/cutoff. Vcfdist to start, although this will be <5-10kb; perhaps truvari bench if we decide we want to trust that again.

@samuelklee Generate inputs for eQTL:

  • Create KAGE+GLIMPSE chr6 panel.
  • Impute chr6 panel into 731 MAGE samples.
  • Create KAGE+GLIMPSE WG panel.
  • Impute WG panel into 731 MAGE samples.

@samuelklee Supplementary figures:

  • Expanded versions of main figures, with additional SV length, population, and AF stratification. Perhaps include short variants? If we slim down on short variants, ideally precision is still high but recall may be low vs. HPRC dipcall.

@kvg anything missing?

EDIT: Rather than leave-out-all-HPRC, let's do leave-out-40-HPRC---these are the 40 in TGP with readily available SR.

@hangsuUNC
Copy link
Collaborator

Updates:

  1. Single-sample whole-genome Hiphase test 1:https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/3c4ce29e-e758-4d87-a4d6-e13e66d762ac
    cost: $2.27. cost distribution:
    single_sample_hiphase cost
  2. Single-sample whole-genome Hiphase test2, reduce VM size to 2cpu, 8 GB: https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/0e6d1552-3d39-4c3b-b638-49c33ef735b2
    cost: $1.04. Cost distribution:
    single_sample_hiphase cost

@samuelklee
Copy link
Collaborator Author

samuelklee commented Oct 3, 2024

Thanks for the update! Two things:

  1. Note that one flaw of the cost-estimation script is that it doesn’t actually give the real cost. It doesn’t account for preemption and cost rates for various resources are provided as fixed parameters. But hopefully we can roughly trust the distribution it reports.

  2. If so, this suggests that with the per-sample bcftools view, at best we can hope for about a 50-50 split between subsetting and HiPhase. This means we will end up paying at least $500 just to subset the joint VCF! Surely a bcftools split strategy can beat that?

@samuelklee
Copy link
Collaborator Author

samuelklee commented Oct 3, 2024

Will add notes on submissions for chr6 evaluations against dipcall truth here:

40 HPRC GATK-SV, w/o realign flags, 5 samples failing: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/16dd0d83-52da-4e48-b81e-fe51a87a90f7
40 HPRC GATK-SV, w/ realign flags, 1 sample failing: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/5aa1a8f0-8235-4e39-b11f-85266bc02bde

40 HPRC SR vs. full HPRC+AoU1, w/o realign flags: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/afd5183c-cc35-4949-a5bc-6f4f61b9e585
40 HPRC SR vs. full HPRC+AoU1, w/ realign flags: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/c642a3ca-afeb-40b8-a030-f46464867a3d

InputPhasedPanelEvaluation construction/evaluation of HPRC+AoU1-leave-out-40-HPRC, w/ realign flags: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/eb8b7b5c-4962-4b4f-a7d4-42ec72dd0cbe

40 HPRC LR HPRC+AoU1 filter+concat, w/ realign flags: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/128c0302-7110-43f7-93e5-95633875ea0d
40 HPRC LR HPRC+AoU1 panel, w/ realign flags: https://app.terra.bio/#workspaces/allofus-drc-wgs-LR-prodPaper/AoU_DRC_WGS_LongReads_Imputation/job_history/ef9cf97c-6342-42e6-a6e7-4772b9882e39

Note that the realign flags seem to make a minimal impact, at least from spot checking; I'll double-check with plots across all samples.

@samuelklee
Copy link
Collaborator Author

samuelklee commented Oct 3, 2024

Given maintained accuracy seen in #38 (comment), we can proceed with a sharded run with AF>=0.01 + 10kb SV windowing over all of chr6.

To determine more appropriate shards, we can use a run linked in that comment, for which we only ran 2 manually specified 10Mb Shapeit4 shards: https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/6e2c5ed2-a102-466e-9dbc-8eae7bae1021

This run contains an filtered+windowed+FilterAndConcat (where FilterAndConcat refers to singleton filtering and short+SV concatenation) VCF over all of chr6, not just the 2 10Mb shards, which is the input to Shapeit4: gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/submissions/6e2c5ed2-a102-466e-9dbc-8eae7bae1021/PhasedPanelEvaluation/bf009a41-e06b-446e-a62b-03373f65dc8c/call-FilterAndConcatVcfs/HPRCAOU.chr6.filter_and_concat.vcf.gz

We can use the GLIMPSE1 chunk tool on this VCF to generate better shards than our naive 10Mb shards:

./GLIMPSE_chunk_static -I HPRCAOU.chr6.filter_and_concat.vcf.gz --region chr6 --window-size 10000000 --buffer-size 500000 -O chr6.chunks.tsv

[GLIMPSE] Split chromosomes into chunks
  * Author        : Simone RUBINACCI & Olivier DELANEAU, University of Lausanne
  * Contact       : [email protected] & [email protected]
  * Version       : 1.1.1
  * Run date      : 03/10/2024 - 20:59:14

Files:
  * Input VCF      : [HPRCAOU.chr6.filter_and_concat.vcf.gz]
  * Chromosome     : [chr6]
  * Output file    : [chr6.chunks.tsv]

Parameters:
  * Seed             : 15052011
  * #Threads   : 1
  * Min. Window size : 10000000bp / 1000 variants
  * Min. Buffer size : 500000bp / 100 variants

Reading input files
  * Main      : [HPRCAOU.chr6.filter_and_concat.vcf.gz]
  * #variants = 1009520 (385.52s)

Splitting data into chunks and writting to [chr6.chunks.tsv]
  * Internal window [chr6:72241-170744989] / L=170672749bp / C=1009520
  * Internal window [chr6:72241-78630557] / L=78558317bp / C=504761
  * Internal window [chr6:72241-32641877] / L=32569637bp / C=252381
  * Terminal window [0] -buffer:[chr6:72241-19099133] / +buffer:[chr6:72241-19599242] / L=19026893bp / C=126191
  * Terminal window [1] -buffer:[chr6:19099170-32641877] / +buffer:[chr6:18577569-33141881] / L=13542708bp / C=126190
  * Internal window [chr6:32641878-78630557] / L=45988680bp / C=252380
  * Internal window [chr6:32641878-54941581] / L=22299704bp / C=126191
  * Terminal window [2] -buffer:[chr6:32641878-42755864] / +buffer:[chr6:32070923-43280008] / L=10113987bp / C=63096
  * Terminal window [3] -buffer:[chr6:42755911-54941581] / +buffer:[chr6:42255897-55441622] / L=12185671bp / C=63095
  * Internal window [chr6:54941691-78630557] / L=23688867bp / C=126189
  * Terminal window [4] -buffer:[chr6:54941691-67331114] / +buffer:[chr6:54441197-67831170] / L=12389424bp / C=63095
  * Terminal window [5] -buffer:[chr6:67331412-78630557] / +buffer:[chr6:66796360-79162564] / L=11299146bp / C=63094
  * Internal window [chr6:78630635-170744989] / L=92114355bp / C=504759
  * Internal window [chr6:78630635-132352709] / L=53722075bp / C=252380
  * Internal window [chr6:78630635-104071527] / L=25440893bp / C=126191
  * Terminal window [6] -buffer:[chr6:78630635-91152068] / +buffer:[chr6:78130532-91652103] / L=12521434bp / C=63096
  * Terminal window [7] -buffer:[chr6:91152292-104071527] / +buffer:[chr6:90652274-104577123] / L=12919236bp / C=63095
  * Internal window [chr6:104071528-132352709] / L=28281182bp / C=126189
  * Terminal window [8] -buffer:[chr6:104071528-118372982] / +buffer:[chr6:103566858-118885012] / L=14301455bp / C=63095
  * Terminal window [9] -buffer:[chr6:118373301-132352709] / +buffer:[chr6:117873097-132988895] / L=13979409bp / C=63094
  * Internal window [chr6:132352735-170744989] / L=38392255bp / C=252379
  * Internal window [chr6:132352735-157241583] / L=24888849bp / C=126190
  * Terminal window [10] -buffer:[chr6:132352735-146564675] / +buffer:[chr6:131821158-147064995] / L=14211941bp / C=63098
  * Terminal window [11] -buffer:[chr6:146564679-157241583] / +buffer:[chr6:146055131-157741653] / L=10676905bp / C=63092
  * Terminal window [12] -buffer:[chr6:157241819-170744989] / +buffer:[chr6:156680356-170744989] / L=13503171bp / C=126189
  * #chunks = 13

Total running time = 385 seconds

Cutting the regions from the resulting file yields:

cut -f 3 chr6.chunks.tsv

chr6:72241-19599242
chr6:18577569-33141881
chr6:32070923-43280008
chr6:42255897-55441622
chr6:54441197-67831170
chr6:66796360-79162564
chr6:78130532-91652103
chr6:90652274-104577123
chr6:103566858-118885012
chr6:117873097-132988895
chr6:131821158-147064995
chr6:146055131-157741653
chr6:156680356-170744989

Copied to gs://fc-secure-8e5a6fd7-16ae-4796-80ed-8f0463af5ff1/scratch/slee/chr6.chunks.region.tsv.

Did this on a VM manually, we should put it into the workflow before kicking off WG. Probably doesn't make much of a difference since our panel is probably plenty dense, but it's the sort of thing you're supposed to do, so we might as well.

Kicked off the PhasedPanelEvaluationFromHiPhase workflow leaving out 40 HPRC samples here: https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/913b51d5-df0b-4919-a6cb-6db485d6c1b9

If the numbers look good, we should run InputPhasedPanelEvaluation using the Shapeit4 result (which is on the full panel) here to generate the full KAGE+GLIMPSE panel, rather than the leave-out. Then we can reimpute MAGE and get feedback on whether this reduced chr6 panel with fewer short variants is acceptable for eQTLs. If so, then we can proceed to WG. Alternatively, if the costs start looking more reasonable without HiPhase, then we can go ahead with the unfiltered/windowed panel.

UPDATE: The second shard consistently fails with this scheme, even going up to a very underutilized 96GB (see https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/080bbe5c-ef87-4fd7-955c-62bd1ec3ba21 and https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/bd097085-f637-4579-a787-5720d27ae440). This shard has the most variants at ~140k, although a couple of others have ~120-130k and succeed. Others have ~60k.

Bumping down the min shard size to 5Mb yields:

./GLIMPSE_chunk_static -I HPRCAOU.chr6.filter_and_concat.vcf.gz --region chr6 --window-size 5000000 --buffer-size 500000 -O chr6.chunks.tsv --thread 4

[GLIMPSE] Split chromosomes into chunks
  * Author        : Simone RUBINACCI & Olivier DELANEAU, University of Lausanne
  * Contact       : [email protected] & [email protected]
  * Version       : 1.1.1
  * Run date      : 04/10/2024 - 02:32:49

Files:
  * Input VCF      : [HPRCAOU.chr6.filter_and_concat.vcf.gz]
  * Chromosome     : [chr6]
  * Output file    : [chr6.chunks.tsv]

Parameters:
  * Seed             : 15052011
  * #Threads   : 4
  * Min. Window size : 5000000bp / 1000 variants
  * Min. Buffer size : 500000bp / 100 variants

Reading input files
  * Main      : [HPRCAOU.chr6.filter_and_concat.vcf.gz]
  * #variants = 1009520 (323.05s)

Splitting data into chunks and writting to [chr6.chunks.tsv]
  * Internal window [chr6:72241-170744989] / L=170672749bp / C=1009520
  * Internal window [chr6:72241-78630557] / L=78558317bp / C=504761
  * Internal window [chr6:72241-32641877] / L=32569637bp / C=252381
  * Internal window [chr6:72241-19099133] / L=19026893bp / C=126191
  * Terminal window [0] -buffer:[chr6:72241-8411080] / +buffer:[chr6:72241-8911239] / L=8338840bp / C=63096
  * Internal window [chr6:8411530-19099133] / L=10687604bp / C=63095
  * Terminal window [1] -buffer:[chr6:8411530-13529316] / +buffer:[chr6:7911474-14029318] / L=5117787bp / C=31548
  * Terminal window [2] -buffer:[chr6:13529476-19099133] / +buffer:[chr6:13026927-19599242] / L=5569658bp / C=31547
  * Terminal window [3] -buffer:[chr6:19099170-32641877] / +buffer:[chr6:18577569-33141881] / L=13542708bp / C=126190
  * Internal window [chr6:32641878-78630557] / L=45988680bp / C=252380
  * Internal window [chr6:32641878-54941581] / L=22299704bp / C=126191
  * Terminal window [4] -buffer:[chr6:32641878-42755864] / +buffer:[chr6:32070923-43280008] / L=10113987bp / C=63096
  * Internal window [chr6:42755911-54941581] / L=12185671bp / C=63095
  * Terminal window [5] -buffer:[chr6:42755911-48786404] / +buffer:[chr6:42255897-49306975] / L=6030494bp / C=31548
  * Terminal window [6] -buffer:[chr6:48786493-54941581] / +buffer:[chr6:48259536-55441622] / L=6155089bp / C=31547
  * Internal window [chr6:54941691-78630557] / L=23688867bp / C=126189
  * Internal window [chr6:54941691-67331114] / L=12389424bp / C=63095
  * Terminal window [7] -buffer:[chr6:54941691-61954841] / +buffer:[chr6:54441197-62483380] / L=7013151bp / C=31551
  * Terminal window [8] -buffer:[chr6:61954846-67331114] / +buffer:[chr6:61454804-67831170] / L=5376269bp / C=31544
  * Internal window [chr6:67331412-78630557] / L=11299146bp / C=63094
  * Terminal window [9] -buffer:[chr6:67331412-73228807] / +buffer:[chr6:66796360-73751975] / L=5897396bp / C=31548
  * Terminal window [10] -buffer:[chr6:73229192-78630557] / +buffer:[chr6:72728849-79162564] / L=5401366bp / C=31546
  * Internal window [chr6:78630635-170744989] / L=92114355bp / C=504759
  * Internal window [chr6:78630635-132352709] / L=53722075bp / C=252380
  * Internal window [chr6:78630635-104071527] / L=25440893bp / C=126191
  * Internal window [chr6:78630635-91152068] / L=12521434bp / C=63096
  * Terminal window [11] -buffer:[chr6:78630635-84669766] / +buffer:[chr6:78130532-85169973] / L=6039132bp / C=31550
  * Terminal window [12] -buffer:[chr6:84669877-91152068] / +buffer:[chr6:84168043-91652103] / L=6482192bp / C=31546
  * Internal window [chr6:91152292-104071527] / L=12919236bp / C=63095
  * Terminal window [13] -buffer:[chr6:91152292-97518137] / +buffer:[chr6:90652274-98065574] / L=6365846bp / C=31548
  * Terminal window [14] -buffer:[chr6:97518139-104071527] / +buffer:[chr6:97018004-104577123] / L=6553389bp / C=31547
  * Internal window [chr6:104071528-132352709] / L=28281182bp / C=126189
  * Internal window [chr6:104071528-118372982] / L=14301455bp / C=63095
  * Terminal window [15] -buffer:[chr6:104071528-110454782] / +buffer:[chr6:103566858-110954891] / L=6383255bp / C=31548
  * Terminal window [16] -buffer:[chr6:110454783-118372982] / +buffer:[chr6:109925816-118885012] / L=7918200bp / C=31547
  * Internal window [chr6:118373301-132352709] / L=13979409bp / C=63094
  * Terminal window [17] -buffer:[chr6:118373301-125099407] / +buffer:[chr6:117873097-125599569] / L=6726107bp / C=31548
  * Terminal window [18] -buffer:[chr6:125099514-132352709] / +buffer:[chr6:124599490-132988895] / L=7253196bp / C=31546
  * Internal window [chr6:132352735-170744989] / L=38392255bp / C=252379
  * Internal window [chr6:132352735-157241583] / L=24888849bp / C=126190
  * Internal window [chr6:132352735-146564675] / L=14211941bp / C=63098
  * Terminal window [19] -buffer:[chr6:132352735-138960888] / +buffer:[chr6:131821158-139482591] / L=6608154bp / C=31550
  * Terminal window [20] -buffer:[chr6:138960971-146564675] / +buffer:[chr6:138460884-147064995] / L=7603705bp / C=31548
  * Internal window [chr6:146564679-157241583] / L=10676905bp / C=63092
  * Terminal window [21] -buffer:[chr6:146564679-151991440] / +buffer:[chr6:146055131-152491564] / L=5426762bp / C=31547
  * Terminal window [22] -buffer:[chr6:151991503-157241583] / +buffer:[chr6:151491219-157741653] / L=5250081bp / C=31545
  * Terminal window [23] -buffer:[chr6:157241819-170744989] / +buffer:[chr6:156680356-170744989] / L=13503171bp / C=126189
  * #chunks = 24

Total running time = 323 seconds

chr6:72241-8911239
chr6:7911474-14029318
chr6:13026927-19599242
chr6:18577569-33141881
chr6:32070923-43280008
chr6:42255897-49306975
chr6:48259536-55441622
chr6:54441197-62483380
chr6:61454804-67831170
chr6:66796360-73751975
chr6:72728849-79162564
chr6:78130532-85169973
chr6:84168043-91652103
chr6:90652274-98065574
chr6:97018004-104577123
chr6:103566858-110954891
chr6:109925816-118885012
chr6:117873097-125599569
chr6:124599490-132988895
chr6:131821158-139482591
chr6:138460884-147064995
chr6:146055131-152491564
chr6:151491219-157741653
chr6:156680356-170744989

UPDATE: Well, that didn't help, since variant count didn't drop in that failing shard---most likely this is HLA. Just went back to the original 13 shards and cranked up to 128GB. It might be a good idea to have a strategy for tuning runtimes to a particular sharding scheme at some point, if OOM retry continues to be so unreliable.

Completed:
PhasedPanelEvaluationFromHiPhase (only failed input short evaluation): https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/3063a8e4-93a8-424a-81c4-520699d4c941
InputPhasedPanelEvaluation continuation: https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/73371f89-5a5b-4b5b-bafd-57f9ab5de853

@samuelklee
Copy link
Collaborator Author

samuelklee commented Oct 3, 2024

Once these runs complete, I think we have the following cohorts for comparison on 40 HPRC samples, which should provide a basis for plots of summary statistics and accuracy vs. dipcall:

  • LR filter+concat (unimputed)
  • LR panel (imputed)
  • LR filtered/windowed filter+concat (unimputed)
  • LR filtered/windowed panel (imputed)
  • SR vs. full (we don't need to show this in the paper, since that's the whole point of doing leave-out)
  • SR vs. leave-out-40
  • SR vs. filtered/windowed leave-out-40
  • SR GATK-SV

@samuelklee
Copy link
Collaborator Author

samuelklee commented Oct 7, 2024

Notes for resolving GATK-SV symbolic alleles:

python ~/Downloads/resolve_light.py 1KGP_3202.gatksv_svtools_novelins.freeze_V3.wAF.chr6.vcf.gz /mnt/4AB658D7B658C4DB/working/ref/Homo_sapiens_assembly38.fasta | bgzip -c > 1KGP_3202.gatksv_svtools_novelins.freeze_V3.wAF.chr6.resolve_light_insseq.vcf.gz
bbcftools query -f '%CHROM\t%POS\t%ID\t%REFSEQ\t%INSSEQ\n' 1KGP_3202.gatksv_svtools_novelins.freeze_V3.wAF.chr6.resolve_light_insseq.vcf.gz  | bgzip -c > 1KGP_3202.gatksv_svtools_novelins.freeze_V3.wAF.chr6.resolve_light_insseq.annot.tsv.gz
tabix -s1 -b2 -e2 1KGP_3202.gatksv_svtools_novelins.freeze_V3.wAF.chr6.resolve_light_insseq.annot.tsv.gz
echo "##INFO=<ID=INSSEQ,Number=.,Type=String,Description=\"Sequence of insertion\">" > header.txt
echo "##INFO=<ID=REFSEQ,Number=.,Type=String,Description="Original reference sequence">" >> header.txt
bcftools annotate 1kGP_high_coverage_Illumina.chr6.filtered.SNV_INDEL_SV_phased_panel.vcf.gz -a 1KGP_3202.gatksv_svtools_novelins.freeze_V3.wAF.chr6.resolve_light_insseq.annot.tsv.gz -c CHROM,POS,~ID,REFSEQ,INSSEQ -h header.txt | bgzip -c > 1kGP_high_coverage_Illumina.chr6.filtered.SNV_INDEL_SV_phased_panel.annot.vcf.gz
python ~/Downloads/copy_seq.py 1kGP_high_coverage_Illumina.chr6.filtered.SNV_INDEL_SV_phased_panel.annot.vcf.gz /mnt/4AB658D7B658C4DB/working/ref/Homo_sapiens_assembly38.fasta | bgzip -c > 1kGP_high_coverage_Illumina.chr6.filtered.SNV_INDEL_SV_phased_panel.resolved.vcf.gz

Preliminary version of plot revealed only short-variant alleles are resolved, yielding ~10% SV recall: https://docs.google.com/presentation/d/1z5TFuvydlCBbskWCsGzLnloHJKkiF1U9AIr8RfxM1TQ/edit?usp=sharing

@samuelklee
Copy link
Collaborator Author

samuelklee commented Oct 7, 2024

@samuelklee
Copy link
Collaborator Author

samuelklee commented Oct 9, 2024

@hangsuUNC
Copy link
Collaborator

HPRC + AoU 1074 sample whole genome hiphased vcfs have been completed and pasted to the terra work table. The cost varies from 0.1-0.9$ per sample. Currently merging it to a multi-sample filtered vcfs.

@samuelklee
Copy link
Collaborator Author

samuelklee commented Oct 24, 2024

Will come back and write a status update now that physical+statistical phasing is complete, but see status slide from today.

Hit a snafu when building the KAGE panel. The KAGE obgraph package didn't like a site at chr7:57336990 and threw an exception:

chr7	57336989	.	T	TG	.	PASS	ID=chr7-57336989-allele618843-2
chr7	57336990	.	GT	GG	.	PASS	ID=chr7-57336991-allele618845-1

Surprised this is the only instance in ~20M variants. Not quite sure why it's problematic, but it may have something to do with the fact that it's not atomized---in which case, not quite sure why the upstream bcftools norm -a in PanGeniePanelCreation didn't take care of it.

Will manually remove and rerun for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants