Reference graphs are used to represent alternative haplotypes (there is ambiguity in case of unknown phasing).
A reference graph gives alternative sequences on a fixed chromosome.
Each node stores information on
- its start and end on the chromosome (reference coordinates)
- a list of reference modifications (alternative sequences)
- a type (see below)
- optionally, a color (for phasing)
- Unknown: no variance information is known (when getting sequence for this, we will assume that the sequence is equal to the reference, but we may treat these regions differently in comparison / reporting results).
- Homref: the sequence is equal to the genome reference
- Alt: There is an alternative sequence which is different from the reference in this region.
Node colors are used to restrict the paths we traverse in the graph for phased comparisons.
- Red: Node is likely phased to maternal (left in VCF) haplotype
- Blue: Node is likely phased to paternal (right in VCF) haplotype
- Black: Node is present on all haplotypes.
- Grey: Phasing of node is unknown, but it is not present on all of them (unphased het).
We allow edges between nodes
The track for hc.vcf.gz
in this screenshot:
... generates this graph
This package includes the hapenum
tool, which can generate dot files for
VCF regions (input must be gzipped+tabixed):
${hap.py}/bin/hapenum -r reference.fasta input.vcf.gz \
--output-dot graph.dot -l chr1:1-1000 && dot -Tsvg graph.dot > graph.svg
Another test to perform on regions of interest in a VCF is to check which haplotype
pairs (assuming a diploid genome, we must find two non-exclusive paths in the graph
that correspond to two haplotypes) can be generated. The dipenum
tool will show
all haplotype sequence pairs that can be created from a particular VCF locus:
${hap.py}/bin/dipenum -r reference.fasta input.vcf.gz \
-l chr1:1-1000
Finally, hap.py comes with a VCF validation tool that will check REF alleles and test if a sensible set of haplotypes can be enumerated for individual VCF loci.
$ bin/validatevcf -r hg19.fa dtest.vcf.gz
[W] REF allele mismatch with reference sequence at chr1:237636278-237636309 VCF: TTTTTTCCACCTTGCTTTTACTTTTTTTTTTA REF: TTTTTCCACCTTGCTTTTACTTTTTTTTTTAA
[W] Found a variant with more 3 > 2 (max) alt alleles. These become no-calls.
This tool can also output (using the -o
command line option) an annotated VCF
file which will give error annotations for each variant call. By default, validatevcf
will write variants in their input representation. Using the -V 1 -W 1 -L 1
command
line options will have vcfvalidate
output variants in the same representation that
hap.py uses internally for "partial credit" comparisons (i.e it will perform left-shifting
and allele realignment).