Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pangenome: tree, gene presence/absence with gggenome like representation #202

Open
tahsinkhann opened this issue Oct 12, 2024 · 3 comments

Comments

@tahsinkhann
Copy link

tahsinkhann commented Oct 12, 2024

Hi, I want to a create pangenome plot for bacteriophages with phylogenetic tree (left side) + gene presence/absence (right side) + gggenome plot that has all ORFs/CDS (in top of presence/absence plot)

I have a tree in nwk format, gene_presence_absence.csv from roary/panaroo. All I need is to combine these with a gggenome like pangenome representation in top of gene_presence_absence matrix.

Lastly, thanks for this amaizing gggenomes tool, it benifited me highly.

@thackl
Copy link
Owner

thackl commented Oct 13, 2024

The easiest way is probably to use https://yulab-smu.top/pkgdocs/aplot.html. Check out section 4.2.2 Aligning plots with a tree. Does that help?

@tahsinkhann
Copy link
Author

Thanks for the response. The link was useful, learned new concepts. However, is it possible to generate something like the draft figure I attached.
tree_gggenomes_pangenome

in the figure
a) tree
b) gene_presence_absence.csv from roary/panaroo
c) all CDS in a pangenome generated by gggenomes (???)

basically I am thinking of an alternative of the attached circular pangenome figure (I don't know how the authors made this figure, nothing was mentioned about the methods in theri paper)
image

Thanks

@thackl
Copy link
Owner

thackl commented Oct 14, 2024

OK, here's some code that I think goes into the direction that you are looking for, and obviously with room to make it prettier. That said, I think there some conceptual limitations here. All you are showing is relative to the picked reference genome, i.e. it does not show any genes that are not present in the reference and it does not account for rearrangements...

library(tidyverse)
library(ggtree)
library(gggenomes)
library(aplot)

# get a random tree
library(ape)
set.seed(2017)
tree <- rtree(4, tip.label = LETTERS[1:4])
p_tree <- ggtree(tree) + geom_tiplab()
p_tree

image

# get some random gene coords
genes <- filter(emale_genes, seq_id == "Cflag_017B") |> 
  transmute(seq_id = "A", start, end, feat_id, note=Note)

# reference genes - NOTE: any gene that is not present in the reference will not be shown in the plot!
p_genes <- gggenomes(genes, infer_start = 0) +
  geom_gene(aes(fill=note)) + geom_bin_label()
p_genes

image

# get random presence/absence for all reference genes and all genomes
presence <- map(LETTERS[1:4], function(id){
  tibble(
    seq_id = id,
    feat_id = genes$feat_id,
    presence = sample(0:1, length(genes$feat_id), replace = T, prob=c(0.2, 0.8)))
}) |> bind_rows()

ggplot(presence) + geom_raster(aes(feat_id, seq_id, fill=as.logical(presence)))

image

# we need to merge p/a with gene start/end and establish the same sorting order as in tree
tree_label_order <- p_tree$data |> filter(isTip) |> arrange(, y) |> pull(label)
genes_presence <- presence |> 
  left_join(select(genes, -seq_id)) |>  # ignore reference seq_id ("A") to create records for "A-D"
  mutate(
    width = width(start, end),
    seq_id = factor(seq_id, levels=tree_label_order)) # reorder based on tree

# use geom_tile over geom_raster as it allows boxes of different sizes
p_presence <- ggplot(genes_presence) +
  geom_tile(aes(width = width, height = .9, x = (start+end)/2, y = seq_id, fill=as.logical(presence)))
p_presence

image

# combine plots
p_presence |> insert_left(p_tree, width=.2) |> insert_top(p_genes, height=.2)

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants