fail to predict inserted contamination #11

Open
felipevzps opened this issue Jul 31, 2020 · 5 comments

Comments

@felipevzps

Hello!

I built a synthetic genome to check the outputs, and conterminator failed to predict the inserted contaminants.

Info:
Version: 1.c74b5
Organisms in this synthetic genome: Saccharum hybrid cultivar SP80-3280, Klebsiella pneumoniae and Acinetobacter baumannii.

History
I inserted the complete A. baumannii and K. pneumoniae genomes into the sugarcane genome and created a Kraken mapping file. When I checked the mapping file, I could see the taxonomy IDs of the inserted organisms (A. baumannii ID = 470, K. pneumoniae ID = 573, and SP80-3280 ID = 193079).

Then I ran conterminator with the following command:
conterminator dna synthetic_genome.fasta kraken_mapping_file.txt synthetic_genome_conterminator tmp

Results
The synthetic_genome_conterminator_conterm_prediction file is empty.
The synthetic_genome_conterminator_all file does not contain any information about the inserted contaminants.

Data
synthetic_genome_conterminator_all.txt
kraken_mapping_file.txt
The genome file is too big to attach, and the conterm_prediction file is empty.

Problem
My objective is to detect contamination in the sugarcane genome. Am I using conterminator incorrectly, or is conterminator failing to predict the contamination?

@martin-steinegger
Collaborator

We currently predict contamination only for short sequences of length < 20 kb. The 20 kb can be in scaffolds or just single sequences. I assume you have just one long sequence?
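
A quick way to check whether an input falls under this limit is to list the length of every record in the FASTA file. The following is only a minimal sketch, not part of conterminator: the 20,000 bp cutoff is an assumed reading of "20kb", and the function names are hypothetical. It also looks at total record length only, not at the length between flanking Ns.

```python
from typing import Dict, Iterable

# Assumed reading of the "20kb" limit mentioned above; not taken from the tool.
THRESHOLD = 20_000


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence id: total length} for FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the id.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def classify(lengths: Dict[str, int]) -> Dict[str, str]:
    """Label each record 'short' (< THRESHOLD) or 'long' (>= THRESHOLD)."""
    return {
        name: ("short" if n < THRESHOLD else "long")
        for name, n in lengths.items()
    }


# Example usage with a real file (file name taken from the command above):
#   with open("synthetic_genome.fasta") as fh:
#       print(classify(fasta_lengths(fh)))
```

If every record comes back as "long", an empty conterm_prediction file would be consistent with the limit described above.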

@donovan-h-parks

@martin-steinegger Is there a way to indicate that contamination should be reported for longer sequences? I'm trying to reproduce the example between C. elegans and E. coli in your ms.

@martin-steinegger
Collaborator

The _all report should contain all the local alignments with cross-kingdom hits (--kingdom). This could be used to filter for longer sequences. Can you find the C. elegans and E. coli hits in it? The format is as follows:

1.) Numeric identifier
2.) Sequence identifier
3.) Alignment start
4.) Alignment end
5.) Corrected contig length (length between flanking Ns)
6.) Total sequence length
7.) Kingdom (default: 0: Bacteria&Archaea, 1: Fungi, 2: Metazoa, 3: Viridiplantae, 4: Other Eukaryotes)
8.) Species name 
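
The filtering suggested above could be sketched roughly as follows. This is not an official conterminator script: it assumes the _all file is tab-separated with the eight columns in the order listed, and the names `Hit`, `parse_all`, and `long_sequence_hits` are hypothetical. The kingdom labels are the defaults from the list above.

```python
from typing import Iterable, List, NamedTuple

# Default kingdom codes as listed above, keyed by the raw column value.
KINGDOMS = {
    "0": "Bacteria&Archaea",
    "1": "Fungi",
    "2": "Metazoa",
    "3": "Viridiplantae",
    "4": "Other Eukaryotes",
}


class Hit(NamedTuple):
    numeric_id: str
    seq_id: str
    start: int
    end: int
    contig_len: int  # corrected contig length (between flanking Ns)
    total_len: int   # total sequence length
    kingdom: str
    species: str


def parse_all(lines: Iterable[str]) -> List[Hit]:
    """Parse lines of an _all report, assuming eight tab-separated columns."""
    hits: List[Hit] = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # skip lines that do not match the assumed layout
        try:
            start, end, contig_len, total_len = (int(x) for x in fields[2:6])
        except ValueError:
            continue  # skip lines whose coordinate/length columns are not integers
        hits.append(Hit(fields[0], fields[1], start, end, contig_len,
                        total_len, fields[6].strip(), fields[7].strip()))
    return hits


def long_sequence_hits(hits: Iterable[Hit], min_total_len: int = 20_000) -> List[Hit]:
    """Keep only alignments that lie on sequences of at least min_total_len bp."""
    return [h for h in hits if h.total_len >= min_total_len]


# Example usage with a real file (file name taken from the report above):
#   with open("synthetic_genome_conterminator_all") as fh:
#       for h in long_sequence_hits(parse_all(fh)):
#           print(h.seq_id, h.start, h.end, KINGDOMS.get(h.kingdom, h.kingdom), h.species)
```

Filtering on column 5 (corrected contig length) instead of column 6 may be more appropriate for scaffolds; which one matches the internal 20 kb criterion is not stated here.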

@donovan-h-parks

There are indeed expected hits in the _all file. Is it possible to make the 20 kb filtering criterion an exposed parameter? This would also help document to users that such a criterion exists.

@martin-steinegger
Collaborator

Yes, I agree. I have had this on my todo list for quite some time. :(
But currently I am quite flooded with work.
