fail to predict inserted contamination #11

felipevzps · 2020-07-31T09:02:08Z

Hello!

I built a synthetic genome to check the outputs, and conterminator failed to predict the inserted contaminants.

Info:
Version: 1.c74b5
Organisms in this synthetic genome: Saccharum hybrid cultivar SP80-3280, Klebsiella pneumoniae and Acinetobacter baumannii.

History
I inserted the complete A. baumannii and K. pneumoniae genomes into the sugarcane genome and created a kraken mapping file. When I checked the mapping file, I could see the taxonomy IDs of the inserted sequences (A. baumannii ID = 470, K. pneumoniae ID = 573 and SP80-3280 ID = 193079).

Then I ran conterminator with the following command:
conterminator dna synthetic_genome.fasta kraken_mapping_file.txt synthetic_genome_conterminator tmp

Results
The synthetic_genome_conterminator_conterm_prediction file is empty.
The synthetic_genome_conterminator_all file contains no information about the inserted contaminants.

Data
synthetic_genome_conterminator_all.txt
kraken_mapping_file.txt
The genome file is too big to attach, and the conterm_prediction file is empty.

Problem
My goal is to detect contamination in the sugarcane genome. Am I using conterminator incorrectly, or is conterminator failing to predict contamination?

martin-steinegger · 2020-08-01T05:35:28Z

We currently predict contamination only for short sequences of length < 20 kb. These short sequences can be contigs within scaffolds or standalone sequences. I assume you have just one long sequence?
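To check whether an input falls under this limit, the lengths can be inspected before running the tool. The following is a minimal sketch, not part of conterminator: it assumes that "in scaffolds" refers to the stretches between runs of N (the "corrected contig length" described later in this thread), and the names `read_fasta`, `contig_lengths`, `summarize` and `THRESHOLD` are made up for illustration.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple

# 20 kb limit quoted in the comment above; the exact comparison used by
# conterminator is an assumption here.
THRESHOLD = 20_000


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def contig_lengths(seq: str) -> List[int]:
    """Lengths of the N-free stretches of a sequence (assumed 'contigs')."""
    return [len(part) for part in re.split(r"[Nn]+", seq) if part]


def summarize(lines: Iterable[str]) -> Dict[str, dict]:
    """Per sequence: total length, contig lengths, and whether any contig is short."""
    summary = {}
    for name, seq in read_fasta(lines):
        lengths = contig_lengths(seq)
        summary[name] = {
            "total": len(seq),
            "contigs": lengths,
            "has_short_contig": any(n < THRESHOLD for n in lengths),
        }
    return summary
```

For example, `summarize(open("synthetic_genome.fasta"))` would show whether the synthetic genome consists only of stretches of 20 kb or more, in which case an empty prediction file would be consistent with the limit described above.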

donovan-h-parks · 2021-11-09T00:53:51Z

@martin-steinegger Is there a way to indicate that contamination should be reported for longer sequences? I'm trying to reproduce the example between C. elegans and E. coli in your manuscript.

martin-steinegger · 2021-11-09T09:29:02Z

The _all report should contain all local alignments with cross-kingdom hits (--kingdom). It can be used to filter for longer sequences. Can you find the C. elegans and E. coli hits in it? The format is as follows:

1.) Numeric identifier
2.) Sequence identifier
3.) Alignment start
4.) Alignment end
5.) Corrected contig length (length between flanking Ns)
6.) Total sequence length
7.) Kingdom (default: 0: Bacteria&Archaea, 1: Fungi, 2: Metazoa, 3: Viridiplantae, 4: Other Eukaryotes)
8.) Species name
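The column layout above can be used to pull hits on longer sequences out of the _all report. This is a rough sketch rather than anything shipped with conterminator: it assumes the report is tab-separated with exactly these eight columns in this order and the default kingdom numbering, and `Hit`, `parse_all_report` and `long_contig_hits` are hypothetical names.

```python
from typing import Iterable, List, NamedTuple

# Default kingdom codes as listed above; a custom --kingdom setting would
# change their meaning.
KINGDOMS = {
    0: "Bacteria&Archaea",
    1: "Fungi",
    2: "Metazoa",
    3: "Viridiplantae",
    4: "Other Eukaryotes",
}


class Hit(NamedTuple):
    numeric_id: int
    seq_id: str
    start: int
    end: int
    contig_len: int  # corrected contig length (between flanking Ns)
    total_len: int
    kingdom: int
    species: str


def parse_all_report(lines: Iterable[str]) -> List[Hit]:
    """Parse _all report lines, assuming eight tab-separated columns."""
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        f = line.split("\t", 7)
        if len(f) != 8:
            raise ValueError(f"expected 8 tab-separated columns: {line!r}")
        hits.append(Hit(int(f[0]), f[1], int(f[2]), int(f[3]),
                        int(f[4]), int(f[5]), int(f[6]), f[7]))
    return hits


def long_contig_hits(hits: Iterable[Hit], min_len: int = 20_000) -> List[Hit]:
    """Keep hits whose corrected contig length is at least min_len."""
    return [h for h in hits if h.contig_len >= min_len]
```

Calling `long_contig_hits(parse_all_report(open("synthetic_genome_conterminator_all.txt")))` would list the alignments that sit on contigs of 20 kb or more, and `KINGDOMS[h.kingdom]` translates the kingdom code of each hit.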

donovan-h-parks · 2021-11-09T15:03:08Z

There are indeed expected hits in the _all file. Is it possible to make the 20 kb filtering criterion an exposed parameter? This would also help document to users that such a criterion exists.

martin-steinegger · 2021-11-09T15:42:24Z

Yes, I agree. I had this on my todo list for quite some time. :(
But currently I am quite flooded with work.
