Issues on running BioTraDIS on multiple contigs #130

Open
Madhuskey1993 opened this issue Nov 30, 2022 · 1 comment

Comments

@Madhuskey1993

Hi,

We've been running BioTraDIS on an .embl file that contains a chromosome and two plasmids (GCA_000008865.2.txt). On our data the bacteria_tradis command works fine and, as expected, produces three .insert_site_plot.gz files, one each for the chromosome and the two plasmids (SK-1-1-2-5.ENA_AB011548_AB011548.2.insert_site_plot.gz, SK-1-1-2-5.ENA_AB011549_AB011549.2.insert_site_plot.gz, SK-1-1-2-5.ENA_BA000007_BA000007.3.insert_site_plot.gz). It's important to note that the number of lines in each of these files matches the length of the corresponding contig in bp. However, when we run tradis_gene_insert_sites on each insert_site_plot file, we start to encounter a couple of issues in the generated tradis_gene_insert_sites.csv files.

Examples of the files generated by tradis_gene_insert_sites are here (trimmed_1-1-2-5.fq.ENA_AB011549_AB011549.2.tradis_gene_insert_sites.csv, trimmed_1-1-2-5.fq.ENA_BA000007_BA000007.3.tradis_gene_insert_sites.csv, trimmed_1-1-2-5.fq.ENA_AB011548_AB011548.2.tradis_gene_insert_sites.csv), where BA000007 is the chromosome and AB011548 and AB011549 are the plasmids.

Issue 1
The first issue is that read counts and insertion indices are generated for some annotations that do not belong to the chromosome or plasmid being run. This is particularly noticeable in our plasmid files (denoted by AB), where we see read counts for genes that are present on the chromosome and on the other plasmid, which shouldn't happen. My guess is that the annotations for every contig are being overlaid on the insert_site_plot file, creating entries for each contig up to the length of the insert_site_plot file. Our plan is to ignore the annotations for the other contigs and set their values back to 0. Is there any way to prevent this?
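
As a stopgap for the workaround described above, the rows belonging to other contigs could be dropped after the fact. The sketch below is only an illustration: `rows_for_contig` is a hypothetical helper, it assumes that every `locus_tag` starts with the contig accession followed by an underscore (as in the rows shown in this issue), and the tab delimiter is an assumption that should be checked against the actual file.

```python
import csv
import io


def rows_for_contig(table_text, accession, delimiter="\t"):
    """Keep only the rows whose locus_tag begins with `accession` + "_".

    Assumptions (not confirmed by BioTraDIS itself): the table has a header
    row with a `locus_tag` column, and locus tags are prefixed with the
    accession of the contig they were annotated on.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter=delimiter)
    prefix = accession + "_"
    return [row for row in reader if row["locus_tag"].startswith(prefix)]
```

Filtering this way discards the foreign rows rather than resetting them to 0; either choice only hides the symptom and does not change how the table is generated.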

Issue 2

Secondly, we've noticed another issue with annotations where the genomic start and genomic end of a feature span the beginning and end of a DNA sequence. An example of this can be found in the gene tagA.

| locus_tag | gene_name | ncrna | start | end | strand | read_count | ins_index | gene_length | ins_count | fcn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AB011549_1_92527_2502 | tagA | 0 | 92527 | 2502 | 1 | 0 | 0 | -90024 | 0 | ToxR-regulated lipoprotein |
| AB011549_1_2589_3464 | etpC | 0 | 2589 | 3464 | 1 | 2954 | 0.277397 | 876 | 243 | Type II secretion pathway related protein |
| AB011549_1_3675_5432 | etpD | 0 | 3675 | 5432 | 1 | 7430 | 0.261092 | 1758 | 459 | Type II secretion pathway related protein |

Here tagA spans the start of the plasmid sequence and really should have a gene length of approximately 2762 bp; however, a negative gene length is generated instead. In addition, because of this, no data are recorded for the gene in question. Is there any way to solve this?

Thanks for your help

Mat

@lbarquist
Contributor

Hi,

Thanks for the detailed report. So, to answer these:

re: 1, I suspect this is because tradis_gene_insert_sites expects an embl file with a single replicon annotation in it. Could you try splitting your embl file into one for each replicon and process these separately with the appropriate plot files to see if this resolves the issue?
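
For the splitting step, a minimal sketch in plain Python might look like the following. It relies on the EMBL flat-file convention that each record starts with an `ID` line and ends with a line containing only `//`. The function names and the `<ID>.embl` output naming are hypothetical and not part of BioTraDIS.

```python
from pathlib import Path


def split_embl_records(text):
    """Split multi-record EMBL text into one string per record.

    Each record is taken to end at a line that contains only "//".
    """
    records, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == "//":
            records.append("".join(current))
            current = []
    # Keep any trailing text that lacks a terminator instead of dropping it.
    if any(part.strip() for part in current):
        records.append("".join(current))
    return records


def record_id(record):
    """Return the first token on the record's ID line, minus a trailing ';'."""
    for line in record.splitlines():
        if line.startswith("ID"):
            return line[2:].split()[0].rstrip(";")
    raise ValueError("record has no ID line")


def split_embl_file(path, out_dir):
    """Write each record of `path` to `out_dir/<ID>.embl` (hypothetical naming)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for record in split_embl_records(Path(path).read_text()):
        target = out_dir / f"{record_id(record)}.embl"
        target.write_text(record)
        written.append(target)
    return written
```

Each resulting file can then be passed to tradis_gene_insert_sites together with the plot file for the same replicon.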

re: 2, I think this is a genuine bug, or at least an unimplemented feature -- it's fairly unusual to have a replicon sequence split in the middle of a gene annotation, and it looks like the code just doesn't consider this case in calculating the gene length leading to a nonsensical result. Assuming the above suggestion fixes your problem 1, if you could post an example case with data for one of the plasmids where this happens, I'll try to put in a fix for this. In the meantime, I don't think this should affect the rest of the result table, so as long as the tagA gene isn't your primary interest you can probably just ignore/remove this row and carry on with downstream analysis.
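
To illustrate the length calculation discussed above: the reported value is what `end - start + 1` gives when the end coordinate is smaller than the start. The sketch below contrasts that with a wrap-aware version for a circular replicon. This is an illustration under assumptions, not the BioTraDIS code: the function names are hypothetical, coordinates are taken to be 1-based and inclusive, and `replicon_length` has to be supplied by the caller (for example, the number of lines in the matching plot file).

```python
def naive_gene_length(start, end):
    """Length as end - start + 1, which goes negative when end < start."""
    return end - start + 1


def circular_gene_length(start, end, replicon_length):
    """Length of a feature that may wrap past the origin of a circular replicon."""
    if end >= start:
        return end - start + 1
    # Wrapping feature: tail of the replicon plus the head up to `end`.
    return (replicon_length - start + 1) + end


def wrapped_slices(start, end, replicon_length):
    """0-based (begin, stop) slices covering the feature in a per-base list."""
    if end >= start:
        return [(start - 1, end)]
    return [(start - 1, replicon_length), (0, end)]


# The tagA row from the table above: start 92527, end 2502.
print(naive_gene_length(92527, 2502))  # -90024, matching the reported gene_length
```

With slices like these, the insertion and read counts for a wrapping gene could be summed over both pieces of the per-base plot data rather than over a single start-to-end range.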
