Issues on running BioTraDIS on multiple contigs #130
Hi,

Thanks for the detailed report. So, to answer these:

re: 1, I suspect this is because tradis_gene_insert_sites expects an embl file with a single replicon annotation in it. Could you try splitting your embl file into one file per replicon and processing each separately with the appropriate plot file, to see if this resolves the issue?

re: 2, I think this is a genuine bug, or at least an unimplemented feature. It's fairly unusual to have a replicon sequence split in the middle of a gene annotation, and it looks like the code just doesn't consider this case when calculating the gene length, leading to a nonsensical result. Assuming the above suggestion fixes your problem 1, if you could post an example case with data for one of the plasmids where this happens, I'll try to put in a fix for this.

In the meantime, I don't think this should affect the rest of the result table, so as long as the tagA gene isn't your primary interest you can probably just ignore/remove this row and carry on with downstream analysis.
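For anyone wanting to try the per-replicon split suggested above, here is a minimal sketch in plain Python. It is not part of BioTraDIS. It assumes the file is a standard multi-record EMBL flat file in which every record ends with a `//` line and starts with an `ID` line whose first `;`-separated field is the accession. The function names and the output file naming are hypothetical.

```python
from pathlib import Path


def split_embl_records(text):
    """Return each '//'-terminated record found in multi-record EMBL text."""
    records, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.startswith("//"):  # EMBL end-of-record marker
            records.append("".join(current))
            current = []
    if any(part.strip() for part in current):  # tolerate a missing final '//'
        records.append("".join(current))
    return records


def record_accession(record, fallback):
    """Use the first ';'-separated field of the ID line as the record name."""
    for line in record.splitlines():
        if line.startswith("ID   "):
            name = line[5:].split(";")[0].strip()
            return name or fallback
    return fallback


def write_per_replicon(embl_path, out_dir):
    """Write one .embl file per record (file naming is an assumption)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    records = split_embl_records(Path(embl_path).read_text())
    for index, record in enumerate(records, start=1):
        name = record_accession(record, "record_{}".format(index))
        out_path = out_dir / (name + ".embl")
        out_path.write_text(record)
        written.append(out_path)
    return written
```

Each resulting file can then be passed to tradis_gene_insert_sites together with the plot file for the same replicon.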
Hi,
We've been running BioTraDIS on a .embl file which contains a chromosome and two plasmids (GCA_000008865.2.txt). On our data the command bacteria_tradis works fine and, as would be expected, produces three .insert_site_plot.gz files, one for each of the chromosome and the two plasmids (SK-1-1-2-5.ENA_AB011548_AB011548.2.insert_site_plot.gz, SK-1-1-2-5.ENA_AB011549_AB011549.2.insert_site_plot.gz, SK-1-1-2-5.ENA_BA000007_BA000007.3.insert_site_plot.gz). It's important to note that the number of lines in each of these files corresponds to the length of the particular contig in bp. However, when we run tradis_gene_insert_sites on each insert_site_plot file, we encounter a couple of issues in the tradis_gene_insert_sites.csv files generated.
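The line-count check mentioned above can be reproduced with a short script. This is only a sketch: it assumes each plot file is gzip-compressed text with one line per base position, and the helper names are hypothetical.

```python
import gzip


def count_plot_lines(plot_path):
    """Count the lines in a gzipped insert-site plot file."""
    with gzip.open(plot_path, "rt") as handle:
        return sum(1 for _ in handle)


def plot_matches_contig(plot_path, contig_length_bp):
    """True if the plot has exactly one line per base of the contig."""
    return count_plot_lines(plot_path) == contig_length_bp
```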
Examples of the files generated by tradis_gene_insert_sites are here (trimmed_1-1-2-5.fq.ENA_AB011549_AB011549.2.tradis_gene_insert_sites.csv, trimmed_1-1-2-5.fq.ENA_BA000007_BA000007.3.tradis_gene_insert_sites.csv, trimmed_1-1-2-5.fq.ENA_AB011548_AB011548.2.tradis_gene_insert_sites.csv), where BA000007 is the chromosome and AB011548 and AB011549 are the plasmids.
Issue 1
The first issue is that annotations which do not belong to the chromosome or plasmid being run still have data such as read counts and insertion indices generated for them. This is particularly noticeable in our plasmid files (denoted by AB), where we see read counts generated for genes which are present on the chromosome and the other plasmid, which shouldn't be happening. My guess is that the annotations for every contig are being overlaid on the insert_site_plot file, creating entries for each contig up to the length of the insert_site_plot file. We assume we should ignore the annotations for the other contigs and set these values back to 0. Is there any way to prevent this?
Issue 2
Secondly, we've noted another issue with annotations where the genomic start and genomic end of a feature span the beginning and end of a DNA sequence. An example of this can be found here in the gene tagA.
Here tagA spans the start of the plasmid sequence and really should have a gene length of approximately 2762 bp, but a negative gene length is generated instead. In addition, because of this, no data is entered for the gene in question. Is there any way to solve this?
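To illustrate the length we would expect for a feature like this, here is a minimal sketch. It is not BioTraDIS code. It assumes a circular replicon, 1-based inclusive coordinates, and that a feature crossing the origin is reported with a start greater than its end. The function name is hypothetical.

```python
def circular_feature_length(start, end, replicon_length):
    """Length in bp of a feature on a circular replicon (1-based, inclusive).

    Assumption: start > end means the feature wraps past the last base
    and continues from position 1.
    """
    if not (1 <= start <= replicon_length and 1 <= end <= replicon_length):
        raise ValueError("coordinates must lie within the replicon")
    if start <= end:
        return end - start + 1
    # Tail of the sequence (start..replicon_length) plus head (1..end).
    return (replicon_length - start + 1) + end
```

With made-up numbers (not tagA's actual coordinates), a feature from 9000 to 500 on a 10000 bp replicon has a length of 1001 + 500 = 1501 bp, whereas the naive end - start + 1 gives a negative value.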
Thanks for your help
Mat