generate compressed vcf outputs #243
Comments
I ran this test
I could find only one uncompressed vcf-file in the outdir. It turns out that currently both the uncompressed and the compressed versions of that vcf-file are getting published. I'll just disable the publishing of the uncompressed vcf-file. Are there any other uncompressed vcf-files being published?
I found these files which are a bit large and can be compressed:
Should I try to have them compressed?
As far as I know, compressed versions of these files cannot be used by downstream tools, so if users are actively using them, they'd want them uncompressed. These files can always be compressed outside of our pipeline for archiving, so I'd leave this as it is.
Okay, well, then I can't find any output files from the raredisease pipeline that need to be compressed. Do you know of any?
Not really. I haven't checked, but do you know if tools like vcfanno and svdb query are capable of producing compressed vcf files as outputs? If the tools can't, perhaps we can update the modules with an option to run bgzip on the output so they can produce compressed files? I am thinking a boolean flag like this. What do you think?
I think that neither vcfanno nor svdb-query can output compressed VCF-files. Brent of vcfanno suggested just piping to a compressor tool:
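The snippet originally posted here is not preserved. As a rough sketch of the piping idea only, and not necessarily what was suggested, the following Python helper streams one command's stdout through a compressor into a file. The helper name `pipe_to_compressor` and the file names in the usage comment are hypothetical; the commands in that comment assume that `vcfanno` writes the annotated VCF to stdout and that `bgzip -c` compresses stdin to stdout.

```python
import subprocess


def pipe_to_compressor(producer_cmd, compressor_cmd, out_path):
    """Stream the stdout of producer_cmd through compressor_cmd into out_path.

    Hypothetical helper for illustration; both arguments are argv lists.
    """
    with open(out_path, "wb") as out:
        producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
        compressor = subprocess.Popen(
            compressor_cmd, stdin=producer.stdout, stdout=out
        )
        # Close our copy of the pipe so the producer is signalled
        # if the compressor exits early.
        producer.stdout.close()
        compressor_rc = compressor.wait()
        producer_rc = producer.wait()
    if producer_rc != 0 or compressor_rc != 0:
        raise RuntimeError(
            f"pipeline failed: producer={producer_rc}, compressor={compressor_rc}"
        )


# Possible usage (file names are placeholders, not taken from the pipeline):
# pipe_to_compressor(
#     ["vcfanno", "conf.toml", "input.vcf.gz"],
#     ["bgzip", "-c"],
#     "annotated.vcf.gz",
# )
```

The uncompressed VCF never touches disk this way, which is the point of piping rather than compressing in a separate step.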
Nice! Perhaps we can modify vcfanno in nf-core/modules (so it has the option to generate compressed output) and then update the pipeline to use that version?
I'm not sure that is the right way to go. (I get the impression that nf-core likes modules to do just one thing, but I could be wrong.) As far as I can tell, what you are doing now is fine: raredisease/subworkflows/local/mitochondria/merge_annotate_MT.nf, lines 109 to 113 in fdfb4a7.
No VCF-file is published from
Hmmm.. I am not certain which way the community swings when it comes to adding functionality like generating compressed outputs in a module. Perhaps we should bring this up on Slack 😄 That's true, but I was thinking that the work directory will get bloated with the uncompressed vcf. I do not have experience with cloud services, but maybe this will result in increased costs for the user? These files can easily take up a couple of gigabytes, and that can add up over time.
I got the impression that the idea is to delete the work folder after the successful completion of the pipeline. Still, I guess one wouldn't want the work folder to be unnecessarily large. Let's see what @maxulysse has to say about this 😊
I'm happy with adding gzip in the module for compression.
Doing some experiments on this. It seems that
If bgzip isn't available in the container, I would leave it as is. It needs to be bgzipped in order to be indexed and potentially merged back with the SNV vcf.
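Since plain gzip and bgzip output can both carry a `.gz` extension, a small check can tell them apart. The sketch below is based on the BGZF layout in the SAM specification: a gzip member with the FEXTRA flag set and an extra subfield tagged `BC` with a 2-byte payload. The function name `is_bgzf` is hypothetical, and it only inspects the first block of the file.

```python
import struct


def is_bgzf(path):
    """Return True if the file at path starts with a BGZF (bgzip) block."""
    with open(path, "rb") as fh:
        header = fh.read(12)
        if len(header) < 12:
            return False
        # Fixed gzip header: ID1, ID2, CM, FLG, MTIME, XFL, OS, XLEN.
        id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack(
            "<BBBBIBBH", header
        )
        # Must be a deflate gzip member with the FEXTRA bit (0x04) set.
        if (id1, id2, cm) != (0x1F, 0x8B, 8) or not flg & 4:
            return False
        extra = fh.read(xlen)
    # Walk the extra subfields looking for the 'BC' tag with a 2-byte payload.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False
```

A file written by plain gzip would normally fail this check, which is consistent with the point above that gzip output is compressed but is not suitable for indexing.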
What about adding tabix as a dependency in the container? But yeah, if you need to merge it back, you might not want to compress it.
Do you mean we create a mulled container?
Yeah, that's what I meant if we want to add this functionality.
I second that idea 👍🏻
Aren't mulled containers causing problems and frustration from time to time? Not sure it is worthwhile.
One could look into adding it to the standard conda recipe of vcfanno to get it into the biocontainer; however, it feels a little like we would be hijacking that conda recipe for our own needs.
Description of feature
Some of the annotation programs used in the pipeline only generate vcf outputs. It would be good to make changes to those modules so that they can also generate compressed vcf outputs.