generate compressed vcf outputs #243

Open
ramprasadn opened this issue Nov 4, 2022 · 19 comments
Assignees
Labels
enhancement Improvement for existing functionality

Comments

@ramprasadn
Collaborator

Description of feature

Some of the annotation programs used in the pipeline only generate vcf outputs. It would be good to make changes to those modules so that they can also generate compressed vcf outputs.

@ramprasadn ramprasadn added the enhancement Improvement for existing functionality label Nov 4, 2022
@ramprasadn ramprasadn self-assigned this Nov 4, 2022
@asp8200 asp8200 assigned asp8200 and unassigned ramprasadn Jun 29, 2023
@asp8200
Contributor

asp8200 commented Jun 29, 2023

I ran this test

nextflow run main.nf -profile test,docker --outdir results

The only uncompressed vcf-file I could find in the outdir was results/annotate_mt/justhusky_vep_vcfanno_hmtnote_mt_annotated.vcf.

It turns out that currently both the uncompressed and compressed versions of the above-mentioned vcf-file are getting published. I'll just disable the publishing of the uncompressed vcf-file.

Are there any other uncompressed vcf-files being published?

@asp8200
Contributor

asp8200 commented Jun 29, 2023

I found these files which are a bit large and can be compressed:

results/qc_bam/*.d4
results/qc_bam/*.wig
results/qc_bam/*.bw

Should I try to have them compressed?

@ramprasadn
Collaborator Author

As far as I know, compressed versions of these files cannot be used by downstream tools, so if users are actively using them, they'd want them uncompressed. These files can always be compressed outside of our pipeline for archiving, so I'd leave this as it is.

@asp8200
Contributor

asp8200 commented Jun 30, 2023

Okay, well, then I can't find any output-files from the raredisease-pipeline that need to be compressed. Do you know of any?

@ramprasadn
Collaborator Author

ramprasadn commented Jun 30, 2023

Not really. I haven't checked, but do you know if tools like vcfanno and svdb query are capable of producing compressed vcf files as outputs? If the tools can't, perhaps we can update the modules with an option to run bgzip on the output so they can produce compressed files? I am thinking a boolean flag like this. What do you think?

@asp8200
Contributor

asp8200 commented Jun 30, 2023

I think that neither vcfanno nor svdb-query can output compressed VCF-files.

Brent of vcfanno suggested just piping to a compressor tool:
brentp/vcfanno#66
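
A rough shell sketch of that piping idea, combined with the boolean flag mentioned earlier in the thread. This is only an illustration: `write_vcf` and its arguments are hypothetical names, not part of vcfanno or of any nf-core module, and the sketch assumes `bgzip` (from htslib) is on `PATH`.

```shell
# Hypothetical helper: write a VCF stream from stdin to a file,
# optionally compressing it with bgzip (assumed to be on PATH).
# Usage: <annotator writing VCF to stdout> | write_vcf <true|false> <output-prefix>
write_vcf() {
    compress=$1
    prefix=$2
    if [ "$compress" = "true" ]; then
        # bgzip -c reads stdin and writes BGZF-compressed data to stdout
        bgzip -c > "${prefix}.vcf.gz"
    else
        # Pass the stream through unchanged
        cat > "${prefix}.vcf"
    fi
}
```

Since vcfanno writes its annotated VCF to stdout, something like `vcfanno conf.toml input.vcf | write_vcf true annotated` (file names made up here) would produce `annotated.vcf.gz` without an uncompressed intermediate file.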

@ramprasadn
Collaborator Author

Nice! Perhaps we can modify vcfanno in nf-core/modules (so it has the option to generate compressed output) and then update the pipeline to use that version?

@asp8200
Contributor

asp8200 commented Jul 4, 2023

I'm not sure that is the right way to go. (I get the impression that nf-core likes modules to do just one thing, but I could be wrong.)

As far as I can tell, what you are doing now is fine:

VCFANNO_MT(ch_in_vcfanno, ch_vcfanno_toml, [], ch_vcfanno_resources)
// HMTNOTE ANNOTATE
HMTNOTE_ANNOTATE(VCFANNO_MT.out.vcf)
ZIP_TABIX_HMTNOTE(HMTNOTE_ANNOTATE.out.vcf)

No VCF-file is published from VCFANNO_MT; instead, it is sent to HMTNOTE_ANNOTATE for annotation, and then the annotated VCF-file is sent to ZIP_TABIX_HMTNOTE, where it gets bgzipped and a corresponding TBI-file is created. Both the bgzipped annotated VCF-file and the TBI-file then get published.
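
For readers unfamiliar with that last step, a "bgzip, then index" stage boils down to roughly the following on the command line. This is a sketch under assumptions, not the actual module script: `zip_and_index` is a made-up name, and both `bgzip` and `tabix` (from htslib) are assumed to be on `PATH`.

```shell
# Sketch of a "compress, then index" step for a single VCF file.
# Assumes bgzip and tabix (both from htslib) are available on PATH.
zip_and_index() {
    vcf=$1
    # Write a BGZF-compressed copy next to the input, keeping the original
    bgzip -c "$vcf" > "${vcf}.gz" || return 1
    # Build the .tbi index using tabix's VCF preset
    tabix -p vcf "${vcf}.gz"
}
```

Calling `zip_and_index annotated.vcf` (hypothetical file name) would leave `annotated.vcf.gz` and `annotated.vcf.gz.tbi` next to the input.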

@ramprasadn
Collaborator Author

Hmmm.. I am not certain which way the community swings when it comes to adding functionalities like generating compressed outputs in a module. Perhaps we should bring this up on slack 😄

That's true, but I was thinking that the work directory will get bloated with the uncompressed vcf. I do not have experience with cloud services, but maybe this will result in increased costs for the user? These files can easily take up a couple of Gigs, and that can add up over time.

@asp8200
Contributor

asp8200 commented Jul 6, 2023

> Hmmm.. I am not certain which way the community swings when it comes to adding functionalities like generating compressed outputs in a module. Perhaps we should bring this up on slack 😄
>
> That's true, but I was thinking that the work directory will get bloated with the uncompressed vcf. I do not have experience with cloud services, but maybe this will result in increased costs for the user? These files can easily take up a couple of Gigs, and that can add up over time.

I got the impression that the idea is to delete the work-folder after the successful completion of the pipeline. Still, I guess one wouldn't want the work-folder to be unnecessarily large. Let's see what @maxulysse has to say about this 😊

@maxulysse
Member

I'm happy with adding gzip in the module for compression.
We are trying to set up gold standards, and I believe that reducing the data footprint is a good idea in any case.

@asp8200
Contributor

asp8200 commented Jul 6, 2023

Doing some experiments on this. It seems that bgzip isn't available in the container that is used for hmtnote, but gzip is. Is it okay to use gzip or does it have to be bgzip?

@jemten
Collaborator

jemten commented Jul 6, 2023

If bgzip isn't available in the container, I would leave it as is. It needs to be bgzipped in order to be indexed and potentially merged back with the SNV vcf.
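
Some background on why gzip is not a drop-in replacement: bgzip output is a valid gzip stream, but it is written as a series of BGZF blocks, each carrying a `BC` subfield in the gzip extra field, and that block structure is what tabix indexes. Plain `gzip` output lacks it. Below is a small sketch for telling the two apart by inspecting the first 16 bytes of a file. `is_bgzf` is a hypothetical helper name, and the check is deliberately simplified (it expects the flag byte to be exactly `0x04` and the `BC` subfield to come first, which is the layout described in the BGZF section of the SAM specification).

```shell
# Hypothetical helper: succeed (exit 0) if the file starts with a BGZF block header.
# Simplified check: gzip magic 1f 8b, deflate (08), FEXTRA flag only (04),
# and the 'B','C' (42 43) subfield identifier at byte offsets 12-13.
is_bgzf() {
    # Dump the first 16 bytes as one contiguous lowercase hex string
    hdr=$(od -An -tx1 -N 16 "$1" | tr -d ' \n')
    case "$hdr" in
        # 8 hex chars of magic/method/flags, 16 hex chars skipped (bytes 4-11), then "BC"
        1f8b0804????????????????4243*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A possible use would be a guard such as `is_bgzf file.vcf.gz || echo "not bgzipped, cannot be indexed"` before attempting to index.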

@maxulysse
Member

What about adding tabix as a dependency in the container?

But yeah, if you need to merge it back, you might not want to compress it

@ramprasadn
Collaborator Author

Do you mean we create a mulled container?

@maxulysse
Member

Yeah, that's what I meant if we want to add this functionality.

@ramprasadn
Collaborator Author

I second that idea 👍🏻

@asp8200
Contributor

asp8200 commented Jul 6, 2023

Aren't mulled containers causing problems and frustration from time to time? Not sure it is worthwhile.

@jemten
Collaborator

jemten commented Jul 6, 2023

One could look into adding it to the standard conda recipe of vcfanno to get it into the biocontainer. However, it feels a little like we would be hijacking that conda recipe for our own needs.
