Add prefixsplit function #168

Open: wants to merge 4 commits into base: master


@Tixii Tixii commented Feb 22, 2021

This adds a prefixsplit command that reads a FASTQ/FASTA file and splits it into separate files based on the prefix of each read, to enable faster sorting of read sets.

Usage: seqtk prefixsplit [options] <output_filename> <in.fa>
Options:
-p INT    length of prefix
-A        force FASTA output (discard quality)
-C        drop comments at the header lines

It will create one file for each prefix of the specified length, e.g.
output_filename.AA.fa
output_filename.AC.fa
....
plus a single file that contains those reads with an N at any position in the prefix:
output_filename.N.fa
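
To illustrate the naming scheme above, here is a minimal sketch of how a read could be mapped to its output file. It is not the code from this PR: prefix_label and bucket_filename are hypothetical helpers, and sending reads shorter than the prefix to the N file, as well as upper-casing lower-case bases, are assumptions.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the PR): writes the bucket label for a read
 * into `label`, which must have room for p + 1 chars. The label is the first
 * p bases in upper case, or "N" if any of them is not A, C, G or T.
 * Assumption: a read shorter than p also goes to the "N" bucket. */
static void prefix_label(const char *seq, int p, char *label)
{
    int i;
    for (i = 0; i < p; ++i) {
        int c = toupper((unsigned char)seq[i]);
        if (c != 'A' && c != 'C' && c != 'G' && c != 'T') {
            /* also reached when seq ends before p bases ('\0') */
            strcpy(label, "N");
            return;
        }
        label[i] = (char)c;
    }
    label[p] = '\0';
}

/* Builds "<output_filename>.<label>.fa", following the file names listed
 * above. Returns the value of snprintf. */
static int bucket_filename(char *buf, size_t size,
                           const char *output_filename, const char *label)
{
    return snprintf(buf, size, "%s.%s.fa", output_filename, label);
}
```

With -p 2, a read starting with ACGT would get the label AC and the file name output_filename.AC.fa, while a read starting with AN would go to output_filename.N.fa.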

Currently only prefix lengths of 1, 2, or 3 are possible (4, 16, or 64 prefix files, plus the N file), as I felt that creating more than 64 files wouldn't be useful.
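
The limit of 64 follows from 4^p possible prefixes for a prefix length p. As a sketch of one way to index the open output files, the prefix can be 2-bit encoded into an array index, with one extra slot for the N file. The names n_buckets and bucket_index and the A=0, C=1, G=2, T=3 encoding are assumptions for illustration, not taken from the PR.

```c
/* Number of non-N buckets for prefix length p: 4^p, i.e. 4, 16 or 64. */
static int n_buckets(int p)
{
    return 1 << (2 * p);
}

/* Hypothetical helper: 2-bit encodes the first p bases (A=0, C=1, G=2, T=3)
 * into an index in [0, 4^p). Any other character, including the end of a
 * short read, yields n_buckets(p), which is used as the slot of the N file. */
static int bucket_index(const char *seq, int p)
{
    int i, idx = 0;
    for (i = 0; i < p; ++i) {
        int code;
        switch (seq[i]) {
        case 'A': case 'a': code = 0; break;
        case 'C': case 'c': code = 1; break;
        case 'G': case 'g': code = 2; break;
        case 'T': case 't': code = 3; break;
        default: return n_buckets(p); /* N or anything unexpected */
        }
        idx = (idx << 2) | code;
    }
    return idx;
}
```

An array of n_buckets(p) + 1 file handles would then be enough to hold every output file, so a prefix length of 3 needs at most 65 open files.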

There are options to remove the quality scores and drop comments using the same methods as the seqtk seq function.
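
As a rough sketch of what these two options do to a single record, the following formats a read as FASTQ or FASTA and optionally drops the header comment. The read_t struct and format_read function are hypothetical and written for illustration only; they are not the PR's code or seqtk's internal types.

```c
#include <stdio.h>

/* Hypothetical record type; comment and qual may be NULL. */
typedef struct {
    const char *name, *comment, *seq, *qual;
} read_t;

/* Formats one record into buf. force_fasta corresponds to -A and
 * drop_comment to -C. Returns the number of characters that the full
 * record needs, as reported by snprintf. */
static int format_read(char *buf, size_t size, const read_t *r,
                       int force_fasta, int drop_comment)
{
    /* FASTQ only if qualities exist and FASTA output is not forced */
    int is_fq = (r->qual != NULL && !force_fasta);
    int keep_comment = (r->comment != NULL && !drop_comment);
    int n = snprintf(buf, size, "%c%s%s%s\n%s\n",
                     is_fq ? '@' : '>', r->name,
                     keep_comment ? " " : "",
                     keep_comment ? r->comment : "",
                     r->seq);
    /* append the quality lines only if the first part fit into buf */
    if (is_fq && n >= 0 && (size_t)n < size)
        n += snprintf(buf + n, size - (size_t)n, "+\n%s\n", r->qual);
    return n;
}
```

With both options set, a FASTQ record with a comment is reduced to a two-line FASTA record that contains only the read name and the sequence.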

I have tried to stick to the coding style of the rest of the file. However, this is my first time coding in C, so I am sure there are improvements that could be made.

Unknown added 4 commits February 22, 2021 13:04