Simple Dataset generator? #122

Open
MethanJess opened this issue Apr 2, 2024 · 5 comments

Comments

@MethanJess

Hi, I already know about Speech Dataset Generator.
However, it has far more features than I need, and I couldn't get it to work on my system.

So, does anyone have a simple script that splits an audio file into segments, converts them to the right sample rate, and then uses WhisperX large-v3 to transcribe the segments and produce "sample_dataset.csv" and "sample_val_dataset.csv" (plus anything else that's needed)?

I tried making my own, but I have no idea how to make the validation file...
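
The splitting and resampling steps asked about above could be sketched roughly as follows. This is only a sketch under assumptions: it cuts the file into fixed-length chunks (which can split words in half, unlike silence-based splitting), and the 10-second length and 24000 Hz rate are placeholder defaults, not values confirmed anywhere in this thread. The function names are made up for illustration. It only builds `ffmpeg` command lines, so the transcription step is not covered.

```python
from pathlib import Path


def plan_segments(total_seconds: float, segment_seconds: float) -> list[tuple[float, float]]:
    """Return (start, duration) pairs covering the file in fixed-length chunks."""
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    segments = []
    start = 0.0
    while start < total_seconds:
        # The last chunk may be shorter than segment_seconds.
        duration = min(segment_seconds, total_seconds - start)
        segments.append((start, duration))
        start += segment_seconds
    return segments


def ffmpeg_commands(src: str, out_dir: str, total_seconds: float,
                    segment_seconds: float = 10.0,   # placeholder, not a project requirement
                    sample_rate: int = 24000         # placeholder, check what your model expects
                    ) -> list[list[str]]:
    """Build one ffmpeg command per segment: cut, downmix to mono, resample."""
    stem = Path(src).stem
    commands = []
    for i, (start, duration) in enumerate(plan_segments(total_seconds, segment_seconds)):
        out_path = Path(out_dir) / f"{stem}_{i:04d}.wav"
        commands.append([
            "ffmpeg", "-y",
            "-ss", f"{start:.3f}", "-t", f"{duration:.3f}",  # seek to start, read `duration` seconds
            "-i", src,
            "-ac", "1", "-ar", str(sample_rate),             # mono, target sample rate
            str(out_path),
        ])
    return commands
```

Each returned command can be run with `subprocess.run(cmd, check=True)`, provided `ffmpeg` is installed and the output directory exists.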

@vatsalaggarwal
Contributor

The validation file should have the same format as sample_dataset.csv. Once you've generated the whole dataset and manually split it into a large training set and a small validation set, you can put the file IDs for each set into the corresponding CSV.
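
As a rough illustration of the split described above, here is a minimal sketch that shuffles the rows and writes a large training CSV and a small validation CSV in the same format. The `|` delimiter and the `audio_files`/`captions` header are assumptions about the layout of sample_dataset.csv and should be checked against the file in the repository. The 10% validation fraction and the function names are arbitrary choices for illustration.

```python
import csv
import random


def split_rows(rows, val_fraction=0.1, seed=0):
    """Shuffle rows reproducibly and return (train_rows, val_rows)."""
    if not 0 < val_fraction < 1:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(rows)
    if len(shuffled) < 2:
        raise ValueError("need at least two rows to split")
    random.Random(seed).shuffle(shuffled)
    # Keep at least one row for validation, however small the dataset is.
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def write_csv(path, rows, header=("audio_files", "captions")):
    """Write rows with a header line; delimiter and header are assumed, not confirmed."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="|", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)
```

Usage would look like `train, val = split_rows(all_rows)` followed by `write_csv("sample_dataset.csv", train)` and `write_csv("sample_val_dataset.csv", val)`. Both files share one format; only the rows differ.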

@MethanJess
Author

MethanJess commented Apr 9, 2024

@vatsalaggarwal I'm really not sure what that means...
But I've heard that some contributors to this project (@lucapericlp and @danablend) have their own dataset generators. Would they be willing to share theirs? (Or anyone else?)

@Vijayvk9092

This comment was marked as off-topic.

@lucapericlp
Contributor

Hey @MethanJess, sorry for the late reply. I've just followed a process similar to the one @vatsalaggarwal pointed out for putting together the datasets, and I don't have any special generators of my own. If you run into any issues putting together a useful data pipeline, let us know and we'll see if we can help!

@MethanJess
Author

Hey @lucapericlp I found this repository: https://github.com/daswer123/xtts-webui
It has a dataset generator that splits audio into segments and transcribes each one, producing a transcription for every segment plus a validation file.
It was made for Coqui, but the format it creates is very similar to MetaVoice's, so with just a little editing it should work, right?
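
The "little bit of editing" mentioned above might look something like the sketch below. Both layouts are assumptions. The input is assumed to be a `|`-delimited metadata file with `audio_file`, `text`, and `speaker_name` columns. The output is assumed to be a `|`-delimited file with `audio_files` and `captions` columns, where each caption is a path to a text file. Neither is confirmed in this thread, so compare against the real files before relying on it. `convert_metadata` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def convert_metadata(src_csv, dst_csv, captions_dir):
    """Rewrite a Coqui-style metadata CSV into an assumed two-column layout.

    Each transcript is written to its own .txt file and the output row
    points at the audio file and that caption file. Returns the row count.
    """
    captions_dir = Path(captions_dir)
    captions_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(src_csv, newline="", encoding="utf-8") as src, \
         open(dst_csv, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src, delimiter="|")  # assumed input columns: audio_file, text, speaker_name
        writer = csv.writer(dst, delimiter="|", lineterminator="\n")
        writer.writerow(["audio_files", "captions"])  # assumed output header
        for row in reader:
            audio = row["audio_file"]
            # Caption file is named after the audio file; identical stems would overwrite each other.
            caption_path = captions_dir / (Path(audio).stem + ".txt")
            caption_path.write_text(row["text"].strip(), encoding="utf-8")
            writer.writerow([audio, str(caption_path)])
            count += 1
    return count
```

Running it once on the training metadata and once on the evaluation metadata would give the two CSVs, with the speaker column dropped.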
