Simple Dataset generator? #122

MethanJess · 2024-04-02T15:10:02Z

Hi, I already know about Speech Dataset Generator.
However, it's way too bloated with features, and I couldn't get it to work on my system.

So, does anyone have a simple script that splits an audio file into segments, converts the audio to the right sample rate, and then uses WhisperX large-v3 to transcribe the segments to produce "sample_dataset.csv" and "sample_val_dataset.csv" (plus anything else that's needed)?

I tried making my own, but I have no idea how to make the validation file...
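
For reference, here is a rough sketch of the splitting step only, using Python's standard-library `wave` module. It assumes the input is an uncompressed PCM WAV file and cuts it into fixed-length chunks rather than at silences or sentence boundaries. The function name `split_wav` and the 10-second default are made up for illustration; resampling and the WhisperX transcription are not covered here.

```python
import wave
from pathlib import Path


def split_wav(src, out_dir, segment_seconds=10.0):
    """Cut a PCM WAV file into consecutive fixed-length segments.

    Hypothetical helper, not part of MetaVoice or WhisperX.
    Returns the list of written segment paths; the last one may be shorter.
    """
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        # At least one frame per segment, so the loop always makes progress.
        frames_per_segment = max(1, int(params.framerate * segment_seconds))
        index = 0
        while True:
            frames = reader.readframes(frames_per_segment)
            if not frames:
                break  # reached the end of the input file
            out_path = out_dir / f"{Path(src).stem}_{index:04d}.wav"
            with wave.open(str(out_path), "wb") as writer:
                # Copy channels, sample width and rate from the source;
                # the frame count in the header is corrected on close.
                writer.setparams(params)
                writer.writeframes(frames)
            written.append(out_path)
            index += 1
    return written
```

Calling `split_wav("recording.wav", "segments", segment_seconds=10.0)` would write `segments/recording_0000.wav`, `segments/recording_0001.wav`, and so on. Fixed-length cuts can land mid-word, so a real pipeline would probably want to cut on pauses instead.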

vatsalaggarwal · 2024-04-03T10:14:46Z

The validation file should have the same format as sample_dataset.csv. Once you have generated the whole dataset and manually split it into a large training set and a small validation set, you can place the respective file IDs into the two CSVs.
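
If doing the split by hand is tedious, it could be sketched roughly like this. This is only an illustration: the helper names, the 10% validation share, and the `audio_files|captions` header with a `|` delimiter are assumptions, so check them against the sample_dataset.csv shipped with the repo before relying on it.

```python
import csv
import random


def split_dataset(rows, val_fraction=0.1, seed=0):
    """Shuffle rows and return (train_rows, val_rows).

    Hypothetical helper. A fixed seed keeps the split reproducible.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    # Keep at least one validation row whenever there is more than one row.
    n_val = max(1, round(len(rows) * val_fraction)) if len(rows) > 1 else 0
    return rows[n_val:], rows[:n_val]


def write_csv(path, rows, header=("audio_files", "captions")):
    """Write rows to a pipe-delimited CSV (header and delimiter are assumed)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow(header)
        writer.writerows(rows)


if __name__ == "__main__":
    # Placeholder rows; in practice these come from your generated dataset.
    all_rows = [(f"data/clip_{i:04d}.wav", f"data/clip_{i:04d}.txt") for i in range(50)]
    train_rows, val_rows = split_dataset(all_rows)
    write_csv("sample_dataset.csv", train_rows)
    write_csv("sample_val_dataset.csv", val_rows)
```

Both files end up with the same columns; the only difference is which rows go into each.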

MethanJess · 2024-04-09T20:46:47Z

@vatsalaggarwal Really not sure what that means...
But I've heard that some contributors to this project (@lucapericlp and @danablend) have their own dataset generators. Would they be willing to share them? (Or anyone else?)

lucapericlp · 2024-05-14T22:09:08Z

Hey @MethanJess, sorry for the late reply. I've just followed a process similar to the one @vatsalaggarwal pointed out for putting together the datasets, but I don't have any special generators of my own. If you're running into any issues putting together a useful data pipeline, let us know & we'll see if we can help!

MethanJess · 2024-05-15T00:46:59Z

Hey @lucapericlp, I found this repository: https://github.com/daswer123/xtts-webui
It has a dataset generator that splits audio and transcribes it, producing a transcription of each audio segment and a validation file.
It was made for Coqui, but the format it creates is very similar to MetaVoice's, so with just a little bit of editing it should work, right?
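
The kind of editing I have in mind would look roughly like this. It is a sketch under assumptions I haven't verified: that the Coqui-style metadata file is pipe-delimited with `audio_file` and `text` columns, and that MetaVoice expects a pipe-delimited `audio_files|captions` file where each caption is a path to a text file. `convert_metadata` is a made-up name.

```python
import csv
from pathlib import Path


def convert_metadata(coqui_csv, out_csv, caption_dir):
    """Rewrite a Coqui-style metadata CSV into an assumed MetaVoice layout.

    Each transcription is written to its own .txt file in caption_dir, and
    the output CSV pairs every audio path with its caption path.
    Returns the number of rows converted.
    """
    caption_dir = Path(caption_dir)
    caption_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(coqui_csv, newline="", encoding="utf-8") as src, \
            open(out_csv, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src, delimiter="|")  # assumed input columns
        writer = csv.writer(dst, delimiter="|")
        writer.writerow(["audio_files", "captions"])  # assumed output header
        for row in reader:
            audio = row["audio_file"]
            # One caption file per clip, named after the audio file.
            caption_path = caption_dir / (Path(audio).stem + ".txt")
            caption_path.write_text(row["text"].strip(), encoding="utf-8")
            writer.writerow([audio, str(caption_path)])
            count += 1
    return count
```

It would be run once on the training metadata and once on the validation metadata to get both CSVs.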
