
(Still not complete!!) This UI serves as a Synthetic ASR Dataset Generator powered by and built for OpenAI Whisper, enabling users to capture audio, transcribe it on the fly, and manage the generated dataset. It provides a user-friendly interface for configuring audio parameters, transcription options, and dataset management.


gongouveia/Whisper-Temple-Synthetic-ASR-Dataset-Generator


Project under construction. First complete release early May 2024 🚧👷‍♂️

What is missing now:

🟡 On some operating systems, the dark theme template of pyqtdarktheme is not found.

🟡 Audio capture bug: the first 0.2 seconds of each recording are corrupted (I believe it is due to audio driver latency or erratic Python multithreading behavior).

🔴 Hugging Face export is not complete.
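Until the capture bug above is fixed, a simple workaround is to record slightly longer than needed and discard the first 0.2 seconds. A minimal sketch with NumPy (the trim_start helper and the 16 kHz rate are illustrative assumptions, not part of the project code):

```python
import numpy as np

def trim_start(audio: np.ndarray, sample_rate: int, seconds: float = 0.2) -> np.ndarray:
    """Drop the first `seconds` of a recording to skip the corrupted onset."""
    return audio[int(seconds * sample_rate):]

# Example: a 5.2 s recording at 16 kHz becomes a clean 5.0 s clip.
raw = np.zeros(int(5.2 * 16000), dtype=np.int16)
clean = trim_start(raw, 16000)
```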

Synthetic Speech Dataset Generator (SpeechGen)

With the rising number of open-source ASR/NLP projects democratizing the AI human-machine interface comes the need for better ASR datasets. Whisper Temple provides a simple-to-use platform for creating synthetic speech datasets as pairs of audio and text. Transcription is powered by faster-whisper ⏩ and synthetic transcriptions can be edited in the UI data viewer. The user interface is built with PyQt5 and runs entirely on the local machine.

Overview

This application serves as a Synthetic Speech Generator, enabling users to transcribe captured audio and manage the generated datasets. It provides a user-friendly interface for configuring audio parameters, transcription options, and dataset management.

Features

  • Audio Capture: Users can capture audio samples with customizable settings such as sample rate and duration.
  • Transcription: Provides the option to transcribe captured audio into text.
  • Audio Metadata: Allows users to add metadata to the dataset, such as audio sample rate and duration.
  • Dataset Management: Enables users to view, delete, and manage entries in the generated dataset.
  • Export: Allows exporting the dataset for further processing or for upload to Hugging Face 🤗.

Future Releases

Adding metadata to each dataset entry (audio sample rate, length, speaker gender and age). README update with a UI screenshot and video.

Installation (Experimental. Not yet complete; PyPI and conda installs are planned)

First, I suggest you create and activate a new virtual environment using conda or virtualenv. Then follow the steps below ⬇️

  1. Clone the repository:
    git clone https://github.com/gongouveia/Syntehtic-Speech-Dataset-Generator.git
  2. Install the dependencies listed in requirements.txt (pip install -r requirements.txt).
  3. Follow the instructions in the Usage section.

Usage

  1. Launch the application and create or continue a project by running python speech_gen.py --project <project_name> --theme <theme>, where theme is 'auto' (default), 'light', or 'dark'.
  2. Configure audio capture parameters such as sample rate in Hz (default: 16000) and duration in milliseconds (default: 5000).
  3. If CUDA is found, it is possible to transcribe each audio recording at the end of the capture. Otherwise, you can batch-transcribe the audios in the dataset viewer.
  4. Choose whether to use the VAD option in transcription; it is enabled by default and allows for faster transcription.
  5. Click on "Capture Audio" to start a new audio recording.
  6. View and manage the audio dataset using the provided menu options.
  7. Edit weak transcriptions, creating an even more robust training dataset for Whisper.
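The launch command in step 1 can be sketched with argparse. This is a hypothetical reconstruction of the speech_gen.py interface — only --project and --theme appear in this README; everything else (help strings, the required flag) is an assumption:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical sketch of the speech_gen.py command-line interface."""
    parser = argparse.ArgumentParser(description="Synthetic Speech Dataset Generator")
    parser.add_argument("--project", required=True,
                        help="Name of the project to create or continue")
    parser.add_argument("--theme", choices=["auto", "light", "dark"], default="auto",
                        help="UI theme (default: auto)")
    return parser

# Equivalent to: python speech_gen.py --project demo --theme dark
args = build_parser().parse_args(["--project", "demo", "--theme", "dark"])
```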

Notes

If the idiom (language) argument is set to 'en', the languages dropdown menu is not available. If transcription at the end of each recording (option 3) is disabled, it is possible to transcribe all the captured audios in the dataset viewer window. You can also add audios to the dataset by pasting them into the /Audios folder under your desired project.

Configuration

  • Audio Sample Rate: Set the sample rate for audio capture (in Hz).
  • Audio Duration: Define the duration of audio samples to capture (in milliseconds).
  • Transcribe: Choose whether to transcribe captured audio (Yes/No).
  • VAD: Enable or disable VAD in transcription (Yes/No).
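The four options above map naturally onto a small configuration object. A sketch using the defaults stated in this README (the CaptureConfig name and field names are assumptions, not the project's actual code):

```python
from dataclasses import dataclass

@dataclass
class CaptureConfig:
    """Capture/transcription settings with the defaults from this README."""
    sample_rate_hz: int = 16000   # audio sample rate in Hz
    duration_ms: int = 5000      # recording length in milliseconds
    transcribe: bool = True      # transcribe each capture immediately (needs CUDA)
    vad: bool = True             # voice activity detection, for faster transcription

# Override a single option while keeping the other defaults.
cfg = CaptureConfig(duration_ms=10000)
```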

Dataset Management

  • View Dataset: Opens a new window to view the generated dataset.
  • Refresh Dataset: Refreshes the dataset; use after editing metadata.csv.
  • Delete Entry: Deletes the last recorded entry from the dataset.

Exporting Dataset as Hugging Face Audio Dataset

To export the final dataset as a Hugging Face 🤗 audio dataset, use the provided Command-Line Interface (CLI). [https://huggingface.co/docs/datasets/audio_dataset]

You can log in to the UI by providing your HF token [https://huggingface.co/docs/hub/security-tokens].
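The Hugging Face audiofolder format expects a project folder containing the audio files plus a metadata.csv whose file_name column points at each clip. A sketch of writing such a metadata.csv (the demo_project path, column set, and placeholder transcription are assumptions; the push step is commented out because it needs the `datasets` library and a valid HF token):

```python
import csv
from pathlib import Path

def write_metadata(project_dir: Path, rows: list[dict]) -> Path:
    """Write the metadata.csv expected by the Hugging Face audiofolder loader."""
    path = project_dir / "metadata.csv"
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["file_name", "transcription"])
        writer.writeheader()
        writer.writerows(rows)
    return path

project = Path("demo_project")
project.mkdir(exist_ok=True)
write_metadata(project, [
    {"file_name": "Audios/sample_0.wav", "transcription": "hello world"},
])

# With `datasets` installed, the folder can then be loaded and pushed:
# from datasets import load_dataset
# ds = load_dataset("audiofolder", data_dir=str(project))
# ds.push_to_hub("your-username/your-dataset")  # requires a logged-in HF token
```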

Future releases or on-demand solutions:

Depending on community demand or necessity, these features will be merged:

  • Adding a new transcription engine or more transcription configuration options;
  • Adding more metadata to the dataset, such as speaker and file-type information;
  • Export in the Kaldi ☕ dataset format;
  • Adding loading bars for dataset batch transcription;
  • A new window to train Whisper with the new pseudo-synthetic dataset (on request; contact me if you need this solution).

Contributing

Contributions to this project are welcome! If you'd like to contribute, please follow the standard GitHub workflow:

  1. Fork the repository.
  2. Create a new branch for your feature (git checkout -b feature/your-feature).
  3. Commit your changes (git commit -am 'Add some feature').
  4. Push to the branch (git push origin feature/your-feature).
  5. Create a new Pull Request.

Author

For any inquiries or collaboration, please contact [[email protected]]. I would be grateful to be cited in datasets created with this tool.
