
(Still not complete!!) This UI serves as a Synthetic ASR Dataset Generator powered by and built for OpenAI Whisper, enabling users to capture audio, transcribe it on the fly, and manage the generated dataset. It provides a user-friendly interface for configuring audio parameters, transcription options, and dataset management.


gongouveia/Whisper-Temple-Synthetic-ASR-Dataset-Generator


Project under construction. First complete release early May 2024 🚧👷‍♂️

What is missing now:

🟡 On some operating systems, the dark theme template of pyqtdarktheme is not found.

🟡 Audio capture bug: the first 0.2 seconds of each recording are corrupted (I believe it is due to audio driver latency or erratic Python multithreading behavior).

🔴 Hugging Face export is not complete.
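Until the capture bug above is fixed, a simple workaround is to record slightly longer than needed and discard the first 0.2 seconds. A minimal sketch with NumPy (the trim_start helper and the 16 kHz rate are illustrative assumptions, not part of the project code):

```python
import numpy as np

def trim_start(audio: np.ndarray, sample_rate: int, seconds: float = 0.2) -> np.ndarray:
    """Drop the first `seconds` of a recording to skip the corrupted onset."""
    return audio[int(seconds * sample_rate):]

# Example: a 5.2 s recording at 16 kHz becomes a clean 5.0 s clip.
raw = np.zeros(int(5.2 * 16000), dtype=np.int16)
clean = trim_start(raw, 16000)
```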

Synthetic Speech Dataset Generator (SpeechGen)

With the rising number of open-source ASR/NLP projects democratizing the AI human-machine interface comes the need for better ASR datasets. Whisper Temple provides a simple-to-use platform for creating synthetic speech datasets as pairs of audio and text. Transcription is powered by faster-whisper ⏩ and synthetic transcriptions can be edited in the UI data viewer. The user interface is built with PyQt5 and runs entirely on the local machine.

Overview

This application serves as a Synthetic Speech Generator, enabling users to transcribe captured audio and manage the generated datasets. It provides a user-friendly interface for configuring audio parameters, transcription options, and dataset management.

Features

  • Audio Capture: Users can capture audio samples with customizable settings such as sample rate and duration.
  • Transcription: Provides the option to transcribe captured audio into text.
  • Audio Metadata: Allows users to add metadata to the dataset, such as audio sample rate and duration.
  • Dataset Management: Enables users to view, delete, and manage entries in the generated dataset.
  • Export: Allows exporting the dataset for further processing or for upload to Hugging Face 🤗.

Future Releases

Adding metadata to each dataset entry (audio sample rate, length, speaker gender and age). README update with a UI screenshot and video.

Installation (Experimental. Not yet complete; PyPI and conda installs are planned)

First, I suggest you create and activate a new virtual environment using conda or virtualenv. Then follow the steps below ⬇️

  1. Clone the repository:
    git clone https://github.com/gongouveia/Syntehtic-Speech-Dataset-Generator.git
  2. Install the dependencies listed in requirements.txt (pip install -r requirements.txt).
  3. Follow the instructions in the Usage section.

Usage

  1. Launch the application and create or continue a project by running python speech_gen.py --project <project_name> --theme <theme>, where theme is 'auto' (default), 'light', or 'dark'.
  2. Configure audio capture parameters such as sample rate in Hz (default: 16000) and duration in milliseconds (default: 5000).
  3. If CUDA is found, it is possible to transcribe each audio recording at the end of the capture. Otherwise, you can batch-transcribe the audios in the dataset viewer.
  4. Choose whether to use the VAD option in transcription; it is enabled by default and allows for faster transcription.
  5. Click on "Capture Audio" to start a new audio recording.
  6. View and manage the audio dataset using the provided menu options.
  7. Edit weak transcriptions, creating an even more robust training dataset for Whisper.
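The launch command in step 1 can be sketched with argparse. This is a hypothetical reconstruction of the speech_gen.py interface — only --project and --theme appear in this README; everything else (help strings, the required flag) is an assumption:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical sketch of the speech_gen.py command-line interface."""
    parser = argparse.ArgumentParser(description="Synthetic Speech Dataset Generator")
    parser.add_argument("--project", required=True,
                        help="Name of the project to create or continue")
    parser.add_argument("--theme", choices=["auto", "light", "dark"], default="auto",
                        help="UI theme (default: auto)")
    return parser

# Equivalent to: python speech_gen.py --project demo --theme dark
args = build_parser().parse_args(["--project", "demo", "--theme", "dark"])
```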

Notes

If the idiom (language) argument is set to 'en', the languages dropdown menu is not available. If transcription at the end of each recording (option 3) is disabled, it is possible to transcribe all the captured audios in the dataset viewer window. You can also add audios to the dataset by pasting them into the /Audios folder under your desired project.

Configuration

  • Audio Sample Rate: Set the sample rate for audio capture (in Hz).
  • Audio Duration: Define the duration of audio samples to capture (in milliseconds).
  • Transcribe: Choose whether to transcribe captured audio (Yes/No).
  • VAD: Enable or disable VAD in transcription (Yes/No).
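The four options above map naturally onto a small configuration object. A sketch using the defaults stated in this README (the CaptureConfig name and field names are assumptions, not the project's actual code):

```python
from dataclasses import dataclass

@dataclass
class CaptureConfig:
    """Capture/transcription settings with the defaults from this README."""
    sample_rate_hz: int = 16000   # audio sample rate in Hz
    duration_ms: int = 5000      # recording length in milliseconds
    transcribe: bool = True      # transcribe each capture immediately (needs CUDA)
    vad: bool = True             # voice activity detection, for faster transcription

# Override a single option while keeping the other defaults.
cfg = CaptureConfig(duration_ms=10000)
```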

Dataset Management

  • View Dataset: Opens a new window to view the generated dataset.
  • Refresh Dataset: Refreshes the dataset; use after editing metadata.csv.
  • Delete Entry: Deletes the last recorded entry from the dataset.

Exporting Dataset as Hugging Face Audio Dataset

To export the final dataset as a Hugging Face 🤗 audio dataset, use the provided Command-Line Interface (CLI). [https://huggingface.co/docs/datasets/audio_dataset]

You can log in to the UI by providing your HF token [https://huggingface.co/docs/hub/security-tokens].
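The Hugging Face audiofolder format expects a project folder containing the audio files plus a metadata.csv whose file_name column points at each clip. A sketch of writing such a metadata.csv (the demo_project path, column set, and placeholder transcription are assumptions; the push step is commented out because it needs the `datasets` library and a valid HF token):

```python
import csv
from pathlib import Path

def write_metadata(project_dir: Path, rows: list[dict]) -> Path:
    """Write the metadata.csv expected by the Hugging Face audiofolder loader."""
    path = project_dir / "metadata.csv"
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["file_name", "transcription"])
        writer.writeheader()
        writer.writerows(rows)
    return path

project = Path("demo_project")
project.mkdir(exist_ok=True)
write_metadata(project, [
    {"file_name": "Audios/sample_0.wav", "transcription": "hello world"},
])

# With `datasets` installed, the folder can then be loaded and pushed:
# from datasets import load_dataset
# ds = load_dataset("audiofolder", data_dir=str(project))
# ds.push_to_hub("your-username/your-dataset")  # requires a logged-in HF token
```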

Future releases or on-demand solutions:

Depending on community demand or necessity, these features will be merged:

  • Adding a new transcription engine or more transcription configuration options;
  • Adding more metadata to the dataset, such as speaker and file-type information;
  • Export in the Kaldi ☕ dataset format;
  • Adding loading bars for dataset batch transcription;
  • A new window to train Whisper with the new pseudo-synthetic dataset (on request; contact me if you need this solution).

Contributing

Contributions to this project are welcome! If you'd like to contribute, please follow the standard GitHub workflow:

  1. Fork the repository.
  2. Create a new branch for your feature (git checkout -b feature/your-feature).
  3. Commit your changes (git commit -am 'Add some feature').
  4. Push to the branch (git push origin feature/your-feature).
  5. Create a new Pull Request.

Author

For any inquiries or collaboration, please contact [[email protected]]. I would be grateful to be cited in datasets created with this tool.
