add GigaSpeech dataset in SpeechBrain #2405

Adel-Moumen · 2024-02-10T14:52:00Z

What does this PR do?

Work in progress. Do not review/merge.

This PR aims to add the GigaSpeech dataset to SpeechBrain.

Webdataset support?

At first, I designed the data prep script to support webdataset out of the box. However, my experiments showed that the way we were using webdataset was not optimal. As described in the Google Colab tutorial, we first create shards and then apply our usual padding function, etc. The problem is that a shard may mix very long and very short audio sequences, which results in a lot of padding.

Furthermore, many SpeechBrain features do not work easily with shards. For instance, training a simple label encoder requires some engineering tricks to make it work on shards, and so does padding. Also, webdataset has evolved, so we would have to pin the dependency to a very old version of webdataset.

I had some time to look at other toolkit implementations and found that NeMo, for instance, first sorts the audio files and THEN creates the shards. I also found that Lhotse has its own webdataset implementation, which seems more tailored to the speech modality (maybe we should have a closer look).

Thus, I decided to temporarily remove webdataset from this PR. I think most people who use the GigaSpeech XL split will have access to more than 1 TB of storage, so it won't be an issue at all. I am open to revisiting this, but it will require some design discussion.
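
To make the padding argument concrete, here is a minimal sketch (not NeMo's or webdataset's actual code) that compares the fraction of padded audio when shards are cut in arrival order versus after sorting by duration. The function names, the shard and batch sizes, and the simplification that batches are drawn in order from inside a single shard are all assumptions made for illustration.

```python
from typing import List, Sequence


def make_shards(durations: Sequence[float], shard_size: int) -> List[List[float]]:
    """Cut a sequence of utterance durations into consecutive shards."""
    return [
        list(durations[i : i + shard_size])
        for i in range(0, len(durations), shard_size)
    ]


def padding_ratio(shards: List[List[float]], batch_size: int) -> float:
    """Fraction of padded audio when every batch is padded to its longest item.

    Assumption: batches are taken in order from within one shard, which is a
    simplified stand-in for reading a sharded dataset sequentially.
    """
    padded = 0.0
    total = 0.0
    for shard in shards:
        for i in range(0, len(shard), batch_size):
            batch = shard[i : i + batch_size]
            longest = max(batch)
            total += longest * len(batch)
            padded += sum(longest - d for d in batch)
    return padded / total if total else 0.0


# Toy durations in seconds, alternating very short and very long utterances.
durations = [1.0, 20.0, 2.0, 19.0, 3.0, 18.0, 4.0, 17.0]

# Shard in arrival order, as in the approach described above.
unsorted_ratio = padding_ratio(make_shards(durations, shard_size=4), batch_size=2)

# Sort first, THEN shard.
sorted_ratio = padding_ratio(make_shards(sorted(durations), shard_size=4), batch_size=2)

print(f"padding without sorting: {unsorted_ratio:.3f}")
print(f"padding with sorting:    {sorted_ratio:.3f}")
```

On this toy input, sorting before sharding wastes far less of each batch on padding. Real pipelines usually add some shuffling between shards so that training does not see utterances in strict length order.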

General Todo

To do:

add opus -> wav function if users want to change codec
data prep
use parallel_map so that it is very fast
whisper training and yaml recipe
train whisper and report back results
add recipe tests
M and XL subsets
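
The data prep items above boil down to mapping a per-utterance function over the dataset metadata in parallel. The sketch below illustrates that pattern with the standard library only; it is a stand-in and does not reproduce the `parallel_map` helper named in the list. The segment field names (`sid`, `begin_time`, `end_time`, `text_tn`), the garbage tag set, and the `CsvRow`, `process_utterance`, and `prepare_rows` names are assumptions for illustration, not the PR's actual code.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional

# Assumed set of tags marking non-speech utterances that should be dropped.
GARBAGE_UTTERANCE_TAGS = {"<SIL>", "<MUSIC>", "<NOISE>", "<OTHER>"}


@dataclass
class CsvRow:
    """One row of the prepared CSV (hypothetical layout)."""

    utt_id: str
    duration: float
    text: str


def process_utterance(segment: Dict) -> Optional[CsvRow]:
    """Turn one metadata segment into a CSV row, or None if it is garbage."""
    text = segment["text_tn"].strip()
    if text in GARBAGE_UTTERANCE_TAGS:
        return None
    duration = segment["end_time"] - segment["begin_time"]
    return CsvRow(utt_id=segment["sid"], duration=duration, text=text)


def prepare_rows(segments: Iterable[Dict], workers: int = 4) -> List[CsvRow]:
    """Process all segments concurrently and keep the input order.

    Threads keep this sketch portable; for CPU-bound preparation a process
    pool is the more usual choice.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_utterance, segments))
    return [row for row in results if row is not None]


segments = [
    {"sid": "utt_0", "begin_time": 0.0, "end_time": 2.5, "text_tn": "HELLO WORLD"},
    {"sid": "utt_1", "begin_time": 2.5, "end_time": 4.0, "text_tn": "<MUSIC>"},
]
rows = prepare_rows(segments, workers=2)
print(rows)
```

Keeping the per-utterance function free of shared state is what makes it straightforward to swap the executor for a different parallel mapping helper later.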

CTC

To do:

wavLM recipe
switch label encoder to sentencepiece
train a wavLM model on split M and XL size and report back results
add ngram DL (https://www.dropbox.com/scl/fo/b9z77qau03ocogk65sz2m/h?rlkey=8uwoomqszjiohnyacvfwlb2xx&dl=0)

Maybe one day:

webdataset for data prep
webdataset for training

Reference

k2 Icefall
Model | Dev | Test
zipformer | 10.25 | 10.38
conformer_ctc | 10.47 | 10.58
pruned_transducer_stateless2 | 10.40 | 10.51

Before submitting

Did you read the contributor guideline?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified
Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
Review the self-review checklist to ensure the code is ready for review

Adel-Moumen added 6 commits February 10, 2024 11:19

data prep scripts update

962ad50

iterate over utterances

39b5049

without parallel map

b313734

parallel map -> so fast omfg

7bdb17f

gigaspeech data prep done

a3d2d4c

speechcolab extra dep if one must download gigaspeech

4cb3257

Adel-Moumen self-assigned this Feb 10, 2024

Adel-Moumen added the enhancement, work in progress, and recipes labels Feb 10, 2024

Adel-Moumen added 20 commits February 10, 2024 15:53

create ASR CTC folder

e521cc1

base yaml + update data prep to better reflect potential different naming for csvs

92a17c1

update recipe

4dd02a0

update recipe to be compliant with gigaspeech csv

c254085

add transformers dep

b4de83a

convert opus to wav

c3afdcc

recipe --debug mode works.

945b8bb

typo GRABAGE_UTTERANCE_TAGS -> GARBAGE_UTTERANCE_TAGS

ae91209

tmp DL file

28b4257

update DL FILE

3a6396c

add DL file in ASR/CTC

6e771d7

update extra_requirements.txt

ebfcddb

add support of savedir within Pretrained subclasses

a68d0b8

add wbs requirements

b2ed2a9

webdataset

4b8c533

remove print

44785c0

tmp files webdataset

e203d77

verbosity + metada.json

9b44e8d

letzo now label_encoder can actually train + the recipe seems to work.

1426156

Merge branch 'develop' of https://github.com/Adel-Moumen/speechbrain into gigaspeech

0786b0b

Adel-Moumen and others added 14 commits February 14, 2024 11:36

Merge branch 'speechbrain:develop' into gigaspeech

99bdfb1

Merge branch 'gigaspeech' of https://github.com/Adel-Moumen/speechbrain into gigaspeech

aaeee16

remove wbs

ce12662

DL info

ed3ba03

HF DL support

8ae360b

remove webdataset as it sucks :p

1601ddc

name

9531d0b

ngram commands

1356ff1

Merge branch 'speechbrain:develop' into gigaspeech

4fa921b

whisper baseline

0485173

fix HF

b360f8b

Merge remote-tracking branch 'speechbrain/develop' into gigaspeech

3d71a04

pre-commit + sentencepiece char

81884ee

remove csv

0f3da32

Adel-Moumen added this to the v1.0.2 milestone May 2, 2024
