
add GigaSpeech dataset in SpeechBrain #2405

Draft: wants to merge 40 commits into base: develop

Commits (40)
962ad50
data prep scripts update
Adel-Moumen Feb 10, 2024
39b5049
iterate over utterances
Adel-Moumen Feb 10, 2024
b313734
without parallel map
Adel-Moumen Feb 10, 2024
7bdb17f
parallel map -> so fast omfg
Adel-Moumen Feb 10, 2024
a3d2d4c
gigaspeech data prep done
Adel-Moumen Feb 10, 2024
4cb3257
speechcolab extra dep if one must download gigaspeech
Adel-Moumen Feb 10, 2024
e521cc1
create ASR CTC folder
Adel-Moumen Feb 10, 2024
92a17c1
base yaml + update data prep to better reflect potential different na…
Adel-Moumen Feb 10, 2024
4dd02a0
update recipe
Adel-Moumen Feb 10, 2024
c254085
update recipe to be compliant with gigaspeech csv
Adel-Moumen Feb 10, 2024
b4de83a
add transformers dep
Adel-Moumen Feb 10, 2024
c3afdcc
convert opus to wav
Adel-Moumen Feb 10, 2024
945b8bb
recipe --debug mode works.
Adel-Moumen Feb 10, 2024
ae91209
typo GRABAGE_UTTERANCE_TAGS -> GARBAGE_UTTERANCE_TAGS
Adel-Moumen Feb 10, 2024
28b4257
tmp DL file
Adel-Moumen Feb 11, 2024
3a6396c
update DL FILE
Adel-Moumen Feb 11, 2024
6e771d7
add DL file in ASR/CTC
Adel-Moumen Feb 11, 2024
ebfcddb
update extra_requirements.txt
Adel-Moumen Feb 11, 2024
a68d0b8
add support of savedir within Pretrained subclasses
Adel-Moumen Feb 12, 2024
b2ed2a9
add wbs requirements
Adel-Moumen Feb 12, 2024
4b8c533
webdataset
Adel-Moumen Feb 13, 2024
44785c0
remove print
Adel-Moumen Feb 13, 2024
e203d77
tmp files webdataset
Adel-Moumen Feb 13, 2024
9b44e8d
verbosity + metada.json
Adel-Moumen Feb 14, 2024
1426156
letzo now label_encoder can actually train + the recipe seems to work.
Adel-Moumen Feb 14, 2024
0786b0b
Merge branch 'develop' of https://github.com/Adel-Moumen/speechbrain …
Adel-Moumen Feb 14, 2024
99bdfb1
Merge branch 'speechbrain:develop' into gigaspeech
Adel-Moumen Feb 14, 2024
aaeee16
Merge branch 'gigaspeech' of https://github.com/Adel-Moumen/speechbra…
Adel-Moumen Feb 14, 2024
ce12662
remove wbs
Adel-Moumen Mar 18, 2024
ed3ba03
DL info
Adel-Moumen Mar 18, 2024
8ae360b
HF DL support
Adel-Moumen Mar 18, 2024
1601ddc
remove webdataset as it sucks :p
Adel-Moumen Mar 18, 2024
9531d0b
name
Adel-Moumen Mar 18, 2024
1356ff1
ngram commands
Adel-Moumen Mar 18, 2024
4fa921b
Merge branch 'speechbrain:develop' into gigaspeech
Adel-Moumen Mar 18, 2024
0485173
whisper baseline
Adel-Moumen Mar 18, 2024
b360f8b
fix HF
Adel-Moumen Mar 18, 2024
3d71a04
Merge remote-tracking branch 'speechbrain/develop' into gigaspeech
Adel-Moumen Mar 29, 2024
81884ee
pre-commit + sentencepiece char
Adel-Moumen Mar 29, 2024
0f3da32
remove csv
Adel-Moumen Mar 29, 2024
8 changes: 8 additions & 0 deletions recipes/GigaSpeech/ASR/CTC/README.md
@@ -0,0 +1,8 @@
TODO: write this README. Meanwhile, the pre-trained n-gram language models can be fetched with:

```bash
mkdir lm
git clone https://huggingface.co/wgb14/gigaspeech_lm lm
gunzip -c lm/3gram_pruned_1e7.arpa.gz > lm/3gram_pruned_1e7.arpa
gunzip -c lm/4gram.arpa.gz > lm/4gram.arpa
```
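On systems without `gunzip`, the same decompression step can be done with Python's standard library. This is an illustrative sketch; the helper name `gunzip_to` is ours and not part of the recipe:

```python
import gzip
import shutil


def gunzip_to(src: str, dst: str) -> None:
    """Decompress a .gz file to dst, mirroring `gunzip -c src > dst`."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
```

For example, `gunzip_to("lm/4gram.arpa.gz", "lm/4gram.arpa")` would replace the second `gunzip` command above.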
440 changes: 440 additions & 0 deletions recipes/GigaSpeech/ASR/CTC/dataset.py

Large diffs are not rendered by default.

95 changes: 95 additions & 0 deletions recipes/GigaSpeech/ASR/CTC/download_gigaspeech.py
@@ -0,0 +1,95 @@
"""
Note for reviewer: this is a temporary script and may be removed in the future.
Note 2: for EU/US users, this script might be very slow; using the HuggingFace
download script instead is recommended.

Download script for GigaSpeech dataset.

Download instructions: https://github.com/SpeechColab/GigaSpeech
Reference: https://arxiv.org/abs/2106.06909

Author
-------
* Adel Moumen, 2024
"""

import argparse
import logging
from typing import Optional, Sequence, Union

logger = logging.getLogger(__name__)


def download_gigaspeech(
    password: str,
    target_dir: str = ".",
    dataset_parts: Optional[Union[str, Sequence[str]]] = "auto",
    host: Optional[str] = "tsinghua",
) -> None:
    """Download the GigaSpeech dataset.

    Parameters
    ----------
    password : str
        The password to access the GigaSpeech dataset.
    target_dir : str, optional
        The path to the directory where the dataset will be downloaded.
    dataset_parts : Union[str, Sequence[str]], optional
        The parts of the dataset to be downloaded.
        If "auto", all parts will be downloaded.
        If a string, it should be a comma-separated list of parts to be downloaded.
        If a list, it should be a list of parts to be downloaded.
    host : str, optional
        The host to be used for downloading the dataset.
        The available hosts are described in https://github.com/SpeechColab/GigaSpeech.
    """
    try:
        from speechcolab.datasets.gigaspeech import GigaSpeech
    except ImportError as error:
        raise ImportError(
            "Please install the speechcolab package (`pip install speechcolab`) "
            "to download the GigaSpeech dataset."
        ) from error

    gigaspeech = GigaSpeech(target_dir)

    # Normalize every form documented above to a list of part names.
    if dataset_parts in ("auto", ["auto"]):
        dataset_parts = ["XL", "DEV", "TEST"]
    elif isinstance(dataset_parts, str):
        dataset_parts = dataset_parts.split(",")

    for part in dataset_parts:
        logger.info(f"Downloading GigaSpeech part: {part}")
        gigaspeech.download(password, "{" + part + "}", host=host)

    logger.info(f"GigaSpeech dataset finished downloading to {target_dir}.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Download GigaSpeech dataset.")
    parser.add_argument(
        "--password",
        type=str,
        required=True,
        help="The password to access the GigaSpeech dataset.",
    )
    parser.add_argument(
        "--target_dir",
        type=str,
        default=".",
        help="The path to the directory where the dataset will be downloaded.",
    )
    parser.add_argument(
        "--dataset_parts",
        type=str,
        nargs="+",  # one or more values collected into a list
        default=["auto"],
        help="The parts of the dataset to be downloaded.",
    )
    parser.add_argument(
        "--host",
        type=str,
        default="tsinghua",
        help="The host to be used for downloading the dataset.",
    )
    args = parser.parse_args()

    download_gigaspeech(
        args.password, args.target_dir, args.dataset_parts, args.host
    )
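As the inline comment notes, `nargs="+"` makes `--dataset_parts` accept one or more space-separated values and collect them into a list, while an omitted flag falls back to the `["auto"]` default. A minimal, self-contained sketch of that parsing behavior (the parser here is built for illustration only, not taken from the script):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--dataset_parts", type=str, nargs="+", default=["auto"])

# Explicit parts are collected into a list of strings.
explicit = parser.parse_args(["--dataset_parts", "XL", "DEV", "TEST"])
print(explicit.dataset_parts)  # ['XL', 'DEV', 'TEST']

# Omitting the flag keeps the ["auto"] default.
default = parser.parse_args([])
print(default.dataset_parts)  # ['auto']
```

This is why the download function has to treat `["auto"]` (the CLI default) and `"auto"` (the Python API default) as equivalent.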
3 changes: 3 additions & 0 deletions recipes/GigaSpeech/ASR/CTC/extra_requirements.txt
@@ -0,0 +1,3 @@
kenlm
speechcolab
transformers