[WIP] Hifi-GAN semantic tokens #2442

Open · wants to merge 30 commits into develop from Semantic_tokens

Commits (30)
d6c2312
fix save_path
poonehmousavi Feb 7, 2024
10247c1
add bitrate flexibility and deduplication
poonehmousavi Feb 9, 2024
9322378
fix precommit
poonehmousavi Feb 9, 2024
5144667
add checkpointer for kmeans training
poonehmousavi Feb 12, 2024
da02405
add subword tokenizer recipe for discrete tokens
poonehmousavi Feb 13, 2024
d59f15b
fix precommit
poonehmousavi Feb 13, 2024
23b92b0
fix test CI
poonehmousavi Feb 13, 2024
f24614e
conflict
poonehmousavi Feb 13, 2024
488c09f
fix CI
poonehmousavi Feb 13, 2024
e0a9270
fix conflicts
poonehmousavi Feb 13, 2024
ee338cd
Merge branch 'develop' into Semantic_tokens
poonehmousavi Feb 13, 2024
3e6242c
add tokenizer
poonehmousavi Feb 15, 2024
cd0dd7e
major refactoring
poonehmousavi Feb 16, 2024
1b46d96
fix pre-commit
poonehmousavi Feb 16, 2024
d2a7a87
fix CI
poonehmousavi Feb 16, 2024
db1b999
fix CL and minor cleaning
poonehmousavi Feb 17, 2024
c6c3517
fix import in docstring
poonehmousavi Feb 17, 2024
dd77d5f
update wavlm and wav2vec kmeans training yaml file
poonehmousavi Feb 17, 2024
af42120
fix minor bug
poonehmousavi Feb 18, 2024
9613c12
fix CI
poonehmousavi Feb 19, 2024
38b0029
fix precommit
poonehmousavi Feb 19, 2024
76ef27a
fix bug
poonehmousavi Feb 26, 2024
06325cf
update all kmeans recipes to have checkpointing
poonehmousavi Feb 28, 2024
9e8fcd3
Merge branch 'develop' into Semantic_tokens
poonehmousavi Feb 28, 2024
0be893b
add quantization recipe for VoxCeleb
poonehmousavi Feb 29, 2024
15a6ec0
Merge branch 'Semantic_tokens' of https://github.com/poonehmousavi/sp…
poonehmousavi Feb 29, 2024
a19c59c
add multi-codebook hifigan
Chaanks Feb 29, 2024
7e4575a
fix multi-codebook hifigan
Chaanks Feb 29, 2024
a384051
remove unused import
Chaanks Mar 1, 2024
e0d1e97
fix model selection
Chaanks Mar 1, 2024
8 changes: 1 addition & 7 deletions conftest.py
@@ -40,13 +40,7 @@ def pytest_generate_tests(metafunc):
     except ModuleNotFoundError:
         collect_ignore.append("speechbrain/utils/kmeans.py")
         collect_ignore.append(
-            "speechbrain/lobes/models/huggingface_transformers/discrete_hubert.py"
-        )
-        collect_ignore.append(
-            "speechbrain/lobes/models/huggingface_transformers/discrete_wav2vec2.py"
-        )
-        collect_ignore.append(
-            "speechbrain/lobes/models/huggingface_transformers/discrete_wavlm.py"
+            "speechbrain/lobes/models/huggingface_transformers/discrete_ssl.py"
         )
     try:
         import peft  # noqa: F401
@@ -25,14 +25,16 @@ sample_rate: 16000
 # longer sentences certainly correspond to "open microphones".
 avoid_if_longer_than: 10.0

-ssl_hub: facebook/hubert-base-ls960
+ssl_hub: facebook/hubert-large-ll60k
 freeze_feature_extractor: True
 freeze_ssl: True
 ssl_folder: !ref <save_folder>/hubert_checkpoint
 ssl_layer_num: 7
 batch_size: 128 # batch_size for loading and extracting features. It is different from kmeans_batch_size.
 dataloader_num_workers: 8
 sorting: ascending
+checkpoint_interval: 100


 # Dataloader options
 dataloader_options:
@@ -33,6 +33,7 @@ ssl_layer_num: 7
 batch_size: 128 # batch_size for loading and extracting features. It is different from kmeans_batch_size.
 dataloader_num_workers: 8
 sorting: ascending
+checkpoint_interval: 100

 # Dataloader options
 dataloader_options:
@@ -33,6 +33,7 @@ ssl_layer_num: 7
 batch_size: 128 # batch_size for loading and extracting features. It is different from kmeans_batch_size.
 dataloader_num_workers: 8
 sorting: ascending
+checkpoint_interval: 100

 # Dataloader options
 dataloader_options:
18 changes: 17 additions & 1 deletion recipes/CommonVoice/quantization/train.py
@@ -123,10 +123,23 @@ def audio_pipeline(wav):
         train_set, **hparams["dataloader_options"]
     )

+    os.makedirs(hparams["save_folder"], exist_ok=True)
+    # If you use dataloader checkpoints, make sure to keep all the settings as in the previous run and keep the dataset ordering the same.
+    dataloader_path = os.path.join(
+        hparams["save_folder"], "dataloader-TRAIN.ckpt"
+    )
+    if os.path.exists(dataloader_path):
+        logger.info(
+            f"The dataloader checkpoint is loaded from {dataloader_path}."
+        )
+        train_set._speechbrain_load(dataloader_path, False)
+
     # Load pretrained KMeans model if it exists. Otherwise, create new one.
     checkpoint_path = os.path.join(
-        hparams["save_folder"], f"kmeans_{hparams['num_clusters']}.pt"
+        hparams["save_folder"],
+        f"kmeans-cluster-{hparams['num_clusters']}-layer-{hparams['ssl_layer_num']}.pt",
     )
+
     kmeans_model = fetch_kmeans_model(
         n_clusters=hparams["num_clusters"],
         init=hparams["init"],
@@ -145,10 +158,13 @@ def audio_pipeline(wav):
         kmeans_model,
         train_set,
         hparams["ssl_model"],
+        hparams["save_folder"],
         hparams["ssl_layer_num"],
         kmeans_batch_size=hparams["kmeans_batch_size"],
         device=run_opts["device"],
+        checkpoint_interval=hparams["checkpoint_interval"],
     )

     logger.info(f"Saving kmeans model at {checkpoint_path}.")
     save_model(kmeans_model, checkpoint_path)
+    train_set._speechbrain_save(dataloader_path)
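
Taken together, the two hunks above (and the identical ones in the IEMOCAP recipe below) make quantization training resumable: the dataloader position, the partially trained k-means model, and the final model are all checkpointed, and the checkpoint file name now encodes both the cluster count and the SSL layer. A condensed sketch of the intended flow follows; the import path, the `train` helper's name, and `checkpoint_path` as a keyword of `fetch_kmeans_model` are assumptions inferred from this diff, not confirmed by it.

# Sketch of the resumable flow above; names marked "assumed" are not
# confirmed by the diff.
import os

from speechbrain.utils.kmeans import (  # assumed import path
    fetch_kmeans_model,
    save_model,
    train,
)

def run_quantization(hparams, train_set, run_opts):
    os.makedirs(hparams["save_folder"], exist_ok=True)

    # Restore the dataloader position so a resumed run skips batches that
    # were already clustered (settings and dataset order must not change).
    dataloader_path = os.path.join(
        hparams["save_folder"], "dataloader-TRAIN.ckpt"
    )
    if os.path.exists(dataloader_path):
        train_set._speechbrain_load(dataloader_path, False)

    # One checkpoint per (num_clusters, ssl_layer_num) pair, so models
    # trained on different layers no longer overwrite each other.
    checkpoint_path = os.path.join(
        hparams["save_folder"],
        f"kmeans-cluster-{hparams['num_clusters']}-layer-{hparams['ssl_layer_num']}.pt",
    )
    kmeans_model = fetch_kmeans_model(
        n_clusters=hparams["num_clusters"],
        init=hparams["init"],
        checkpoint_path=checkpoint_path,  # assumed keyword: reload if present
    )

    train(
        kmeans_model,
        train_set,
        hparams["ssl_model"],
        hparams["save_folder"],
        hparams["ssl_layer_num"],
        kmeans_batch_size=hparams["kmeans_batch_size"],
        device=run_opts["device"],
        checkpoint_interval=hparams["checkpoint_interval"],  # 100 in the yamls above
    )

    save_model(kmeans_model, checkpoint_path)
    train_set._speechbrain_save(dataloader_path)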
3 changes: 2 additions & 1 deletion recipes/IEMOCAP/quantization/hparams/train_with_hubert.yaml
@@ -28,12 +28,13 @@ split_ratio: [80, 10, 10]
 skip_prep: False
 sample_rate: 16000

-ssl_hub: facebook/hubert-base-ls960
+ssl_hub: facebook/hubert-large-ll60k
 freeze_feature_extractor: True
 freeze_ssl: True
 ssl_folder: !ref <save_folder>/hubert_checkpoint
 ssl_layer_num: 7
 batch_size: 128 # batch_size for loading and extracting features. It is different from kmeans_batch_size.
+checkpoint_interval: 100

 # Dataloader options
 train_dataloader_opts:
@@ -34,6 +34,7 @@ freeze_ssl: True
 ssl_folder: !ref <save_folder>/wav2vec_checkpoint
 ssl_layer_num: 7
 batch_size: 64 # batch_size for loading and extracting features. It is different from kmeans_batch_size.
+checkpoint_interval: 100

 # Dataloader options
 train_dataloader_opts:
1 change: 1 addition & 0 deletions recipes/IEMOCAP/quantization/hparams/train_with_wavlm.yaml
@@ -34,6 +34,7 @@ freeze_ssl: True
 ssl_folder: !ref <save_folder>/wavlm_checkpoint
 ssl_layer_num: 7
 batch_size: 32 # batch_size for loading and extracting features. It is different from kmeans_batch_size.
+checkpoint_interval: 100

 # Dataloader options
 train_dataloader_opts:
18 changes: 17 additions & 1 deletion recipes/IEMOCAP/quantization/train.py
@@ -119,10 +119,23 @@ def audio_pipeline(wav):
         train_set, **hparams["train_dataloader_opts"]
     )

+    os.makedirs(hparams["save_folder"], exist_ok=True)
+    # If you use dataloader checkpoints, make sure to keep all the settings as in the previous run and keep the dataset ordering the same.
+    dataloader_path = os.path.join(
+        hparams["save_folder"], "dataloader-TRAIN.ckpt"
+    )
+    if os.path.exists(dataloader_path):
+        logger.info(
+            f"The dataloader checkpoint is loaded from {dataloader_path}."
+        )
+        train_set._speechbrain_load(dataloader_path, False)
+
     # Load pretrained KMeans model if it exists. Otherwise, create new one.
     checkpoint_path = os.path.join(
-        hparams["save_folder"], f"kmeans_{hparams['num_clusters']}.pt"
+        hparams["save_folder"],
+        f"kmeans-cluster-{hparams['num_clusters']}-layer-{hparams['ssl_layer_num']}.pt",
     )
+
     kmeans_model = fetch_kmeans_model(
         n_clusters=hparams["num_clusters"],
         init=hparams["init"],
@@ -141,10 +154,13 @@ def audio_pipeline(wav):
         kmeans_model,
         train_set,
         hparams["ssl_model"],
+        hparams["save_folder"],
         hparams["ssl_layer_num"],
         kmeans_batch_size=hparams["kmeans_batch_size"],
         device=run_opts["device"],
+        checkpoint_interval=hparams["checkpoint_interval"],
     )

     logger.info(f"Saving kmeans model at {checkpoint_path}.")
     save_model(kmeans_model, checkpoint_path)
+    train_set._speechbrain_save(dataloader_path)
99 changes: 67 additions & 32 deletions recipes/LJSpeech/TTS/vocoder/hifi_gan_unit/extract_code.py
@@ -9,7 +9,6 @@
 import json
 import pathlib as pl

-import joblib
 import torch
 import torchaudio
 import numpy as np
@@ -19,13 +18,26 @@
     load_pkl,
     save_pkl,
 )
-from speechbrain.lobes.models.huggingface_transformers.wav2vec2 import Wav2Vec2
+from speechbrain.lobes.models.huggingface_transformers import (
+    hubert,
+    wav2vec2,
+    wavlm,
+)
+from speechbrain.lobes.models.huggingface_transformers.discrete_ssl import (
+    DiscreteSSL,
+)

-OPT_FILE = "opt_ljspeech_extract.pkl"
+OPT_FILE = "opt_ljspeech_extract_code.pkl"
 TRAIN_JSON = "train.json"
 VALID_JSON = "valid.json"
 TEST_JSON = "test.json"

+ENCODER_CLASSES = {
+    "HuBERT": hubert.HuBERT,
+    "Wav2Vec2": wav2vec2.Wav2Vec2,
+    "WavLM": wavlm.WavLM,
+}
+

 def setup_logger():
     """Set up a logger with a log format and logging level."""
@@ -94,7 +106,10 @@ def extract_ljspeech(
     data_folder,
     splits,
     kmeans_folder,
-    encoder,
+    kmeans_dataset,
+    num_clusters,
+    encoder_type,
+    encoder_source,
     layer,
     save_folder,
     sample_rate=16000,
@@ -110,11 +125,17 @@
     splits : list
         List of splits to prepare.
     kmeans_folder: str
-        Path to the folder where the k-means model checkpoint is stored.
-    encoder: str
+        Huggingface repository that contains the pretrained k-means model.
+    kmeans_dataset : str
+        Name of the dataset that the k-means model on the HF repo was trained on.
+    num_clusters : int
+        Number of clusters of the targeted k-means model to be downloaded.
+    encoder_type: str
         Name of the model used as feature extractor.
+    encoder_source: str
+        URL of the model used as feature extractor.
-    layer: int
-        Layer from which features are extracted.
+    layer : List[int] (default: [7])
+        Which layers of the SSL model should be used to extract information.
     save_folder: str
         Path to the folder where the speech units are stored.
     sample_rate: int
@@ -124,14 +145,16 @@

     Example
     -------
-    >>> from recipes.LJSpeech.S2ST.extract_code import extract_ljspeech
+    >>> from recipes.LJSpeech.TTS.vocoder.hifi_gan_unit.extract_code import extract_ljspeech
     >>> data_folder = 'data/LJspeech/'
     >>> splits = ['train', 'valid']
-    >>> kmeans_folder = ./Quantization/results/kmeans/4321/save
-    >>> encoder = facebook/hubert-base-ls960
-    >>> layer = 6
+    >>> kmeans_folder = 'speechbrain/SSL_Quantization'
+    >>> kmeans_dataset = 'LibriSpeech-100-360-500'
+    >>> num_clusters = 1000
+    >>> encoder_type = 'HuBERT'
+    >>> encoder_source = 'facebook/hubert-large-ll60k'
+    >>> layer = [7]
    >>> save_folder = 'save/'
-    >>> extract_ljspeech(data_folder, splits, kmeans_folder, encoder, layer, save_folder)
+    >>> extract_ljspeech(data_folder, splits, kmeans_folder, kmeans_dataset, num_clusters, encoder_type, encoder_source, layer, save_folder)
     """
logger = setup_logger()

@@ -143,7 +166,8 @@
         "splits": splits,
         "save_folder": save_folder,
         "kmeans_folder": kmeans_folder,
-        "encoder": encoder,
+        "encoder_type": encoder_type,
+        "encoder_source": encoder_source,
         "layer": layer,
     }

@@ -158,26 +182,32 @@

     save_opt = save_folder / OPT_FILE
     data_folder = pl.Path(data_folder)
-    kmeans_folder = pl.Path(kmeans_folder)
-    kmeans_ckpt = kmeans_folder / "kmeans.ckpt"
-    encoder_save_path = kmeans_folder / "pretrained_models"
+    save_path = save_folder / "savedir"
     code_folder = save_folder / "codes"
     code_folder.mkdir(parents=True, exist_ok=True)

-    logger.info(f"Loading encoder: {encoder} ...")
-    encoder = Wav2Vec2(
-        encoder,
-        encoder_save_path.as_posix(),
-        output_all_hiddens=True,
-        output_norm=False,
-        freeze_feature_extractor=True,
-        freeze=True,
+    logger.info(f"Loading encoder: {encoder_source} ...")
+    if encoder_type not in ENCODER_CLASSES:
+        raise TypeError("Not a supported Encoder")
+
+    encoder_class = ENCODER_CLASSES[encoder_type]
+    encoder = encoder_class(
+        source=encoder_source,
+        save_path=save_path.as_posix(),
+        output_norm=False,
+        freeze=True,
+        freeze_feature_extractor=True,
+        apply_spec_augment=False,
+        output_all_hiddens=True,
     ).to(device)

-    # K-means model
-    logger.info(f"Loading K-means model from {kmeans_ckpt} ...")
-    kmeans_model = joblib.load(open(kmeans_ckpt, "rb"))
-    kmeans_model.verbose = False
+    discrete_encoder = DiscreteSSL(
+        save_path=save_path.as_posix(),
+        ssl_model=encoder,
+        kmeans_dataset=kmeans_dataset,
+        kmeans_repo_id=kmeans_folder,
+        num_clusters=num_clusters,
+    )

     for split in splits:
         dataset_path = data_folder / f"{split}.json"
@@ -193,11 +223,16 @@
                 info.sample_rate, sample_rate,
             )(audio)
             audio = audio.unsqueeze(0).to(device)
-            feats = encoder.extract_features(audio)
-            feats = feats[layer]
-            feats = np_array(feats)
-            pred = kmeans_model.predict(feats)
-            np.save(code_folder / f"{key}.npy", pred)
+            deduplicates = [False for _ in layer]
+            bpe_tokenizers = [None for _ in layer]
+            tokens, _, _ = discrete_encoder(
+                audio,
+                SSL_layers=layer,
+                deduplicates=deduplicates,
+                bpe_tokenizers=bpe_tokenizers,
+            )
+            tokens = np_array(tokens.squeeze(0))
+            np.save(code_folder / f"{key}.npy", tokens)

     logger.info("Extraction completed.")
     save_pkl(conf, save_opt)
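
For reviewers who want to exercise the new tokenization path in isolation, here is a minimal standalone sketch mirroring the calls above. The checkpoint names and repo come from the hparams file below; the wav path is a placeholder, and the return shape of DiscreteSSL is taken on trust from the diff rather than verified here.

# Standalone sketch of DiscreteSSL-based unit extraction, assuming the
# classes and constructor arguments shown in the diff above.
import torch
import torchaudio

from speechbrain.lobes.models.huggingface_transformers.hubert import HuBERT
from speechbrain.lobes.models.huggingface_transformers.discrete_ssl import (
    DiscreteSSL,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Same arguments as the encoder_class(...) call in the diff.
encoder = HuBERT(
    source="facebook/hubert-large-ll60k",
    save_path="savedir",
    output_norm=False,
    freeze=True,
    freeze_feature_extractor=True,
    apply_spec_augment=False,
    output_all_hiddens=True,
).to(device)

discrete_encoder = DiscreteSSL(
    save_path="savedir",
    ssl_model=encoder,
    kmeans_dataset="LibriSpeech-100-360-500",
    kmeans_repo_id="speechbrain/SSL_Quantization",
    num_clusters=1000,
)

audio, sr = torchaudio.load("example.wav")  # placeholder mono utterance
audio = torchaudio.transforms.Resample(sr, 16000)(audio).to(device)

layers = [7]
tokens, _, _ = discrete_encoder(
    audio,
    SSL_layers=layers,
    deduplicates=[False] * len(layers),  # keep consecutive repeats for the vocoder
    bpe_tokenizers=[None] * len(layers),  # no subword tokenizer on top
)
codes = tokens.squeeze(0).cpu().numpy()  # one unit id per frame per selected layer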
23 changes: 13 additions & 10 deletions recipes/LJSpeech/TTS/vocoder/hifi_gan_unit/hparams/train.yaml
@@ -22,19 +22,22 @@ use_tensorboard: False
 #################################
 # Data files and pre-processing #
 #################################
-data_folder: !PLACEHOLDER # e.g., /datasets/ljspeech 16k!
+data_folder: !PLACEHOLDER # e.g., /datasets/ljspeech
 train_json: !ref <save_folder>/train.json
 valid_json: !ref <save_folder>/valid.json
 test_json: !ref <save_folder>/test.json

-splits: ["train", "valid"]
-split_ratio: [90, 10]
+splits: ["train", "valid", "test"]
+split_ratio: [80, 10, 10]
 skip_prep: False

-kmeans_folder: !PLACEHOLDER # e.g., ../../quantization/results/kmeans/4321/save
+kmeans_folder: speechbrain/SSL_Quantization
+kmeans_dataset: LibriSpeech-100-360-500
+num_clusters: 1000
 codes_folder: !ref <save_folder>/codes
-encoder_hub: facebook/hubert-base-ls960
-layer: 6
+encoder_type: HuBERT # one of [HuBERT, Wav2Vec2, WavLM]
+encoder_hub: facebook/hubert-large-ll60k
+layer: [12]

 ################################
 # Audio Parameters #
@@ -43,7 +46,7 @@
 segment_size: 8960
 code_hop_size: 320
 sample_rate: 16000
-
+layer_drop: False

 hop_length: 256
 win_length: 1024
@@ -82,10 +85,10 @@ test_dataloader_opts:
 ################################
 # Model Parameters and model #
 ################################
-duration_predictor: True
+duration_predictor: False

 # embedding params
-num_embeddings: 101 # K-means size + 1 for padding
+num_embeddings: 1001 # K-means size + 1 for padding
 embedding_dim: 128

 # generator params
@@ -168,7 +171,7 @@ l1_spec_loss: !new:speechbrain.lobes.models.HifiGAN.L1SpecLoss
     mel_normalized: !ref <mel_normalized>
     power: !ref <power>
     dynamic_range_compression: !ref <dynamic_range_compression>
-mseg_dur_loss: True
+mseg_dur_loss: False

 generator_loss: !new:speechbrain.lobes.models.HifiGAN.GeneratorLoss
     stft_loss: !ref <stft_loss>
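
A note on the embedding sizing above: num_embeddings must stay in sync with num_clusters, since unit ids 0..999 each need a row and one extra row is reserved for padding, giving 1001. A minimal sketch of that arithmetic, assuming the generator looks units up with a standard torch.nn.Embedding:

import torch

num_clusters = 1000                 # from this yaml
embedding_dim = 128
num_embeddings = num_clusters + 1   # unit ids 0..999 plus one padding row

embedding = torch.nn.Embedding(num_embeddings, embedding_dim)
codes = torch.randint(0, num_clusters, (1, 100))  # (batch, frames) of unit ids
feats = embedding(codes)                          # (1, 100, 128) generator input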