Icefall-based, Transducer ASR Model #2339

Open · wants to merge 6 commits into base: dev_1.18.0

Conversation

HSTEHSTEHSTE

This is a preliminary pull request containing the ART adaptation for Icefall-based ASR models.

@HSTEHSTEHSTE HSTEHSTEHSTE marked this pull request as draft November 28, 2023 16:25
@beat-buesser beat-buesser self-requested a review November 30, 2023 15:38
@beat-buesser beat-buesser self-assigned this Nov 30, 2023
@beat-buesser beat-buesser added the enhancement New feature or request label Nov 30, 2023
@beat-buesser (Collaborator)

Hi @HSTEHSTEHSTE, thank you for your pull request! Could you please target one of the dev branches, either dev_1.17.0 or dev_1.18.0?

@beat-buesser beat-buesser added this to the ART 1.18.0 milestone Nov 30, 2023
@github-advanced-security (bot) left a comment

CodeQL found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.


| Repository link: https://github.com/k2-fsa/icefall/tree/master
"""
import ast

Check notice (Code scanning / CodeQL): Unused import. Import of 'ast' is not used.
| Repository link: https://github.com/k2-fsa/icefall/tree/master
"""
import ast
from argparse import Namespace

Check notice (Code scanning / CodeQL): Unused import. Import of 'Namespace' is not used.

import numpy as np

from art import config

Check notice (Code scanning / CodeQL): Unused import. Import of 'config' is not used.
features, _, _ = self.transform_model_input(x=masked_adv_input[sample_index])
x_lens = torch.tensor([features.shape[1]]).to(torch.int32).to(self.device)
y = k2.RaggedTensor(original_output[sample_index])
loss = self.transducer_model(x=features, x_lens=x_lens, y=y)

Check notice (Code scanning / CodeQL): Unused local variable. Variable loss is not used.
raise ValueError("This estimator does not support `postprocessing_defences`.")

# Set cpu/gpu device
self._device = torch.device("cpu")

Check warning (Code scanning / CodeQL): Overwriting attribute in super-class or sub-class. Assignment overwrites attribute _device, which was previously defined in superclass PyTorchEstimator.
isnan = torch.isnan(x[i])
nisnan = torch.sum(isnan).item()
if nisnan > 0:
logging.info("input isnan={}/{} {}".format(nisnan, x[i].shape, x[i][isnan], torch.max(torch.abs(x[i]))))

Check warning (Code scanning / CodeQL): Unused argument in a formatting call. Too many arguments for string format: the format "input isnan={}/{} {}" requires only 3, but 4 are provided.
import ast
from argparse import Namespace
import logging
from typing import Dict, List, Optional, Tuple, TYPE_CHECKING, Union

Check notice (Code scanning / CodeQL): Unused import. Import of 'Dict' is not used.
from art import config
from art.estimators.pytorch import PyTorchEstimator
from art.estimators.speech_recognition.speech_recognizer import SpeechRecognizerMixin, PytorchSpeechRecognizerMixin
from art.utils import get_file

Check notice (Code scanning / CodeQL): Unused import. Import of 'get_file' is not used.
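
Taken together, the unused-import notices suggest the module header can be trimmed to the names that actually appear in the quoted snippets. A sketch of the trimmed header, assuming nothing else in the file references the flagged imports:

import logging
from typing import List, Optional, Tuple, TYPE_CHECKING, Union

import numpy as np

from art.estimators.pytorch import PyTorchEstimator
from art.estimators.speech_recognition.speech_recognizer import SpeechRecognizerMixin, PytorchSpeechRecognizerMixin
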
vtln_warp: float = 1.0

params = asdict(FbankConfig())
params.update({"sample_frequency": 16000, "snip_edges": False, "num_mel_bins": 23})

Check failure (Code scanning / CodeQL): Modification of parameter with default. This expression mutates a default value.

params = asdict(FbankConfig())
params.update({"sample_frequency": 16000, "snip_edges": False, "num_mel_bins": 23})
params["frame_shift"] *= 1000.0

Check failure (Code scanning / CodeQL): Modification of parameter with default. This expression mutates a default value.
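
Both failures flag the same pattern: a dict that CodeQL traces back to a mutable default is updated in place. A minimal sketch of the usual remedy, building a fresh dict on every call (the function name build_fbank_params and the overrides argument are illustrative, not the PR's actual API):

import copy
from dataclasses import asdict

def build_fbank_params(overrides=None):
    # Construct a new dict per call so no shared default is ever mutated
    params = asdict(FbankConfig())  # FbankConfig as defined in this PR
    params.update({"sample_frequency": 16000, "snip_edges": False, "num_mel_bins": 23})
    if overrides:
        params.update(copy.deepcopy(overrides))  # never mutate the caller's dict
    params["frame_shift"] *= 1000.0  # torchaudio/Kaldi expects milliseconds
    return params
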
for line in lexicon_file:
if len(line.strip()) > 0: # and '<UNK>' not in line and '<s>' not in line and '</s>' not in line:
vocab_size += 1
except:

Check notice (Code scanning / CodeQL): Empty except. 'except' clause does nothing but pass and there is no explanatory comment.
if len(line.strip()) > 0:
word2id[line.split()[0]] = id
id += 1
except:

Check notice (Code scanning / CodeQL): Empty except. 'except' clause does nothing but pass and there is no explanatory comment.
for line in lexicon_file:
if len(line.strip()) > 0: # and '<UNK>' not in line and '<s>' not in line and '</s>' not in line:
vocab_size += 1
except:

Check notice (Code scanning / CodeQL): Except block handles 'BaseException'. Except block directly handles BaseException.
if len(line.strip()) > 0:
word2id[line.split()[0]] = id
id += 1
except:

Check notice (Code scanning / CodeQL): Except block handles 'BaseException'. Except block directly handles BaseException.
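
All four notices point at the same two bare except: clauses. A sketch of the conventional fix for the vocabulary-counting loop, narrowing the exception and logging it instead of silently passing (lexicon_path is a hypothetical name for whatever path the PR actually opens):

import logging

vocab_size = 0
try:
    with open(lexicon_path, encoding="utf-8") as lexicon_file:
        for line in lexicon_file:
            if len(line.strip()) > 0:
                vocab_size += 1
except OSError as err:
    # Narrow the exception type and explain why it is tolerated
    logging.warning("Could not read lexicon file %s: %s", lexicon_path, err)
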
# Set cpu/gpu device
self._device = torch.device("cpu")
if torch.cuda.is_available():
self._device = torch.device("cuda", 0)

Check warning (Code scanning / CodeQL): Overwriting attribute in super-class or sub-class. Assignment overwrites attribute _device, which was previously defined in superclass PyTorchEstimator.
@HSTEHSTEHSTE HSTEHSTEHSTE changed the title from "Todo: Attacks against Icefall-based, Transducer ASR Models" to "Icefall-based, Transducer ASR Model" on Jan 31, 2024
@f4str (Collaborator) left a comment

This looks good, thanks for implementing the new ASR model. I left a few comments on things that should be addressed.

Comment on lines 113 to 116
# Set cpu/gpu device
self._device = torch.device("cpu")
if torch.cuda.is_available():
self._device = torch.device("cuda", 0)
Collaborator

This is already defined in the PyTorchEstimator superclass and is redundant, hence the CodeQL warning. This segment should be removed.
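
For illustration, a minimal sketch of what that removal leaves behind; BaseEstimator stands in for ART's PyTorchEstimator, which already resolves the device once in its own __init__:

import torch

class BaseEstimator:
    """Stand-in for PyTorchEstimator: resolves the device exactly once."""
    def __init__(self):
        use_cuda = torch.cuda.is_available()
        self._device = torch.device("cuda", 0) if use_cuda else torch.device("cpu")

    @property
    def device(self) -> torch.device:
        return self._device

class IcefallEstimator(BaseEstimator):
    def __init__(self):
        super().__init__()
        # No self._device assignment here; downstream code reads self.device,
        # e.g. torch.tensor([...], dtype=torch.int32, device=self.device)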

isnan = torch.isnan(x[i])
nisnan = torch.sum(isnan).item()
if nisnan > 0:
logging.info("input isnan={}/{} {}".format(nisnan, x[i].shape, x[i][isnan], torch.max(torch.abs(x[i]))))
Collaborator

Suggested change
logging.info("input isnan={}/{} {}".format(nisnan, x[i].shape, x[i][isnan], torch.max(torch.abs(x[i]))))
logging.info(f"input isnan={nisnan}/{x[i][isnan]} {torch.max(torch.abs(x[i]))}")

Use f-strings to avoid confusion with string formatting

Comment on lines 220 to 245
@dataclass
class FbankConfig:
# Spectrogram-related part
dither: float = 0.0
window_type: str = "povey"
# Note that frame_length and frame_shift will be converted to milliseconds before torchaudio/Kaldi sees them
frame_length: float = 0.025
frame_shift: float = 0.01
remove_dc_offset: bool = True
round_to_power_of_two: bool = True
energy_floor: float = 1e-10
min_duration: float = 0.0
preemphasis_coefficient: float = 0.97
raw_energy: bool = True

# Fbank-related part
low_freq: float = 20.0
high_freq: float = -400.0
num_mel_bins: int = 40
use_energy: bool = False
vtln_low: float = 100.0
vtln_high: float = -500.0
vtln_warp: float = 1.0

params = asdict(FbankConfig())
params.update({"sample_frequency": 16000, "snip_edges": False, "num_mel_bins": 23})
Collaborator

The dataclass seems a bit extraneous. What's the issue with just using a dictionary?
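
A sketch of that dictionary alternative, folding the dataclass defaults and the update() overrides into one literal (values copied from the snippet above; only the frame_shift conversion appears in the diff, so frame_length handling is left as in the original):

def make_fbank_params() -> dict:
    """Same defaults as FbankConfig, as a plain dict, with the overrides applied."""
    params = {
        # Spectrogram-related part
        "dither": 0.0,
        "window_type": "povey",
        "frame_length": 0.025,
        "frame_shift": 0.01,
        "remove_dc_offset": True,
        "round_to_power_of_two": True,
        "energy_floor": 1e-10,
        "min_duration": 0.0,
        "preemphasis_coefficient": 0.97,
        "raw_energy": True,
        # Fbank-related part
        "low_freq": 20.0,
        "high_freq": -400.0,
        "use_energy": False,
        "vtln_low": 100.0,
        "vtln_high": -500.0,
        "vtln_warp": 1.0,
        # Overrides previously applied via params.update(...)
        "sample_frequency": 16000,
        "snip_edges": False,
        "num_mel_bins": 23,
    }
    params["frame_shift"] *= 1000.0  # ms conversion, as in the original snippet
    return params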

features, _, _ = self.transform_model_input(x=masked_adv_input[sample_index])
x_lens = torch.tensor([features.shape[1]]).to(torch.int32).to(self.device)
y = k2.RaggedTensor(original_output[sample_index])
loss = self.transducer_model(x=features, x_lens=x_lens, y=y)
Collaborator

Should the loss not also be returned?

Author

I initially assumed so too, but apparently not (if the other estimator implementations are anything to go by).

Comment on lines 40 to 55
sudo apt-get update
sudo apt-get -y -q install ffmpeg libavcodec-extra
python -m pip install --upgrade pip setuptools wheel
pip3 install -r requirements_test.txt
apt-get update \
&& apt-get install -y \
libgl1-mesa-glx \
libx11-xcb1 \
git \
gcc \
mono-mcs \
libavcodec-extra \
ffmpeg \
curl \
libsndfile-dev \
libsndfile1 \
Collaborator

Is there a reason why this is done as two apt install steps? Can't they be combined into one?
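
For reference, one way the two steps could collapse into a single update/install pair (package list copied from the snippet above; this assumes both steps run in the same sudo-capable environment, which the mixed sudo/non-sudo lines leave ambiguous):

sudo apt-get update && sudo apt-get install -y -q \
    ffmpeg \
    libavcodec-extra \
    libgl1-mesa-glx \
    libx11-xcb1 \
    git \
    gcc \
    mono-mcs \
    curl \
    libsndfile-dev \
    libsndfile1
python -m pip install --upgrade pip setuptools wheel
pip3 install -r requirements_test.txt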

Comment on lines 267 to 269
indices = torch.LongTensor(indices)
num_frames = torch.IntTensor([num_frames[idx] for idx in indices])
start_frames = torch.zeros(len(x), dtype=torch.int)
Collaborator

Suggested change
indices = torch.LongTensor(indices)
num_frames = torch.IntTensor([num_frames[idx] for idx in indices])
start_frames = torch.zeros(len(x), dtype=torch.int)
indices = torch.tensor(indices, dtype=torch.int64, device=self._device)
num_frames = torch.tensor([num_frames[idx] for idx in indices], dtype=torch.int32, device=self._device)
start_frames = torch.zeros(len(x), dtype=torch.int32, device=self._device)

Let's use the modern torch tensor instantiation on the proper device for optimal performance.

return self._input_shape # type: ignore

@property
def model(self):
Collaborator

Suggested change
def model(self):
def model(self) -> "torch.nn.Module":

Missing return type

contrib/lingvo-patched-decoder.py (review comment outdated and resolved)
Labels: enhancement (New feature or request)
Projects: ART 1.18.0 (awaiting triage)
Linked issues: none yet
3 participants