clip_timestamps does not work across multiple files [faster-whisper 1.0.2] #839

Closed
nonnoxer opened this issue May 15, 2024 · 3 comments

@nonnoxer (Contributor)

I was testing faster-whisper with 2 very long audio files (about 30 minutes each). Both were generated from the GigaSpeech dataset: long.wav is many audio files concatenated into one continuous file, and silence.wav is audio files joined with 3-minute-long complete silences in between. I then ran Silero VAD externally to generate speech timestamps for each file and passed those timestamps to the model via the clip_timestamps parameter.

When testing this functionality on silence.wav alone, the generated transcript was as expected. However, when running the model on long.wav first (where there was some hallucination) and then on silence.wav, the silence.wav transcription was completely hallucinated and the provided clip_timestamps were not used.

Code:

from faster_whisper import WhisperModel
from faster_whisper.audio import decode_audio
import os
import torch


DATA_DIR = "test_data/wav/processed"
OUTPUT_DIR = "test_data/output/asr"
FS = 16000

model = WhisperModel("base")

vad_parameters = {
    "threshold": 0.28,
    "min_speech_duration_ms": 250,
    "max_speech_duration_s": 10,
    "min_silence_duration_ms": 100,
    "window_size_samples": 1536,
    "speech_pad_ms": 30
}

vad, utils = torch.hub.load("snakers4/silero-vad", model="silero_vad", onnx=True)
(get_speech_timestamps,
 save_audio,
 read_audio,
 VADIterator,
 collect_chunks) = utils

for file_name in os.listdir(DATA_DIR):
    output_file = os.path.splitext(file_name)[0] + ".txt"
    output_path = os.path.join(OUTPUT_DIR, output_file)

    file_path = os.path.join(DATA_DIR, file_name)
    
    # clip_timestamps generated here
    audio = decode_audio(file_path)
    speech_timestamps = get_speech_timestamps(torch.tensor(audio), vad, sampling_rate=FS, **vad_parameters)
    clip_timestamps_list = []
    for entry in speech_timestamps:
        clip_timestamps_list.append(str(entry["start"] / FS))
        clip_timestamps_list.append(str(entry["end"] / FS))
    clip_timestamps = ",".join(clip_timestamps_list)
    print(clip_timestamps_list)

    with open(output_path, "w") as f:
        segments, info = model.transcribe(
            audio,
            beam_size=5,
            vad_filter=False,
            clip_timestamps=clip_timestamps
        )

        for segment in segments:
            segment_text = segment.text.strip()
            f.write(f"{segment.start:.2f}\t{segment.end:.2f}\t{segment_text}\n")

Files:
https://drive.google.com/file/d/1mfDWNDmcZvW9M-zMaNWFy9tVUOdUrZnA/view?usp=sharing

Environment:
faster-whisper==1.0.2
onnxruntime-gpu==1.17.1
torch==2.2.2+cu121
torchaudio==2.2.2+cu121
torchvision==0.17.2+cu121

Python 3.10.12
CUDA 12.1
requirements.txt

Expected output:
The given clip_timestamps are used. This is the behaviour when silence.wav is run by itself.
[Screenshot 2024-05-15 160611]

Actual output:
The model transcribes the silence (the audio only starts at 180 s), completely disregarding the given clip_timestamps. This is the same audio file as above; the only difference is that another file was transcribed first.
[Screenshot 2024-05-15 160719]

I feel this issue is worth raising because running the same audio should give exactly the same output every time, which is inexplicably not happening here. Additionally, the clip_timestamps parameter should cause the model to use the given timestamps when transcribing. Any advice on why this happens and how it can be addressed would be greatly appreciated.

@nonnoxer (Contributor, Author)

I tried to examine how the relevant variables are passed around in the source code of transcribe.py.

Logging the lengths of clip_timestamps (the argument passed to transcribe()), options.clip_timestamps, and seek_clips gives:

long.wav
    Len clip_timestamps list: 714
    Len clip_timestamps passed to transcribe(): 5894, type <class 'str'>
    Len options.clip_timestamps before split: 5894, type <class 'str'>
    Len options.clip_timestamps after split: 714, type <class 'list'>
    Len seek_clips: 357

silence.wav
    Len clip_timestamps list: 26
    Len clip_timestamps passed to transcribe(): 214, type <class 'str'>
    Len options.clip_timestamps before split: 714, type <class 'list'>
    Len options.clip_timestamps after split: 714, type <class 'list'>
    Len seek_clips: 357

It seems that on the second file, options.clip_timestamps is not updated from the clip_timestamps argument passed in: it still holds the 714 parsed timestamps from long.wav, which results in the wrong seek_clips being used.
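Since TranscriptionOptions in transcribe.py is a NamedTuple, one way this can happen is if the string-to-list conversion is assigned onto the class rather than onto the options instance (which matches the fix merged later in this thread). A minimal sketch of that pitfall, with illustrative code only, not the library's actual implementation:

from typing import List, NamedTuple, Union

class TranscriptionOptions(NamedTuple):
    clip_timestamps: Union[str, List[float]]

def buggy_parse(options):
    if isinstance(options.clip_timestamps, str):
        # Bug: assigning to the CLASS replaces the NamedTuple field
        # descriptor, so every instance read afterwards returns this
        # stale list instead of its own value.
        TranscriptionOptions.clip_timestamps = [
            float(ts) for ts in options.clip_timestamps.split(",")
        ]
    return options.clip_timestamps

print(buggy_parse(TranscriptionOptions("0.0,1.5")))  # [0.0, 1.5]
print(buggy_parse(TranscriptionOptions("3.0,4.5")))  # [0.0, 1.5] again
# The second call sees a list, skips the split, and reuses the first file's clips.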

nonnoxer added a commit to nonnoxer/faster-whisper that referenced this issue May 16, 2024
Changed the code to update the options object instead of the TranscriptionOptions class, which was likely the cause of the unexpected behaviour
@nonnoxer (Contributor, Author)

I was able to make a simple fix for this problem and created pull request #842.
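For reference, a sketch of the instance-level approach, assuming the NamedTuple shown above (the actual change is in #842):

def fixed_parse(options):
    if isinstance(options.clip_timestamps, str):
        # _replace returns a new NamedTuple instance carrying the parsed
        # list, leaving the class untouched, so nothing leaks into later
        # transcribe() calls.
        options = options._replace(
            clip_timestamps=[
                float(ts) for ts in options.clip_timestamps.split(",")
            ]
        )
    return options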

nonnoxer added a commit to nonnoxer/faster-whisper that referenced this issue May 16, 2024
@nonnoxer nonnoxer changed the title clip_timestamps does not work, cross audio hallucination [faster-whisper 1.0.2] clip_timestamps does not work across multiple files [faster-whisper 1.0.2] May 16, 2024
@trungkienbkhn (Collaborator)

@nonnoxer, thanks for your PR. I merged it.

din-grogu pushed a commit to din-grogu/faster-whisper-gemini that referenced this issue Jun 5, 2024
Fix SYSTRAN#839 (SYSTRAN#842)

Changed the code to update the options object instead of the TranscriptionOptions class, which was likely the cause of the unexpected behaviour