
GPU Limit Error when using stt_multilingual_fastconformer_hybrid_large_pc model for long audio file (55 minutes) #9071

tempops opened this issue Apr 30, 2024 · 6 comments
tempops commented Apr 30, 2024

Describe the bug

I used the Long-audio-transcription-Citrinet.ipynb notebook to transcribe a long audio file. The default Citrinet model performs well, but because of its high WER I wanted to try some of the more recent models, so I swapped in the Fast Conformer model and got this error during computation:

Error:
OutOfMemoryError: CUDA out of memory. Tried to allocate 30.85 GiB. GPU 0 has a total capacity of 14.75 GiB of which 9.75 GiB is free. Process 167515 has 4.99 GiB memory in use. Of the allocated memory 3.18 GiB is allocated by PyTorch, and 1.68 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management
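
The error message itself suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, but since the attempted allocation (30.85 GiB) exceeds the GPU's total capacity (14.75 GiB), reducing fragmentation alone seems unlikely to help. For completeness, a minimal sketch of that workaround:

import os

# Must be set before PyTorch initializes CUDA; the value is taken from
# the error message's own suggestion. It can relieve fragmentation, but
# cannot satisfy a 30.85 GiB request on a 14.75 GiB card.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"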

Environment overview (please complete the following information)

Followed the instructions in the Google Colab notebook: Long-audio-transcription-Citrinet.ipynb

Environment details

Google Colab notebook: Long-audio-transcription-Citrinet.ipynb

I wanted to know whether the model I am using is the issue, and if so, which of the newer models I can use for transcribing longer audio files (1 hour and greater).

nithinraok (Collaborator) commented

55 minutes is too long for FastConformer models with full attention.
Could you try this model: https://huggingface.co/nvidia/parakeet-tdt_ctc-1.1b
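
Alternatively, NeMo lets you switch a FastConformer checkpoint to limited-context (local) attention before transcribing, which avoids the quadratic memory growth of full attention. A minimal sketch, where the [128, 128] context size is illustrative rather than tuned:

import nemo.collections.asr as nemo_asr

# Load the multilingual FastConformer checkpoint from the original report.
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    "nvidia/stt_multilingual_fastconformer_hybrid_large_pc"
)

# Switch the encoder to local (limited-context) attention so memory
# scales roughly linearly with audio length instead of quadratically.
asr_model.change_attention_model(
    self_attention_model="rel_pos_local_attn",
    att_context_size=[128, 128],  # illustrative left/right context, not tuned
)

transcriptions = asr_model.transcribe(["long_audio.wav"])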

tempops commented May 15, 2024

Thank you, I will try this out.
By the way, per the documentation on Hugging Face, under which license do this and the other Parakeet models (RNNT) fall? It says CC BY 4.0; does this grant commercial use?

nithinraok (Collaborator) commented

Yes, CC BY 4.0 grants commercial usage.

tempops commented May 23, 2024

Thank you.
I tried the parakeet-tdt_ctc-1.1b model; inference time on GPU for long-form audio is really good. The WER is also small, except for names and surnames, but that is understandable.

One issue I noticed is that the model output has sporadic errors where it randomly fails to insert a space between two words. I am not sure why this occurs; for example, when it is supposed to transcribe 'that is' or 'community driven', it instead outputs 'thatis' or 'communitydriven'.

Is this an inherent issue in the model output, or is there some concatenation code that might be creating it? I ran the model using the sample code from Hugging Face:

import nemo.collections.asr as nemo_asr

# Load the pretrained Parakeet TDT-CTC 1.1B checkpoint from Hugging Face.
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt_ctc-1.1b")

# Transcribe a single audio file and print the result.
transcriptions = asr_model.transcribe(["sample.wav"])
print(transcriptions)

I wanted to know whether this is something that can be mitigated before the final output is produced, or whether it is inherent to how the model outputs data.
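
As a stopgap, a hypothetical post-processing pass could re-split merged tokens against a word list before emitting the final transcript. KNOWN_WORDS below is purely illustrative; any English lexicon or domain vocabulary could be substituted:

# Hypothetical post-processing sketch, not part of the NeMo API.
KNOWN_WORDS = {"that", "is", "community", "driven"}  # illustrative word list

def split_merged(token: str) -> str:
    """Greedily split a token into two known words; otherwise return it unchanged."""
    lower = token.lower()
    if lower in KNOWN_WORDS:
        return token
    for i in range(1, len(lower)):
        if lower[:i] in KNOWN_WORDS and lower[i:] in KNOWN_WORDS:
            return f"{token[:i]} {token[i:]}"
    return token

text = "communitydriven projects show thatis possible"
print(" ".join(split_merged(w) for w in text.split()))
# -> community driven projects show that is possible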

nithinraok (Collaborator) commented

Thanks for your feedback; it's wonderful that it solved the long-form inference problem.

It's part of the model, which provides PnC (punctuation and capitalization) by default. We will consider providing a new version with improvements.

tempops commented May 24, 2024

I am assuming that by PnC you mean punctuation and capitalization. If so, that is not the issue I was highlighting. I meant that when the transcribed words should be 'that is' or 'community driven', the model removes the space between the two words, so the output is 'thatis' or 'communitydriven'.

Is this because of the concatenation method used to join output strings? If so, can we correct it before the final output is generated? The code provided does not include any concatenation methods, only the inference call:

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt_ctc-1.1b")

transcriptions = asr_model.transcribe(["sample.wav"])

print(transcriptions)
