
GPU Limit Error when using stt_multilingual_fastconformer_hybrid_large_pc model for long audio file (55 minutes) #9071

tempops opened this issue Apr 30, 2024 · 6 comments
tempops commented Apr 30, 2024

Describe the bug

I used the Long-audio-transcription-Citrinet.ipynb notebook to transcribe a long audio file. The default Citrinet model performs well, but because of its high WER I wanted to try some of the more recent models, so I swapped in the Fast Conformer model and got this error during computation:

Error:
OutOfMemoryError: CUDA out of memory. Tried to allocate 30.85 GiB. GPU 0 has a total capacity of 14.75 GiB of which 9.75 GiB is free. Process 167515 has 4.99 GiB memory in use. Of the allocated memory 3.18 GiB is allocated by PyTorch, and 1.68 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management
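
The error message itself suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, but since the attempted allocation (30.85 GiB) exceeds the GPU's total capacity (14.75 GiB), reducing fragmentation alone seems unlikely to help. For completeness, a minimal sketch of that workaround:

import os

# Must be set before PyTorch initializes CUDA; the value is taken from
# the error message's own suggestion. It can relieve fragmentation, but
# cannot satisfy a 30.85 GiB request on a 14.75 GiB card.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"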

Environment overview (please complete the following information)

Followed the instructions in the Google Colab notebook: Long-audio-transcription-Citrinet.ipynb

Environment details

Google Colab notebook: Long-audio-transcription-Citrinet.ipynb

I wanted to know whether the model I am using is the issue, and if so, which of the newer models I can use for transcribing longer audio files (1 hour and greater).

nithinraok (Collaborator) commented

55 minutes is too long for FastConformer models with full attention.
Could you try this model: https://huggingface.co/nvidia/parakeet-tdt_ctc-1.1b
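
Alternatively, NeMo lets you switch a FastConformer checkpoint to limited-context (local) attention before transcribing, which avoids the quadratic memory growth of full attention. A minimal sketch, where the [128, 128] context size is illustrative rather than tuned:

import nemo.collections.asr as nemo_asr

# Load the multilingual FastConformer checkpoint from the original report.
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    "nvidia/stt_multilingual_fastconformer_hybrid_large_pc"
)

# Switch the encoder to local (limited-context) attention so memory
# scales roughly linearly with audio length instead of quadratically.
asr_model.change_attention_model(
    self_attention_model="rel_pos_local_attn",
    att_context_size=[128, 128],  # illustrative left/right context, not tuned
)

transcriptions = asr_model.transcribe(["long_audio.wav"])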

tempops commented May 15, 2024

Thank you, I will try this out.
By the way, per the documentation on Hugging Face, under which license do this and the other Parakeet models (RNNT) fall? It says CC BY 4.0; does this grant commercial use?

nithinraok (Collaborator) commented

Yes, CC BY 4.0 grants commercial usage.

tempops commented May 23, 2024

Thank you.
I tried the parakeet-tdt_ctc-1.1b model; inference time on GPU for long-form audio is really good. The WER is also small, except for names and surnames, but that is understandable.

One issue I noticed is that the model output has sporadic errors where it randomly fails to insert a space between two words. I am not sure why this occurs; for example, when it is supposed to transcribe 'that is' or 'community driven', it instead outputs 'thatis' or 'communitydriven'.

Is this an inherent issue in the model output, or is there some concatenation code that might be creating it? I ran the model using the sample code from Hugging Face:

import nemo.collections.asr as nemo_asr

# Load the pretrained Parakeet TDT-CTC 1.1B checkpoint from Hugging Face.
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt_ctc-1.1b")

# Transcribe a single audio file and print the result.
transcriptions = asr_model.transcribe(["sample.wav"])
print(transcriptions)

I wanted to know whether this is something that can be mitigated before the final output is produced, or whether it is inherent to how the model outputs data.
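
As a stopgap, a hypothetical post-processing pass could re-split merged tokens against a word list before emitting the final transcript. KNOWN_WORDS below is purely illustrative; any English lexicon or domain vocabulary could be substituted:

# Hypothetical post-processing sketch, not part of the NeMo API.
KNOWN_WORDS = {"that", "is", "community", "driven"}  # illustrative word list

def split_merged(token: str) -> str:
    """Greedily split a token into two known words; otherwise return it unchanged."""
    lower = token.lower()
    if lower in KNOWN_WORDS:
        return token
    for i in range(1, len(lower)):
        if lower[:i] in KNOWN_WORDS and lower[i:] in KNOWN_WORDS:
            return f"{token[:i]} {token[i:]}"
    return token

text = "communitydriven projects show thatis possible"
print(" ".join(split_merged(w) for w in text.split()))
# -> community driven projects show that is possible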

nithinraok (Collaborator) commented

Thanks for your feedback; it's wonderful that it solved the long-form inference problem.

It's part of the model, which provides PnC (punctuation and capitalization) by default. We will consider providing a new version with improvements.

tempops commented May 24, 2024

I am assuming that by PnC you mean punctuation and capitalization. If so, that is not the issue I was highlighting. I meant that when the transcribed words should be 'that is' or 'community driven', the model removes the space between the two words, so the output is 'thatis' or 'communitydriven'.

Is this because of the concatenation method used to join output strings? If so, can we correct it before the final output is generated? The code provided does not include any concatenation methods, only the inference call:

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt_ctc-1.1b")

transcriptions = asr_model.transcribe(["sample.wav"])

print(transcriptions)
