GPU Limit Error when using stt_multilingual_fastconformer_hybrid_large_pc model for long audio file (55 minutes) #9071
Comments
55 minutes is high for FastConformer models with full attention.
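A common workaround for long-form inference with Conformer/FastConformer models is to switch the encoder from full self-attention to limited-context (local) attention, so memory grows linearly rather than quadratically with audio length. A minimal sketch, assuming NeMo's `change_attention_model` method on the loaded model; the helper name `enable_local_attention` and the 128-frame context size are illustrative choices, not values from this thread:

```python
def enable_local_attention(asr_model, context_frames=128):
    """Switch a loaded NeMo Conformer/FastConformer model to limited-context
    self-attention so long files no longer need quadratic attention memory.

    asr_model is assumed to expose NeMo's change_attention_model method;
    the context size here is an illustrative default, not a tuned value.
    """
    asr_model.change_attention_model(
        self_attention_model="rel_pos_local_attn",
        att_context_size=[context_frames, context_frames],
    )
```

After switching, `asr_model.transcribe([...])` can be called as before; combining this with chunked/buffered inference bounds memory further.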
Thank you, will try this out.
Yes, CC-BY-4.0 grants commercial usage.
Thank you. I noticed one issue: the model output has sporadic errors where it randomly omits the space between two words. I am not sure why this occurs; for example, when it is supposed to transcribe 'that is' or 'community driven', it instead outputs 'thatis' or 'communitydriven'. Is this an inherent issue in the model output, or is there some concatenation code that might be creating it? I ran the model using the sample code on Hugging Face:

```python
import nemo.collections.asr as nemo_asr

transcriptions = asr_model.transcribe(["sample.wav"])
print(transcriptions)
```

I wanted to know whether this is something that can be mitigated before the final output is produced, or whether it is inherent to how the model outputs data.
Thanks for your feedback; it's wonderful that it solved the long-form inference problem. PnC is part of the model and is provided by default. We will consider providing a new version with improvements.
I am assuming by PnC you mean Punctuation and Capitalization. If so, that is not the issue I was highlighting. I meant that when the transcribed words should be 'that is' or 'community driven', the spaces between the words are removed in the output, so it becomes 'thatis' or 'communitydriven'. Is this caused by the method used to concatenate output strings? If so, can it be corrected before the final output is generated? The code I used includes no concatenation, only the inference call:

```python
import nemo.collections.asr as nemo_asr

transcriptions = asr_model.transcribe(["sample.wav"])
print(transcriptions)
```
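Until the model-side cause is found, a generic post-processing pass can re-insert missing spaces by segmenting run-together tokens against a word list. This is a workaround sketch, not part of NeMo; the function name `restore_spaces`, the toy vocabulary, and the fewest-words preference are all assumptions for illustration:

```python
def restore_spaces(token, vocab, max_word_len=20):
    """Split a run-together token like 'thatis' into known words ('that is').

    Dynamic programming over prefixes, preferring segmentations with fewer
    (hence longer) words. Returns the token unchanged if it cannot be
    fully segmented into vocabulary words.
    """
    n = len(token)
    best = [None] * (n + 1)  # best[i] = word list covering token[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = token[j:i]
            if best[j] is not None and word in vocab:
                candidate = best[j] + [word]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return " ".join(best[n]) if best[n] else token

# toy vocabulary for illustration; a real pass would use a large word list
vocab = {"that", "is", "community", "driven"}
```

In practice this would only be applied to tokens that are not already valid words, to avoid splitting legitimate compounds.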
Describe the bug
I used the Long-audio-transcription-Citrinet.ipynb notebook to transcribe a long audio file. The default Citrinet model performs well, but due to its high WER I wanted to try some of the recent models, so I swapped in the FastConformer model and got this error during computation.
Error:

```
OutOfMemoryError: CUDA out of memory. Tried to allocate 30.85 GiB. GPU 0 has a total capacity of 14.75 GiB of which 9.75 GiB is free. Process 167515 has 4.99 GiB memory in use. Of the allocated memory 3.18 GiB is allocated by PyTorch, and 1.68 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management
```
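For completeness, the allocator setting the error message suggests can be exported before launching Python. Note this only mitigates fragmentation; it will not rescue a genuine ~30 GiB attention allocation on a 14.75 GiB GPU, so changing the model or attention mode is still needed:

```shell
# Suggested by the PyTorch error text: let the caching allocator use
# expandable segments to reduce fragmentation-related OOMs.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
echo "$PYTORCH_CUDA_ALLOC_CONF"
```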
Environment overview (please complete the following information)
Followed the instructions in the Google Colab notebook: Long-audio-transcription-Citrinet.ipynb
Environment details
Google Colab notebook: Long-audio-transcription-Citrinet.ipynb
I wanted to know whether the model I am using is the issue and, if so, which of the newer models I can use for longer audio file transcriptions (1 hour and greater).
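One model-agnostic way to handle hour-long files is to split the audio into fixed-length chunks, transcribe each chunk, and stitch the texts together (NeMo also ships buffered/chunked inference scripts for this). A stdlib-only sketch assuming an uncompressed WAV input; the helper name `transcribe_long` and the 30-second default chunk size are illustrative:

```python
import os
import tempfile
import wave

def transcribe_long(path, transcribe_fn, chunk_seconds=30):
    """Split a WAV file into chunks, run transcribe_fn on each temporary
    chunk file, and join the partial transcripts with spaces.

    Naive cutting can split a word at a chunk boundary; real pipelines
    use overlap or VAD-based segmentation to avoid that.
    """
    texts = []
    with wave.open(path, "rb") as src:
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            fd, tmp_path = tempfile.mkstemp(suffix=".wav")
            os.close(fd)
            with wave.open(tmp_path, "wb") as dst:
                dst.setnchannels(src.getnchannels())
                dst.setsampwidth(src.getsampwidth())
                dst.setframerate(src.getframerate())
                dst.writeframes(frames)
            texts.append(transcribe_fn(tmp_path))
            os.remove(tmp_path)
    return " ".join(texts)

# demo on a synthetic 2-second silent mono 16 kHz file, with a stub
# transcriber standing in for the real model call
fd, demo_path = tempfile.mkstemp(suffix=".wav")
os.close(fd)
with wave.open(demo_path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 32000)  # 32000 frames = 2 seconds
result = transcribe_long(demo_path, lambda p: "chunk", chunk_seconds=1)
os.remove(demo_path)
```

With the real model, `transcribe_fn` could wrap `asr_model.transcribe([p])` (the exact return type of `transcribe` varies by NeMo version, so unpack accordingly).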