Long text strings produce incomplete audio files #190
I think it's related to an issue I started encountering a month ago, where the service randomly stops responding with audio data. I've observed the same problem in the Edge browser as well. I'm not sure how best to work around this, but an obvious naive solution would be to retry a few times before accepting that the current chunk of split SSML doesn't have any audio data. Right now, if at least one of the split texts returns audio data, no exception is raised and the run is considered a success.
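The naive per-chunk retry described above could be sketched roughly as follows. This is not edge-tts code: `synthesize_with_retry`, `NoAudioError` and the `synthesize` callable are hypothetical names, and the items are assumed to be dicts shaped like edge-tts stream output (`{"type": "audio", "data": b"..."}`). With edge-tts, `synthesize` might be something like `lambda t: edge_tts.Communicate(t, "en-US-AriaNeural").stream()`. Note that it only catches chunks that return no audio at all; as the maintainer explains further down, the service can also return partial audio, which this check cannot detect.

```python
import asyncio
from typing import AsyncIterator, Callable, Dict


class NoAudioError(Exception):
    """Hypothetical error: a chunk produced no audio even after retrying."""


async def synthesize_with_retry(
    synthesize: Callable[[str], AsyncIterator[Dict]],
    text: str,
    attempts: int = 3,
    delay: float = 1.0,
) -> bytes:
    """Synthesize one text chunk, retrying while it yields zero audio bytes.

    `synthesize` is a hypothetical callable returning an async iterator of
    dicts shaped like edge-tts stream items, e.g. {"type": "audio", "data": b"..."}.
    """
    for attempt in range(1, attempts + 1):
        audio = bytearray()
        async for item in synthesize(text):
            if item.get("type") == "audio":
                audio.extend(item["data"])
        if audio:
            # Any audio at all counts as success; partial audio is NOT detected.
            return bytes(audio)
        if attempt < attempts:
            await asyncio.sleep(delay)  # back off before asking the service again
    raise NoAudioError(f"no audio for chunk after {attempts} attempts")
```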
Have you had any luck with the latest release (6.1.10)?
Seems slightly better, but no luck. For context, I ran the generation 6 times on the a.txt file; about a megabyte or so is missing in two of those files...
6.1.10 stops running halfway through. I tested it with a 100,000-word text file, and the resulting MP3 covered only around 80,000 words. 6.1.9, however, could run through the entire process, but the subtitles it generated captured only very little of the text.
I remember that previously the size of the MP3 being generated would increase incrementally until completion, but now it goes from 0 directly to completion. I'm not sure if this is the cause of the issue.
@expwise Thanks for the info, I'll attempt a workaround in a bit. For the time being, I guess you'll need to stick with 6.1.9, since it somehow works better. It's worth mentioning that both versions have issues; it just seems that in your case 6.1.10 is worse. My theory is that this is because 6.1.10 switches to the next ~64 KiB chunk of text immediately, without creating a new connection, whereas 6.1.9 emulates Microsoft Edge's behavior of starting from a new connection.
@rany2 Thank you for your efforts. Your project has been of great help to me. Well done!
What's the status of this issue? When I first reported it, I was using 6.1.9.
@briankendall It's more complicated than I expected. The issue is that their API sometimes returns only part of the audio output on the same connection, so I can't just check whether the current connection returned any audio and retry if it didn't; it's more complicated than that.
@rany2 Understood! I hope you can figure out a way to work around this.
Maybe a workaround would be for edge-tts to chunk text files into workable sizes, run them individually, then splice them back together at the end?
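For illustration, chunking text into pieces under a byte budget could look like the sketch below. `split_text_by_bytes` is a hypothetical helper, not edge-tts's own splitter (which, per the maintainer, already chunks the input, at roughly 64 KiB in 6.1.10); the caller picks `max_bytes`. It prefers breaking on whitespace and never cuts through a multi-byte UTF-8 character.

```python
from typing import List


def split_text_by_bytes(text: str, max_bytes: int) -> List[str]:
    """Split `text` into chunks whose UTF-8 encoding is at most `max_bytes`.

    Breaks at the last space or newline inside the limit when there is one,
    otherwise cuts hard, but never in the middle of a multi-byte character.
    """
    if max_bytes < 4:
        raise ValueError("max_bytes must hold at least one UTF-8 character")
    data = text.encode("utf-8")
    chunks: List[str] = []
    while data:
        if len(data) <= max_bytes:
            chunks.append(data.decode("utf-8"))
            break
        cut = max_bytes
        # Step back over UTF-8 continuation bytes (0b10xxxxxx) to a character start.
        while (data[cut] & 0xC0) == 0x80:
            cut -= 1
        # Prefer breaking on whitespace so words stay whole.
        space = max(data.rfind(b" ", 0, cut), data.rfind(b"\n", 0, cut))
        if space > 0:
            cut = space
        chunks.append(data[:cut].decode("utf-8"))
        data = data[cut:].lstrip(b" \n")
    # Drop chunks that are only whitespace; there is nothing to synthesize in them.
    return [chunk for chunk in chunks if chunk.strip()]
```

Each chunk would then be synthesized separately and the resulting audio written out in order.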
@lefnire We're doing that already. I tried different chunk sizes and I'm having the same issues regardless :(
@rany2 Aw, bummer. Thanks for the reply. I just went through the gamut: tortoise-tts, coqui-ai/tts, bark, edge-tts. Edge was victorious, but for this one bug. Tortoise is unusably slow (but great realism). Coqui and Bark can't take large files, nor did I find their voices realistic. edge-tts shocked me in terms of realism and speed. Here's hoping there's a solution somehow! It's a huge bummer that the Edge browser doesn't support saving to a file without jumping through weird hoops (recording the audio output overnight, that kind of deal).
@lefnire Not sure what you mean by "to-file", but you can actually save the MP3: edge-tts/examples/basic_generation.py, line 19 in e58af9d.
@rany2 Right, right, I meant it's a shame that the Microsoft Edge browser doesn't do this natively. Hence a big value-add of this project.
Hey people, I am struggling with the same thing, also using it for audiobook generation like @lefnire. But I might have found a partial solution. In my case (Python 3.9 on macOS) I received errors with
Mostly, it either produced some audio (often incomplete) or gave that error after a few seconds.
This suppressed the above-mentioned error, but I still struggled with incomplete audio files. I then looked into the 'aiohttp.ClientSession' documentation and found that there is a timeout of 300 seconds (5 minutes). My audio files were around 20 MB each when they stopped being produced, which often took about 5 minutes. After some iteration, I changed this one to 9000 seconds (150 minutes) too:
Since then it seems to work much better again, but not perfectly! I still get incomplete files, just fewer of them. Disclaimer: I tried other things, e.g. reducing the threshold for the "chopping of the texts" from I also want to state that I don't really understand the technicalities, and that I picked the 9000 seconds fairly arbitrarily. As for why this problem occurs, I want to post a guess for discussion: maybe Microsoft started throttling the response/output, so it takes longer nowadays than it did earlier (which is my impression in any case), and therefore these timeouts matter now, despite not having mattered earlier. @rany2 Thank you a lot for your software. I really enjoy using it for my use case, and I have already listened to many hours of audiobooks created with this tool.
@tschnibo Thanks for researching this and for your kind words, I didn't know It seems like the
Makes sense.
@rany2 Thank you for your friendly response! To illustrate the unfinished files: yesterday, after applying these changes, it looked like this: this 10-minute difference between "created" and "last changed" seems like a pattern. Disabling the ClientSession timeout seems like the right way to go, I totally agree. On the other hand, maybe one could set the timeout shorter than 10 minutes, catch the timeout and proactively create a new session, or something like that. But this async session handling and OOP is not something I easily see through, so I don't know what the easiest route would be. Maybe there is a way to just wait for the session to be terminated by the server and then reconnect with a new session, but I don't know whether the server actively communicates this to the client. Looking forward to watching further development on this issue.
This addresses the issue described in #190 (comment)
Signed-off-by: rany
Can someone test whether the version in master (not the one released on PyPI) still has this issue?
Never mind, it's still inconsistent when it comes to this, but the first few runs were fine. I got my hopes up when it was working for the first couple of runs ):
@rany2 I really had to raise both of the timeouts much more, so that only this 10-minute cutoff remains now. Maybe the first runs were "not further throttled", and then some sort of abuse prevention kicks in on the server side and slows the process down further?
@tschnibo But there's no way that
@rany2 To be honest, I have no clue. With my timeout setting, the MP3 file is produced for 10 minutes and then it finishes incomplete. I chose smaller timeout values in the beginning; if you are able to reproduce this 10-minute phenomenon, maybe it would be indicative of some mechanism.
In one of my examples I looked at the submitted text, the produced .vtt file, and also some of the messages (websocket messages, I think). It stopped somewhere in the middle of the submitted text, returning these messages:
... and then it starts on the next text, for the next file. I think it would be interesting to monitor the connection and see whether there is some sort of termination message.
There isn't, unfortunately :(
Could you test the current version in master and see if you still get timeouts? The parameter now sets a timeout for socket recv; previously it controlled how long it could take to get a websocket message response.
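The difference between a deadline on the whole transfer (like the 300-second total on aiohttp's ClientSession mentioned earlier in the thread) and a deadline on each individual receive can be illustrated with plain asyncio. This is a generic sketch under assumed names (`collect_with_total_timeout`, `collect_with_recv_timeout`, `make_slow_stream` are all hypothetical), not edge-tts's actual implementation: a long but healthy stream trips the total deadline, while the per-receive deadline only fails if a single message stalls.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

Recv = Callable[[], Awaitable[Optional[bytes]]]  # returns None when the stream ends


async def collect_with_total_timeout(recv: Recv, total_timeout: float) -> List[bytes]:
    """Deadline on the WHOLE transfer: a long but healthy stream still fails."""
    async def drain() -> List[bytes]:
        items: List[bytes] = []
        while (item := await recv()) is not None:
            items.append(item)
        return items
    return await asyncio.wait_for(drain(), timeout=total_timeout)


async def collect_with_recv_timeout(recv: Recv, recv_timeout: float) -> List[bytes]:
    """Deadline on EACH receive: only a stalled stream fails."""
    items: List[bytes] = []
    while True:
        item = await asyncio.wait_for(recv(), timeout=recv_timeout)
        if item is None:
            return items
        items.append(item)


def make_slow_stream(count: int, delay: float) -> Recv:
    """Fake receiver: `count` messages, each arriving after `delay` seconds."""
    remaining = count

    async def recv() -> Optional[bytes]:
        nonlocal remaining
        await asyncio.sleep(delay)
        if remaining == 0:
            return None
        remaining -= 1
        return b"chunk"

    return recv
```

With 10 messages arriving 0.01 s apart, a 0.5 s per-receive timeout collects everything, while a 0.05 s total timeout raises `asyncio.TimeoutError` partway through.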
Yes, I'll try to test... I'm just doing this alongside a completely different job, so I can't plan when I'll get the testing done.
@rany2 To make my task easier, I patched my existing installation with your changes (I hope I did this correctly). The first few files went flawlessly, but now, maybe because the chapters are getting longer again (or some throttling kicks in again), it showed this 10-minute cutoff again, with unfinished processing. The next chapter went fine again (34 MB of audio generated in 4 minutes), the one after that was cancelled after 10 minutes and 18 MB, as were the next few chapters, until a much shorter chapter, which completed fine. So for me the behavior seems to stay the same as with my extended timeouts: files are either generated correctly (and maybe rather quickly), or the process is slower and quits after 10 minutes for large texts, though it might succeed for shorter texts. I didn't get any timeout errors like with the PyPI version...
@tschnibo So you're saying that the defaults right now don't need any adjusting?
@rany2 I am not quite sure if I understand your question correctly. |
@tschnibo Yep, thank you. I just wanted to know whether the timeout values in master are fine now.
@rany2 I am at work again and did no extended testing, but for the one audiobook I tried (a different one than yesterday), it looks like it, yes!
@rany2 Sorry, I misjudged; on a closer look I again see these:
and this didn't occur with my timeout values. Only one occurrence so far, and it could also be because I disconnected the notebook from my phone for some time, or something like that; I didn't monitor it closely enough. I'll run it again and share the experience... edit:
Just wondering, could the subtitles returned by the server be used as a sanity check for data completeness? That is, if the returned subtitles don't match the text sent, assume all data (including audio) is incomplete. I haven't looked into it, but I can if you think it's not a stupid idea.
@kovaacs It's not a stupid idea, and I did consider it, but the issue is that some characters are ignored by the TTS depending on the voice selected (e.g., if you send Chinese characters to an English voice, it just ignores them), so you would need to somehow figure out which characters every voice accepts.
@rany2 I'm thinking maybe fuzzy matching could be an option: compare the data sent with what was received and see how similar they are. It could be an opt-in flag, e.g.
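A rough sketch of that fuzzy check, using only the standard library's `difflib`, might look like this. `looks_complete` and the 0.9 threshold are hypothetical choices, and `spoken_words` is assumed to come from the word-boundary events the service returns (edge-tts 6.x exposes them as stream items with a `"text"` field; verify against your version). As noted in the reply above, characters a voice silently skips would still lower the score, so the threshold is only a crude tolerance.

```python
import difflib
import re
from typing import Iterable, List


def _words(text: str) -> List[str]:
    # Lowercase and keep only word characters, so punctuation the voice
    # does not speak cannot count against the match.
    return re.findall(r"\w+", text.lower())


def looks_complete(sent_text: str, spoken_words: Iterable[str], threshold: float = 0.9) -> bool:
    """Heuristic completeness check: fuzzy-match sent text against spoken words."""
    sent = _words(sent_text)
    spoken = _words(" ".join(spoken_words))
    if not sent:
        return True
    # Compare word lists rather than characters: much faster on chapter-sized input.
    matcher = difflib.SequenceMatcher(None, sent, spoken, autojunk=False)
    return matcher.ratio() >= threshold
```

If only the first half of the text comes back, the ratio drops to about 0.67 and the chunk would be flagged for a retry.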
@kovaacs Seems like it's more trouble than it's worth, to be honest. As a workaround it would probably work, but I'm not willing to implement it myself.
I use the TTS output to manually check whether the output is complete for the current chapter or not. I don't know yet how to implement this session restart. If I did, I would maybe just chop the texts into bits that don't take too long to produce, and use a session-restart time of about 5 minutes to have a "safety margin": before sending the next chunk to be converted, check how old the session is, and if it is older than 5 minutes, restart the session first. Do you see how this could be implemented?
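The age-based restart proposed above could be sketched as a small wrapper. `AgedSession` is a hypothetical class, `open_session`/`close_session` are caller-supplied hooks, and it is kept synchronous for clarity (a real aiohttp session would need async open/close). Per the maintainer's reply, failures have also occurred within 2 minutes, so this alone is unlikely to be a fix.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

S = TypeVar("S")


class AgedSession(Generic[S]):
    """Reuse one session, but replace it once it is older than `max_age` seconds.

    `open_session` / `close_session` are caller-supplied (hypothetical) hooks;
    `clock` is injectable so the logic can be tested without waiting.
    """

    def __init__(
        self,
        open_session: Callable[[], S],
        close_session: Callable[[S], None],
        max_age: float = 300.0,  # the ~5-minute safety margin suggested above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._open = open_session
        self._close = close_session
        self._max_age = max_age
        self._clock = clock
        self._session: Optional[S] = None
        self._opened_at = 0.0

    def get(self) -> S:
        """Call before sending each chunk; restarts the session if it is too old."""
        now = self._clock()
        if self._session is not None and now - self._opened_at > self._max_age:
            self._close(self._session)
            self._session = None
        if self._session is None:
            self._session = self._open()
            self._opened_at = now
        return self._session
```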
I don't think it's that simple. It already divides the text into chunks and starts new sessions (a new session versus reusing the old connection makes no difference); the issue is that I receive incomplete audio data from the service, so I cannot tell whether the data is complete just by looking at whether there is audio or not. Also, the 5-minute thing is not really reliable: I've had it happen within 2 minutes in my tests. The trouble is that it is inconsistent, and there doesn't seem to be a pattern I could find.
I think I've found a solution. It seems to be an off-by-one error on my end, and the fix I initially tried would have worked; I'll keep you guys posted :')
Please test the latest master (make sure it includes 580f880).
@rany2
- Fixes rany2/edge-tts#190
- Fixes aiohttp timeout issue
- Improves performance on larger inputs

Signed-off-by: rany
I'm trying to use edge-tts to convert a chapter of a book into an audiobook. It's about 39k characters and around 7500 words. When I run it through edge-tts, the resulting audio file is often incomplete. Where in the text it cuts off seems inconsistent and arbitrary, and every now and then it successfully produces audio for the entire text.
Any idea what's going wrong? Is this even a use case that's expected to work? (I wonder if Microsoft is limiting how much audio it'll generate for one request.)