Long text strings produce incomplete audio files #190

Closed
briankendall opened this issue Feb 12, 2024 · 43 comments · Fixed by hasscc/hass-edge-tts#46
Labels
help wanted Extra attention is needed

Comments

@briankendall

I'm trying to use edge-tts to convert a chapter of a book into an audiobook. It's about 39k characters and around 7500 words. When I run it through edge-tts, the resulting audio file is often incomplete. At what point in the text it cuts off seems to be inconsistent and arbitrary, and every now and then it successfully produces audio for the entire text.

Any idea what's going wrong? Is this even a use case that's expected to work? (I wonder if Microsoft is limiting how much audio it'll generate for one request.)

@rany2
Owner

rany2 commented Feb 15, 2024

I think it's related to an issue I've started encountering a month ago where the service randomly stops responding with audio data. It's a problem I've observed in the Edge browser as well.

I'm not sure how best to work around this, but an obvious naive solution would be to retry a few times before accepting that the current split SSML has no audio data. Right now, if any one of the split texts returns audio data, it doesn't raise an exception and considers the whole run a success.
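
A minimal sketch of the naive retry idea described above, not the library's actual code. The names `synthesize_with_retry`, `NoAudioReceived`, and the `synthesize` callable are hypothetical; the callable stands in for whatever sends one chunk to the service and returns its audio bytes.

```python
import asyncio
from typing import Awaitable, Callable


class NoAudioReceived(Exception):
    """Hypothetical error: every attempt came back without audio bytes."""


async def synthesize_with_retry(
    synthesize: Callable[[str], Awaitable[bytes]],
    chunk: str,
    attempts: int = 3,
    delay: float = 1.0,
) -> bytes:
    """Call synthesize(chunk) until it returns audio or the attempts run out."""
    for attempt in range(1, attempts + 1):
        audio = await synthesize(chunk)
        if audio:
            return audio
        if attempt < attempts:
            # Wait a little longer after each empty response.
            await asyncio.sleep(delay * attempt)
    raise NoAudioReceived(f"no audio after {attempts} attempts")
```

This only detects a chunk that returns no audio at all. A chunk that returns part of its audio would still pass the check.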

@rany2
Owner

rany2 commented Feb 16, 2024

Do you have any luck with the latest release (6.1.10)?

@rany2
Owner

rany2 commented Feb 16, 2024

Seems slightly better, but no luck. For context, I ran a generation 6 times on the a.txt file; there's about a megabyte or so missing in two of those files...

➜  edge-tts git:(master) ✗ wc a.txt
  1738  40238 269540 a.txt
➜  edge-tts git:(master) ✗ ls -lh *.mp3
-rw-r--r-- 1 user user 87M Feb 17 00:02 a.mp3
-rw-r--r-- 1 user user 87M Feb 16 23:55 b.mp3
-rw-r--r-- 1 user user 85M Feb 17 00:03 c.mp3
-rw-r--r-- 1 user user 86M Feb 17 00:03 d.mp3
-rw-r--r-- 1 user user 87M Feb 17 00:02 e.mp3
-rw-r--r-- 1 user user 87M Feb 16 23:55 f.mp3
➜  edge-tts git:(master) ✗ ls -l *.mp3 
-rw-r--r-- 1 user user 90403632 Feb 17 00:02 a.mp3
-rw-r--r-- 1 user user 90403632 Feb 16 23:55 b.mp3
-rw-r--r-- 1 user user 88253712 Feb 17 00:03 c.mp3
-rw-r--r-- 1 user user 89767152 Feb 17 00:03 d.mp3
-rw-r--r-- 1 user user 90403632 Feb 17 00:02 e.mp3
-rw-r--r-- 1 user user 90403632 Feb 16 23:55 f.mp3

@expwise

expwise commented Feb 19, 2024

6.1.10 stops running halfway through. I tested it with a 100,000-word text file and got an MP3 containing only around 80,000 words. 6.1.9, however, ran through the entire text, yet the subtitles it generated capture very little of the text.

@expwise

expwise commented Feb 19, 2024

The text file contains 250,000 words, while both the MP3 and the subtitles consist of only around 80,000 words.
6.1.10:
image

@expwise

expwise commented Feb 19, 2024

I remember previously the data for generating the MP3 would incrementally increase until completion, but now it goes from 0 directly to completion. I'm not sure if this is the reason for the issue.

@expwise

expwise commented Feb 19, 2024

6.1.9:
image

@rany2
Owner

rany2 commented Feb 19, 2024

@expwise Thanks for the info, I'll attempt a workaround in a bit. For the time being, I guess you'll need to stick to 6.1.9 as it works better somehow. It's worth mentioning that both have issues, it just seems like in your case 6.1.10 is worse....

My theory is that it has to do with the fact that 6.1.10 switches to the next chunk of ~64KiB text immediately without creating a new connection whereas 6.1.9 emulates the Microsoft Edge behavior of starting from a new connection.
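
To make the chunking step concrete, here is a rough sketch of splitting text into pieces of at most ~64 KiB of UTF-8 on whitespace boundaries. It is an illustration only: `split_text_by_bytes` is a hypothetical helper, not edge-tts's own splitter, and it ignores SSML escaping and collapses runs of whitespace.

```python
from typing import List


def split_text_by_bytes(text: str, max_bytes: int = 2**16) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of <= max_bytes."""
    chunks: List[str] = []
    current: List[str] = []
    current_size = 0
    for word in text.split():
        word_size = len(word.encode("utf-8"))
        # One extra byte for the joining space when the chunk is not empty.
        extra = word_size + (1 if current else 0)
        if current and current_size + extra > max_bytes:
            chunks.append(" ".join(current))
            current, current_size = [word], word_size
        else:
            current.append(word)
            current_size += extra
    if current:
        chunks.append(" ".join(current))
    # Note: a single word longer than max_bytes ends up as its own oversized chunk.
    return chunks
```

Each chunk can then be sent either over a fresh connection or over the existing one, which is the difference between the two releases that the theory above refers to.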

@expwise

expwise commented Feb 19, 2024

@rany2 Thank you for your efforts. Your project has been of great help to me. Well done!

@rany2 rany2 added the help wanted Extra attention is needed label Mar 14, 2024
@briankendall
Author

What's the status of this issue? When I first reported it I was using 6.1.9.

@rany2
Owner

rany2 commented Mar 23, 2024

@briankendall It's more complicated than I expected. The issue is that sometimes their API returns only part of the audio output on the same connection, so I can't just check whether the current connection returned any audio and retry if it didn't....

@briankendall
Author

@rany2 Understood! I hope you can figure out a method for working around this.

@lefnire

lefnire commented Apr 4, 2024

Maybe a workaround could be for edge-tts to chunk text files into workable sizes, run them individually, then splice them back together at the end?

@rany2
Owner

rany2 commented Apr 4, 2024

@lefnire we're doing that already, I tried different chunk sizes and I'm having the same issues regardless :(

@lefnire

lefnire commented Apr 5, 2024

@rany2 aw bummer. Thanks for the reply. I just went through the gamut: tortoise-tts, coqui-ai/tts, bark, edge-tts. Edge was victorious; but for this one bug. Tortoise is unusably slow (but great realism). Coqui & Bark can't take large files, nor did I find their voices realistic. edge-tts shocked me in terms of realism and speed. Here's hoping there's a solution somehow! Huge bummer Edge browser doesn't support to-file, without weird hoops (recording audio-out overnight kinda deal).

@rany2
Owner

rany2 commented Apr 5, 2024

@lefnire not sure what you mean by to-file but you could actually save the mp3:

await communicate.save(OUTPUT_FILE)
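
For context, a small self-contained sketch around that call. The text, voice short name, and output path are placeholder values, and `save_speech` is a hypothetical wrapper name.

```python
import asyncio

TEXT = "Hello, world!"
VOICE = "en-US-AriaNeural"   # placeholder voice short name
OUTPUT_FILE = "hello.mp3"    # placeholder output path


async def save_speech(text: str, voice: str, output_file: str) -> None:
    """Synthesize `text` with `voice` and write the MP3 to `output_file`."""
    # Imported lazily so this module can be loaded without edge-tts installed.
    import edge_tts

    communicate = edge_tts.Communicate(text, voice)
    await communicate.save(output_file)
```

Running it would look like `asyncio.run(save_speech(TEXT, VOICE, OUTPUT_FILE))`, which needs network access to the service.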

@lefnire

lefnire commented Apr 5, 2024

@rany2 right right, I meant it's a shame that Microsoft Edge Browser doesn't do this natively. Hence a big value-add of this project.

@tschnibo

tschnibo commented May 16, 2024

Hey People,

I am struggling with the same problem, also for audiobook generation, like @lefnire.
Earlier I was able to produce several books without problems; nowadays it's a huge struggle.

But I might have found a partial solution.

In my case (Python 3.9 on Mac) I received errors with

asyncio.exceptions.TimeoutError

Mostly it either produced some audio (often incomplete) or gave that error after a few seconds.
Therefore I upped the 'receive_timeout' in 'communicate.py' from 5 to 9000:

    def __init__(
        self,
        text: str,
        voice: str = "Microsoft Server Speech Text to Speech Voice (en-US, AriaNeural)",
        *,
        rate: str = "+0%",
        volume: str = "+0%",
        pitch: str = "+0Hz",
        proxy: Optional[str] = None,
        receive_timeout: int = 9000,
    ):

This prevented the above-mentioned error, but I still struggled with incomplete audio files....

I then looked into the 'aiohttp.ClientSession' documentation and found that there is a default timeout of 300 seconds (5 minutes).

My audio files were around 20 MB each when they stopped being produced, and that often took about 5 minutes. After some iteration, I changed this timeout to 9000 seconds (150 minutes) as well:

        # Create a new connection to the service.
        ssl_ctx = ssl.create_default_context(cafile=certifi.where())

        # By default aiohttp uses a total timeout of 300 seconds (5 min),
        # meaning the whole operation must finish within 5 minutes (not long enough),
        # so we extend it considerably.
        timeout = aiohttp.ClientTimeout(total=9000)

        async with aiohttp.ClientSession(
            timeout=timeout,
            trust_env=True,
        ) as session, session.ws_connect(
            f"{WSS_URL}&ConnectionId={connect_id()}",

Since then it seems to work much better again, but not perfectly!

I still get incomplete files, but fewer.
What I have observed for several files already: they ended up incomplete 10 minutes after the file was initially created.
This could hint at a server-side upper limit of 10 minutes on the connection (though the timeout could still be client-side).
@rany2 I don't understand the software well enough. Is there an easy way to close the session after maybe 5 minutes and continue the text with a new session afterwards?

Disclaimer: I tried other things too, e.g. reducing the threshold for chopping up the texts from websocket_max_size: int = 2**16 to websocket_max_size: int = 2**12. This could have an effect as well, but I don't think so (and @rany2 has already tested this anyway).

I should also say that I don't really understand the technicalities, and that I picked the 9000 seconds quite arbitrarily.

As for the reason this problem occurs, I want to post a guess for discussion: maybe Microsoft started throttling the response/output, so generation takes longer nowadays than it did earlier (which is my impression in any case), and therefore these timeouts matter now even though they didn't before.

@rany2 thank you a lot for your software. I really enjoy using it for my use case, and have already listened to audiobooks created with this tool for many hours.

@rany2
Owner

rany2 commented May 17, 2024

@tschnibo Thanks for researching and for your kind words. I didn't know ClientSession had a timeout, and I never actually faced any timeout errors, so I don't think it's related to this issue specifically. I'll try to look into your points to see if they get me any closer to a resolution.

It seems like the timeout value for ClientSession is a timeout for the entire operation, which seems like something we wouldn't want in this context because a generation might take a very long time. I'll most likely disable it altogether and increase the receive_timeout to a minute.
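
A sketch of what such a configuration could look like with aiohttp: no overall deadline, but a per-read socket timeout of one minute. The helper names `timeout_settings` and `open_session` are hypothetical, and the `sock_connect` value is an assumption rather than something stated in this thread.

```python
from typing import Any, Dict, Optional


def timeout_settings(receive_timeout: Optional[float] = 60.0) -> Dict[str, Any]:
    """Keyword arguments for aiohttp.ClientTimeout with no overall deadline."""
    return {
        "total": None,                 # no limit on the whole operation
        "connect": None,               # no limit on acquiring a connection
        "sock_connect": 10,            # assumed value: seconds to connect the socket
        "sock_read": receive_timeout,  # max seconds to wait for each socket read
    }


async def open_session(receive_timeout: Optional[float] = 60.0):
    """Create a ClientSession that never times out on total duration."""
    # Imported lazily so timeout_settings() works without aiohttp installed.
    import aiohttp

    timeout = aiohttp.ClientTimeout(**timeout_settings(receive_timeout))
    return aiohttp.ClientSession(timeout=timeout, trust_env=True)
```

With `total=None`, a long generation is only interrupted when the socket stays silent for longer than `sock_read` seconds.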

> As for the reason this problem occurs, I want to post a guess for discussion: maybe Microsoft started throttling the response/output, so generation takes longer nowadays than it did earlier (which is my impression in any case), and therefore these timeouts matter now even though they didn't before.

Makes sense.

@tschnibo

tschnibo commented May 17, 2024

@rany2 Thank you for your friendly response!

To illustrate the unfinished files: yesterday, after applying these changes, it looked like this:
Bildschirmfoto 2024-05-17 um 11 29 19

This time difference of 10 minutes between "created" and "last changed" seems like a pattern.

Disabling the ClientSession timeout seems like the right way to go, I totally agree. On the other hand, maybe one could set the timeout to less than 10 minutes, catch the timeout, and proactively create a new session, or something like that. But async session handling and OOP are not things I easily see through, so I don't know what the easiest route would be. Maybe there is a way to just wait for the session to be terminated by the server and then reconnect with a new session, but I don't know whether the server actively communicates this to the client.

Looking forward to watching the further development of this issue.

rany2 added a commit that referenced this issue May 17, 2024
This addresses the issue described in #190 (comment)

Signed-off-by: rany <[email protected]>
rany2 added a commit that referenced this issue May 17, 2024
rany2 added a commit that referenced this issue May 17, 2024
@rany2
Owner

rany2 commented May 17, 2024

Can someone test if the version in master (not the one released in pypi) still has this issue?

rany2 added a commit that referenced this issue May 17, 2024
rany2 added a commit that referenced this issue May 17, 2024
@rany2
Owner

rany2 commented May 17, 2024

Never mind, it's still inconsistent, although the first few runs were fine. I got my hopes up when it was working for the first couple of runs ):

tests/001-long-text_a.mp3 tests/001-long-text_b.mp3 differ: byte 781, line 3
tests/001-long-text_a.mp3 tests/001-long-text_g.mp3 differ: byte 27425332, line 110673
tests/001-long-text_a.srt tests/001-long-text_g.srt differ: byte 177684, line 5505
tests/001-long-text_a.mp3 tests/001-long-text_h.mp3 differ: byte 27425332, line 110673
tests/001-long-text_a.mp3 tests/001-long-text_m.mp3 differ: byte 781, line 3
tests/001-long-text_a.mp3 tests/001-long-text_r.mp3 differ: byte 781, line 3
tests/001-long-text_a.srt tests/001-long-text_r.srt differ: byte 175768, line 5441
tests/001-long-text_a.mp3 tests/001-long-text_z.mp3 differ: byte 781, line 3
tests/001-long-text_a.srt tests/001-long-text_z.srt differ: byte 87159, line 2693
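
For anyone repeating this comparison from Python rather than with `cmp`, a small sketch that reports the first differing byte between two output files. `first_difference` is a hypothetical helper; like `cmp`, it counts bytes from 1.

```python
from typing import Optional


def first_difference(path_a: str, path_b: str, block_size: int = 1 << 16) -> Optional[int]:
    """Return the 1-based offset of the first differing byte, or None if identical."""
    offset = 0
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            block_a = file_a.read(block_size)
            block_b = file_b.read(block_size)
            if block_a != block_b:
                for index, (byte_a, byte_b) in enumerate(zip(block_a, block_b)):
                    if byte_a != byte_b:
                        return offset + index + 1
                # No mismatch in the shared part: one file ends before the other.
                return offset + min(len(block_a), len(block_b)) + 1
            if not block_a:
                return None
            offset += len(block_a)
```

A difference that appears only where the shorter file ends suggests truncation, while an early difference means the two runs diverged in content.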

@tschnibo

tschnibo commented May 17, 2024

@rany2 I really had to raise both of the timeouts much more to be left with only this 10-minute timeout now.
Have you tried similarly excessive timeouts?

Maybe the first runs were "not further throttled", and then some sort of abuse prevention is activated on the server side and slows the process further?

@rany2
Owner

rany2 commented May 17, 2024

@tschnibo but there's no way that receive_timeout would need to be more than a minute? It's for sock recv... are you sure? The receive_timeout now controls the receive on the low-level socket, not the websocket.

@tschnibo

@rany2 to be honest, I have no clue. With my timeout settings, the mp3 file is written for 10 minutes and then ends up incomplete. When I chose smaller values for receive_timeout in the beginning, I still got this asyncio.exceptions.TimeoutError. But yes, because I changed both values, I am not 100% sure which one had which effect.

If you were able to reproduce this 10-minute phenomenon, maybe that would point to some mechanism.

@tschnibo

tschnibo commented May 17, 2024

With one of my examples I looked at the submitted text, the produced .vtt file, and also some of the (I think) websocket messages.

It stopped somewhere in the middle of the submitted text, with these messages being returned:

{'type': 'WordBoundary', 'offset': 35077000000, 'duration': 1500000, 'text': 'that'}
{'type': 'WordBoundary', 'offset': 35078625000, 'duration': 1500000, 'text': 'have'}
{'type': 'WordBoundary', 'offset': 35080250000, 'duration': 6500000, 'text': 'significant'}
{'type': 'WordBoundary', 'offset': 35086875000, 'duration': 7750000, 'text': 'implications'}
{'type': 'WordBoundary', 'offset': 35094750000, 'duration': 1125000, 'text': 'for'}

... and then it starts with the next text, for the next file.

I think it would be interesting to monitor the connection and see if there is some sort of termination message.
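
One way to read these messages is to convert the last boundary into a playback position. This sketch assumes that `offset` and `duration` are in 100-nanosecond units, which is not stated in this thread; `last_spoken_second` is a hypothetical helper.

```python
from typing import Dict, Iterable, Optional

# Assumption: offsets and durations are expressed in 100-nanosecond ticks.
TICKS_PER_SECOND = 10_000_000


def last_spoken_second(events: Iterable[Dict]) -> Optional[float]:
    """Return the end time, in seconds, of the latest WordBoundary event."""
    boundaries = [e for e in events if e.get("type") == "WordBoundary"]
    if not boundaries:
        return None
    last = max(boundaries, key=lambda e: e["offset"])
    return (last["offset"] + last["duration"]) / TICKS_PER_SECOND


events = [
    {"type": "WordBoundary", "offset": 35086875000, "duration": 7750000, "text": "implications"},
    {"type": "WordBoundary", "offset": 35094750000, "duration": 1125000, "text": "for"},
]
position = last_spoken_second(events)
```

Under that assumption, the last message above ends at about 3509.6 seconds, so the stream stopped roughly 58 minutes into the audio.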

@rany2
Owner

rany2 commented May 17, 2024

> see if there is some sort of termination message

There isn't unfortunately :(

> Just with my timeout setting, the mp3 file is produced for 10 minutes and then it is finished uncomplete. when I chose smaller timeout values, in the beginning

Could you test the current version in master and see if you still get timeouts? The parameter now sets a timeout for socket recv; previously it controlled how long to wait for a websocket message response.

@tschnibo

tschnibo commented May 17, 2024

Yes, I'll try to test. I'm just doing this alongside a completely different job, so I can't plan when I'll get the testing done.

@tschnibo

tschnibo commented May 18, 2024

@rany2 to make my task easier, I patched my existing installation with your changes; I hope I did this correctly. The first few files went flawlessly, but now (maybe the chapters are getting longer again, or some throttling has kicked in again) it has shown this 10-minute cutoff again, with the processing left unfinished.

The next chapter went all right again (34 MB of audio generated in 4 minutes); the one after that was cut off again after 10 minutes and 18 MB, as were the next few chapters, until a much shorter chapter, which completed fine.

So for me the behavior seems to be the same as with my extended timeouts: files are either generated correctly (and maybe rather quickly), or the process is slower and quits after 10 minutes for large texts, while shorter texts may still succeed.

I didn't get any timeout errors like in the PyPI version...

@rany2
Owner

rany2 commented May 18, 2024

@tschnibo so you're saying that the defaults right now don't need any adjusting?

@tschnibo

@rany2 I am not quite sure if I understand your question correctly.
With your master version I no longer get these timeout error messages, just as when I adjusted the timeouts myself, but the "unfinished" audio for long texts still occurs. Does that answer your question?

@rany2
Owner

rany2 commented May 18, 2024

@tschnibo yep, thank you. I just wanted to know that the timeout values in master are fine now.

@tschnibo

@rany2 I am working again and did no extensive testing, but for the one audiobook (a different one than yesterday) it looks like it, yes!

@tschnibo

tschnibo commented May 18, 2024

@rany2 sorry, I misjudged; on a closer look I again found these:

    Traceback (most recent call last):
      File "/opt/homebrew/bin/edge-tts", line 8, in <module>
        sys.exit(main())
      File "/opt/homebrew/lib/python3.9/site-packages/edge_tts/util.py", line 139, in main
        loop.run_until_complete(amain())
      File "/opt/homebrew/Cellar/python@3.9/3.9.19/Frameworks/Python.framework/Versions/3.9/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
        return future.result()
      File "/opt/homebrew/lib/python3.9/site-packages/edge_tts/util.py", line 132, in amain
        await _run_tts(args)
      File "/opt/homebrew/lib/python3.9/site-packages/edge_tts/util.py", line 65, in _run_tts
        async for chunk in tts.stream():
      File "/opt/homebrew/lib/python3.9/site-packages/edge_tts/communicate.py", line 445, in stream
        async for received in websocket:
      File "/opt/homebrew/lib/python3.9/site-packages/aiohttp/client_ws.py", line 312, in __anext__
        msg = await self.receive()
      File "/opt/homebrew/lib/python3.9/site-packages/aiohttp/client_ws.py", line 244, in receive
        msg = await self._reader.read()
      File "/opt/homebrew/lib/python3.9/site-packages/aiohttp/streams.py", line 663, in read
        return await super().read()
      File "/opt/homebrew/lib/python3.9/site-packages/aiohttp/streams.py", line 622, in read
        await self._waiter
    aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket

This didn't occur with my timeout values.

Only one occurrence so far, and it could also be because I disconnected the notebook from my phone for some time, or something like that; I didn't monitor closely enough.

I'll run it again, and share the experience...

Edit:
The next run had no such timeout messages, but it did have some outputs maxed out at 10 minutes.

@kovaacs
Copy link

kovaacs commented May 20, 2024

Just wondering, can the subtitles returned by the server be used as a sanity check for data completeness? So, if the subtitles returned do not match the text sent, assume all data (including audio) to be incomplete. I haven't looked into it, but I can if you think it's not a stupid idea.

@rany2
Owner

rany2 commented May 20, 2024

@kovaacs It's not a stupid idea and I did consider it, but the issue is that some characters are ignored by TTS depending on the voice selection (e.g., if you send Chinese characters to an English voice it will just ignore them), so you would need to somehow figure out all the characters every voice accepts.

@kovaacs

kovaacs commented May 20, 2024

@rany2 I'm thinking maybe fuzzy matching could be an option. You compare the sent data with what was received, and see how similar they are. It could be an opt-in flag, e.g. --min-confidence 0.8, meaning if the similarity is below 80%, consider the chunk a failure and retry it. It'd be up to the user to choose their similarity score and also to ensure that they don't send garbage data that would skew the score.
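
A rough sketch of this fuzzy-matching idea using only the standard library. It compares the words that were sent with the words reported back (for example from WordBoundary messages). The function names are hypothetical, and `min_confidence` mirrors the proposed flag, which does not exist in edge-tts.

```python
import difflib
import re
from typing import Iterable, List


def _words(text: str) -> List[str]:
    """Lowercase the text and keep only word-like tokens."""
    return re.findall(r"\w+", text.lower())


def coverage_ratio(sent_text: str, spoken_words: Iterable[str]) -> float:
    """Similarity between 0.0 and 1.0 of the sent words and the reported words."""
    sent = _words(sent_text)
    spoken = _words(" ".join(spoken_words))
    if not sent:
        return 1.0
    matcher = difflib.SequenceMatcher(None, sent, spoken, autojunk=False)
    return matcher.ratio()


def looks_complete(sent_text: str, spoken_words: Iterable[str],
                   min_confidence: float = 0.8) -> bool:
    """True when the reported words cover the sent text well enough."""
    return coverage_ratio(sent_text, spoken_words) >= min_confidence
```

A chunk for which `looks_complete` returns False could be retried. `SequenceMatcher` can be slow on very long inputs, so it makes more sense to run this check per chunk than on a whole book.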

@rany2
Owner

rany2 commented May 20, 2024

@kovaacs Seems like it's more trouble than it's worth to be honest. I think as a workaround it would probably work but I'm not willing to implement it myself.

@tschnibo

I use the TTS output to manually check whether the output is complete for the current chapter.
For me the question would be: what do you do with this information? Do you restart the session?

I don't know yet how to implement this session restart. If I did, I would maybe just chop the texts into bits that don't take too long to produce, and use a session-restart time of about 5 minutes to have a "safety margin".

So just check how old the session is before sending the next chunk to be converted, and if it is older than 5 minutes, restart the session first.

Do you see how this could be implemented?
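
The age check itself could be sketched as follows. `SessionClock` is a hypothetical helper that only tracks elapsed time. Closing and reopening the actual connection is left to the caller, and whether a fresh session avoids the cutoff is a separate question.

```python
import time
from typing import Callable


class SessionClock:
    """Tracks how long the current session has been open."""

    def __init__(self, max_age_seconds: float = 300.0,
                 now: Callable[[], float] = time.monotonic) -> None:
        self._max_age = max_age_seconds
        self._now = now
        self._started = now()

    def expired(self) -> bool:
        """True once the session is at least max_age_seconds old."""
        return self._now() - self._started >= self._max_age

    def restart(self) -> None:
        """Record that a new session has just been opened."""
        self._started = self._now()
```

Before sending each chunk, the caller would test `clock.expired()`, reconnect if it returns True, and then call `clock.restart()`.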

@rany2
Owner

rany2 commented May 21, 2024

I don't think it's that simple. It already divides the text into chunks and starts new sessions (a new session versus reusing the old connection makes no difference); the issue is that I receive incomplete audio data from the service, so I cannot tell whether the data is complete just by checking whether there is any audio.

Also, the 5-minute threshold is not really reliable. I've had it happen in my tests within 2 minutes; the trouble is that it is inconsistent and I couldn't find a pattern.

@rany2
Owner

rany2 commented May 21, 2024

I think I've found a solution. It seems to be an off-by-one error on my end, and the fix I initially tried would have worked. I'll keep you guys posted :')

@rany2 rany2 closed this as completed in 580f880 May 21, 2024
@rany2
Owner

rany2 commented May 21, 2024

Please test latest master (make sure to include 580f880)

@kovaacs

kovaacs commented May 21, 2024

@rany2 n = 1, but I've just exported a 14+ hour long file without problems. Thanks for fixing it so quickly, you saved me quite a lot of manual effort.

rany2 added a commit to hasscc/hass-edge-tts that referenced this issue May 22, 2024
- Fixes rany2/edge-tts#190
- Fixes aiohttp timeout issue
- Improves performance on larger inputs

Signed-off-by: rany <[email protected]>