
Directly outputting Words/timestamp to Mouth Shapes bypassing pocketSphinx #73

Open · towfiqi opened this issue Apr 30, 2019 · 10 comments


towfiqi commented Apr 30, 2019

Rhubarb is a little slow; I think it's because of the time PocketSphinx takes to recognize the words in the audio file. In my project, I will use Google Speech-to-Text, which will output something like this:

[
             {
                "startTime": "1.300s",
                "endTime": "1.400s",
                "word": "Four"
              },
              {
                "startTime": "1.400s",
                "endTime": "1.600s",
                "word": "score"
              },
              {
                "startTime": "1.600s",
                "endTime": "1.600s",
                "word": "and"
              },
              {
                "startTime": "1.600s",
                "endTime": "1.900s",
                "word": "twenty"
              }
]

Which file/function in this repo should I look into to see how Rhubarb converts words directly to mouth shapes? I mean, which function does Rhubarb use to convert words to shapes, e.g.:

0.00	A
0.05	B
0.63	C
0.70	B
0.84	F
0.98	D

I looked into rhubarb-lip-sync/rhubarb/src/core/ and could not figure it out. So far I understand that the words are first converted to the DARPA phonetic alphabet (ARPABET) and then converted to mouth shapes, e.g. AA becomes shape A, EH becomes shape C, etc.

Can you kindly provide a high-level overview of the process Rhubarb follows to convert words to shapes?

Thanks

DanielSWolf (Owner) commented

The best overview you'll find is in /src/lib/rhubarbLib.cpp. The first step is getting time-stamped phones; the second step is animating them.

Google won't give you the timing of individual phones, only words. So you'll have to enter at a lower level of code, somewhere within /src/recognition. You'll have to replace the part that recognizes the words, but not the part that converts words to phones and aligns them.
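
To illustrate the seam, here is a hypothetical TypeScript sketch (Rhubarb's real code is C++, and none of these names exist in the repo): an external recognizer has to produce timed words; the stages you keep turn them into timed phones and then into shape cues.

    // Hypothetical types sketching the pipeline; illustrative only.
    interface TimedWord  { start: number; end: number; word: string; }  // seconds
    interface TimedPhone { start: number; end: number; phone: string; } // ARPABET

    // Stage to replace: word recognition (PocketSphinx today, Google STT instead).
    declare function recognizeWords(audio: Float32Array): TimedWord[];

    // Stages to keep: word-to-phone conversion and alignment, then animation.
    declare function wordsToAlignedPhones(words: TimedWord[]): TimedPhone[];
    declare function animate(phones: TimedPhone[]): { time: number; shape: string }[];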

One problem I foresee is that Google STT won't reliably give you individual words. Try recognizing "$142". PocketSphinx will give you words: "one hundred and forty two dollars". Last time I tried it, Google attempted to be clever and returned "$142". There are three problems here:

  • "$" is not a word, but a symbol. Rhubarb won't be able to convert that into phones.
  • "142" is not a word, but a series of digits. Same problem.
  • The order of $ and 142 is reversed compared to what is actually being said.

Unless Google changed their output, your approach will likely fail every time numbers, currencies, or dates are involved. If this is no longer the case, please let me know. I'd love to integrate Google STT with Rhubarb.


towfiqi commented Apr 30, 2019

The dollar issue can be mitigated by converting the "$" to the string "dollar" and the numbers to words with something like https://www.npmjs.com/package/written-number before sending it to Google STT.
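
Something like this, roughly (a sketch; it assumes written-number's default English output and only handles plain "$<integer>" amounts):

    // npm install written-number
    import writtenNumber from "written-number";

    // Naive normalization: "$142" -> "one hundred and forty-two dollars".
    // Only covers plain integer dollar amounts; real text needs more care.
    function normalizeDollars(text: string): string {
      return text.replace(/\$(\d+)/g, (_, digits: string) =>
        `${writtenNumber(Number(digits))} dollars`);
    }

    console.log(normalizeDollars("He paid $142."));
    // e.g. "He paid one hundred and forty-two dollars."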

As for the timestamp for each phone, I was planning to divide the word's duration by the number of shapes. This is how I plan to complete the whole process:

First, get the result from Google/IBM speech-to-text. For example, the word "Four" is uttered in 0.1 seconds, and Google STT outputs:

{ "startTime": "1.300s", "endTime": "1.400s", "word": "Four" }

Next, find the ARPABET transcription for "Four" in cmudict.0.7a, which is: F AO R.

Next, convert F AO R to X G E H X (5 shapes). This can be done easily, since each shape correlates to only a few phones.

Then divide the duration (0.10 s) by 5 and give each shape 0.02 seconds, like this:

0.00	X
0.02	G
0.04	E
0.06	H
0.08	X
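
Here's a rough sketch of those steps (the dictionary and phone-to-shape entries are trimmed to just this example; a real version would load all of cmudict.0.7a):

    // Tiny excerpts standing in for the full data files.
    const cmudict: Record<string, string[]> = {
      FOUR: ["F", "AO", "R"], // cmudict.0.7a: "FOUR  F AO R"
    };
    const phoneToShape: Record<string, string> = { F: "G", AO: "E", R: "H" };

    // Spread the word's shapes evenly across its duration,
    // padding with the rest shape "X" at both ends.
    function wordToShapeCues(word: string, start: number, end: number) {
      const phones = cmudict[word.toUpperCase()] ?? [];
      const shapes = ["X", ...phones.map(p => phoneToShape[p] ?? "X"), "X"];
      const step = (end - start) / shapes.length;
      return shapes.map((shape, i) => ({ time: start + i * step, shape }));
    }

    console.log(wordToShapeCues("Four", 0, 0.1));
    // -> X, G, E, H, X at 0.00, 0.02, 0.04, 0.06, 0.08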

I know the animation won't be perfect, but since 95% of characters built in Spine are 2D-ish, it won't be that noticeable.

If it works, the great thing about this approach is that it can easily be used on the fly, which my project depends on.

Let me know what you think.
Thanks

DanielSWolf (Owner) commented

For simple cases, this should work. Beware, however, that there are many special cases that won't be covered. To list a few:

  • Most speech-to-text APIs will give you denormalized text. Dollars and numbers are just examples. In general, you'll need to normalize their output.
  • From the (few) tests I did on Google STT, their word timings are sometimes way off.
  • Words may not be in the dictionary.
  • Some phones within a word may be much longer than others.

At the end of the day, it all depends on your requirements. The more control you have over the aspects I mentioned, the less of a problem they may be.


towfiqi commented May 3, 2019

I will keep that in mind. Can you kindly look at the list below and see if I have placed the phones under the correct mouth shapes:

    /*A*/ ["P", "B", "M"],
    /*B*/ ["K", "S", "T", "EE", "IY", "IH"],
    /*C*/ ["EH", "AE", "AH", "Schwa", "EY", "AY", "HH", "G", "CH", "JH", "R", "Y"],
    /*D*/ ["AA"],
    /*E*/ ["AO", "ER", "SH", "ZH"],
    /*F*/ ["UW", "OW", "W", "UH", "AW", "OY"],
    /*G*/ ["F", "V"],
    /*H*/ ["L", "N", "NG", "T", "D", "TH", "DH", "S", "Z"],
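
For context, I'd invert it into a phone-to-shape lookup roughly like this (a sketch; phones such as "T" and "S" that appear under two shapes would need resolving by hand, since the last entry wins here):

    // Invert shape -> phones into phone -> shape; last occurrence wins.
    const shapeTable: Record<string, string[]> = {
      A: ["P", "B", "M"],
      B: ["K", "S", "T", "EE", "IY", "IH"],
      C: ["EH", "AE", "AH", "Schwa", "EY", "AY", "HH", "G", "CH", "JH", "R", "Y"],
      D: ["AA"],
      E: ["AO", "ER", "SH", "ZH"],
      F: ["UW", "OW", "W", "UH", "AW", "OY"],
      G: ["F", "V"],
      H: ["L", "N", "NG", "T", "D", "TH", "DH", "S", "Z"],
    };
    const phoneToShape: Record<string, string> = {};
    for (const [shape, phones] of Object.entries(shapeTable)) {
      for (const phone of phones) phoneToShape[phone] = shape;
    }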

Thank You

DanielSWolf (Owner) commented

Rhubarb's animation algorithm is more complex than a simple lookup. For a time-tested lookup table, I recommend you have a look at Papagayo or Papagayo-NG.


towfiqi commented May 3, 2019

I will look into it. Thanks for all your help! 😃

lukas-mertens commented

@towfiqi Did you make any progress on this? I would be very interested in this as well, because it could be a great way to make Rhubarb work with different languages (see #5).


towfiqi commented Jun 15, 2019

@lukas-mertens The simple array lookup sufficed for my project, so I am using the method I described above. For a different language, just swap the cmudict.0.7a English dictionary with a dictionary for that language: https://sourceforge.net/projects/cmusphinx/files/G2P%20Models/fst/
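
Parsing a cmudict-format dictionary into a lookup is straightforward, by the way. A sketch, assuming the plain-text "WORD  PH1 PH2 ..." line format; it skips ;;; comment lines and (2)-style alternate pronunciations, and strips stress digits like AO1 where present:

    import { readFileSync } from "fs";

    // "FOUR  F AO1 R" per line -> word -> phones (stress digits stripped).
    function loadDict(path: string): Map<string, string[]> {
      const dict = new Map<string, string[]>();
      for (const line of readFileSync(path, "utf8").split("\n")) {
        if (!line.trim() || line.startsWith(";;;")) continue; // comments/blanks
        const [word, ...phones] = line.trim().split(/\s+/);
        if (word.includes("(")) continue; // skip alternate pronunciations
        dict.set(word, phones.map(p => p.replace(/\d$/, "")));
      }
      return dict;
    }

    console.log(loadDict("cmudict.0.7a").get("FOUR")); // -> ["F", "AO", "R"]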

lukas-mertens commented

I am actually already doing that right now, but it doesn't work that well, because the precision is rather bad: sometimes the mouth moves when nothing is spoken, and the other way around. Did you get better results using Google Speech-to-Text? I tried out the API and got very good speech recognition results with my files (only about one wrong word every few sentences). That's why I was thinking about using Google's API to get precise timing for every word. Additionally, I found out that espeak supports converting text to IPA:

cat script.txt | espeak -q -v de --ipa > phonetics.txt

If I had a lookup table from IPA to Rhubarb's mouth shapes, I believe I could get quite precise results. @DanielSWolf I don't know how the integrated phonetic recognizer works, but could this perhaps be combined? Maybe you could use espeak to convert text to IPA for many languages and use Google to at least make the words line up with when the mouth moves.
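
Such an IPA table might start out something like this (a sketch; the assignments are my illustrative guesses, not Rhubarb's mapping, and multi-character symbols like diphthongs would need real tokenization rather than per-character lookup):

    // Illustrative IPA -> mouth-shape guesses; not Rhubarb's actual mapping.
    const ipaToShape: Record<string, string> = {
      "p": "A", "b": "A", "m": "A",
      "i": "B", "ɪ": "B", "s": "B", "t": "B", "k": "B",
      "ɛ": "C", "æ": "C", "ə": "C",
      "ɑ": "D",
      "ɔ": "E", "ɜ": "E", "ʃ": "E", "ʒ": "E",
      "u": "F", "ʊ": "F", "w": "F",
      "f": "G", "v": "G",
      "l": "H", "n": "H", "ŋ": "H", "θ": "H", "ð": "H", "z": "H", "d": "H",
    };

    // Map each known symbol to a shape; unknown symbols become rest ("X").
    // Note: diphthongs (e.g. aɪ) span two code points and need real tokenizing.
    function ipaToShapes(ipa: string): string[] {
      return [...ipa].map(ch => ipaToShape[ch] ?? "X");
    }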


towfiqi commented Jun 15, 2019

You can try IBM Watson Speech to Text; it gives you an exact timestamp for each word, which Google Speech-to-Text doesn't.

And Rhubarb uses PocketSphinx, which is not that great. It may output good results if the audio quality is good and the pronunciation is clear.

For my project, my requirement is actually text-to-mouth-shape animation, which is why I don't have to convert audio to text first. I already have the text.

If you could share your method of converting text to mouth shapes, that would be great! On my attempt, I got lots of closed mouths between words, which made the animation look bad.

Edit: I just checked your link. Looks like Google STT does output timestamps, which I missed when I was researching.
