
Directly outputting Words/timestamp to Mouth Shapes bypassing pocketSphinx #73

Open · towfiqi opened this issue Apr 30, 2019 · 10 comments


towfiqi commented Apr 30, 2019

Rhubarb is a little slow; I think it's because of the time PocketSphinx takes to recognize the words in the audio file. In my project, I will use Google Speech-to-Text, which will output something like this:

[
             {
                "startTime": "1.300s",
                "endTime": "1.400s",
                "word": "Four"
              },
              {
                "startTime": "1.400s",
                "endTime": "1.600s",
                "word": "score"
              },
              {
                "startTime": "1.600s",
                "endTime": "1.600s",
                "word": "and"
              },
              {
                "startTime": "1.600s",
                "endTime": "1.900s",
                "word": "twenty"
              }
]

Which file/function in this repo should I look into to see how Rhubarb converts words directly to mouth shapes? I mean, which function does Rhubarb use to convert words to shapes, e.g.:

0.00	A
0.05	B
0.63	C
0.70	B
0.84	F
0.98	D

I looked into rhubarb-lip-sync/rhubarb/src/core/ and could not figure it out. So far I understand that the words are first converted to the DARPA phonetic alphabet (ARPABET) and then converted to mouth shapes, e.g. AA becomes shape A, EH becomes shape C, etc.

Can you kindly provide a high-level overview of the process Rhubarb follows to convert words to shapes?

Thanks

DanielSWolf (Owner) commented

The best overview you'll find is in /src/lib/rhubarbLib.cpp. The first step is getting time-stamped phones; the second step is animating them.

Google won't give you the timing of individual phones, only words. So you'll have to enter at a lower level of code, somewhere within /src/recognition. You'll have to replace the part that recognizes the words, but not the part that converts words to phones and aligns them.
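
To illustrate the seam, here is a hypothetical TypeScript sketch (Rhubarb's real code is C++, and none of these names exist in the repo): an external recognizer has to produce timed words; the stages you keep turn them into timed phones and then into shape cues.

    // Hypothetical types sketching the pipeline; illustrative only.
    interface TimedWord  { start: number; end: number; word: string; }  // seconds
    interface TimedPhone { start: number; end: number; phone: string; } // ARPABET

    // Stage to replace: word recognition (PocketSphinx today, Google STT instead).
    declare function recognizeWords(audio: Float32Array): TimedWord[];

    // Stages to keep: word-to-phone conversion and alignment, then animation.
    declare function wordsToAlignedPhones(words: TimedWord[]): TimedPhone[];
    declare function animate(phones: TimedPhone[]): { time: number; shape: string }[];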

One problem I foresee is that Google STT won't reliably give you individual words. Try recognizing "$142". PocketSphinx will give you words: "one hundred and forty two dollars". Last time I tried it, Google attempted to be clever and returned "$142". There are three problems here:

  • "$" is not a word, but a symbol. Rhubarb won't be able to convert that into phones.
  • "142" is not a word, but a series of digits. Same problem.
  • The order of $ and 142 is reversed compared to what is actually being said.

Unless Google changed their output, your approach will likely fail every time numbers, currencies, or dates are involved. If this is no longer the case, please let me know. I'd love to integrate Google STT with Rhubarb.


towfiqi commented Apr 30, 2019

The dollar issue can be mitigated by converting the "$" to the string "dollar" and the numbers to words with something like https://www.npmjs.com/package/written-number before sending it to Google STT.
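
Something like this, roughly (a sketch; it assumes written-number's default English output and only handles plain "$<integer>" amounts):

    // npm install written-number
    import writtenNumber from "written-number";

    // Naive normalization: "$142" -> "one hundred and forty-two dollars".
    // Only covers plain integer dollar amounts; real text needs more care.
    function normalizeDollars(text: string): string {
      return text.replace(/\$(\d+)/g, (_, digits: string) =>
        `${writtenNumber(Number(digits))} dollars`);
    }

    console.log(normalizeDollars("He paid $142."));
    // e.g. "He paid one hundred and forty-two dollars."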

As for the timestamp for each phone, I was planning to divide the word's duration by the number of shapes. This is how I plan to complete the whole process:

First, get the result from Google/IBM speech-to-text. For example, the word "Four" is uttered in 0.1 seconds, and Google STT outputs:

{ "startTime": "1.300s", "endTime": "1.400s", "word": "Four" }

Next, find the ARPABET transcription for "Four" in cmudict.0.7a, which is: F AO R.

Next, convert F AO R to X G E H X (5 shapes). This can be done easily, since each shape correlates to only a few phones.

Then divide the duration (0.10 s) by 5 and give each shape 0.02 seconds, like this:

0.00	X
0.02	G
0.04	E
0.06	H
0.08	X
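
Here's a rough sketch of those steps (the dictionary and phone-to-shape entries are trimmed to just this example; a real version would load all of cmudict.0.7a):

    // Tiny excerpts standing in for the full data files.
    const cmudict: Record<string, string[]> = {
      FOUR: ["F", "AO", "R"], // cmudict.0.7a: "FOUR  F AO R"
    };
    const phoneToShape: Record<string, string> = { F: "G", AO: "E", R: "H" };

    // Spread the word's shapes evenly across its duration,
    // padding with the rest shape "X" at both ends.
    function wordToShapeCues(word: string, start: number, end: number) {
      const phones = cmudict[word.toUpperCase()] ?? [];
      const shapes = ["X", ...phones.map(p => phoneToShape[p] ?? "X"), "X"];
      const step = (end - start) / shapes.length;
      return shapes.map((shape, i) => ({ time: start + i * step, shape }));
    }

    console.log(wordToShapeCues("Four", 0, 0.1));
    // -> X, G, E, H, X at 0.00, 0.02, 0.04, 0.06, 0.08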

I know the animation won't be perfect, but since 95% of characters built in Spine are 2D-ish, it won't be that noticeable.

If it works, the great thing about this approach is that it can easily be used on the fly, which my project depends on.

Let me know what you think.
Thanks

DanielSWolf (Owner) commented

For simple cases, this should work. Beware, however, that there are many special cases that won't be covered. To list a few:

  • Most speech-to-text APIs will give you denormalized text. Dollars and numbers are just examples. In general, you'll need to normalize their output.
  • From the (few) tests I did on Google STT, their word timings are sometimes way off.
  • Words may not be in the dictionary.
  • Some phones within a word may be much longer than others.

At the end of the day, it all depends on your requirements. The more control you have over the aspects I mentioned, the less of a problem they may be.


towfiqi commented May 3, 2019

I will keep that in mind. Can you kindly look at the list below and see if I have placed the phones under the correct mouth shapes:

    /*A*/ ["P", "B", "M"],
    /*B*/ ["K", "S", "T", "EE", "IY", "IH"],
    /*C*/ ["EH", "AE", "AH", "Schwa", "EY", "AY", "HH", "G", "CH", "JH", "R", "Y"],
    /*D*/ ["AA"],
    /*E*/ ["AO", "ER", "SH", "ZH"],
    /*F*/ ["UW", "OW", "W", "UH", "AW", "OY"],
    /*G*/ ["F", "V"],
    /*H*/ ["L", "N", "NG", "T", "D", "TH", "DH", "S", "Z"],
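
For context, I'd invert it into a phone-to-shape lookup roughly like this (a sketch; phones such as "T" and "S" that appear under two shapes would need resolving by hand, since the last entry wins here):

    // Invert shape -> phones into phone -> shape; last occurrence wins.
    const shapeTable: Record<string, string[]> = {
      A: ["P", "B", "M"],
      B: ["K", "S", "T", "EE", "IY", "IH"],
      C: ["EH", "AE", "AH", "Schwa", "EY", "AY", "HH", "G", "CH", "JH", "R", "Y"],
      D: ["AA"],
      E: ["AO", "ER", "SH", "ZH"],
      F: ["UW", "OW", "W", "UH", "AW", "OY"],
      G: ["F", "V"],
      H: ["L", "N", "NG", "T", "D", "TH", "DH", "S", "Z"],
    };
    const phoneToShape: Record<string, string> = {};
    for (const [shape, phones] of Object.entries(shapeTable)) {
      for (const phone of phones) phoneToShape[phone] = shape;
    }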

Thank You

DanielSWolf (Owner) commented

Rhubarb's animation algorithm is more complex than a simple lookup. For a time-tested lookup table, I recommend you have a look at Papagayo or Papagayo-NG.


towfiqi commented May 3, 2019

I will look into it. Thanks for all your help! 😃

lukas-mertens commented

@towfiqi Did you make any progress on this? I would be very interested in this as well, because it could be a great way to make Rhubarb work with different languages (see #5).


towfiqi commented Jun 15, 2019

@lukas-mertens The simple array lookup sufficed for my project, so I am using the method I described above. For a different language, just swap the cmudict.0.7a English dictionary with a dictionary for that language: https://sourceforge.net/projects/cmusphinx/files/G2P%20Models/fst/
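
Parsing a cmudict-format dictionary into a lookup is straightforward, by the way. A sketch, assuming the plain-text "WORD  PH1 PH2 ..." line format; it skips ;;; comment lines and (2)-style alternate pronunciations, and strips stress digits like AO1 where present:

    import { readFileSync } from "fs";

    // "FOUR  F AO1 R" per line -> word -> phones (stress digits stripped).
    function loadDict(path: string): Map<string, string[]> {
      const dict = new Map<string, string[]>();
      for (const line of readFileSync(path, "utf8").split("\n")) {
        if (!line.trim() || line.startsWith(";;;")) continue; // comments/blanks
        const [word, ...phones] = line.trim().split(/\s+/);
        if (word.includes("(")) continue; // skip alternate pronunciations
        dict.set(word, phones.map(p => p.replace(/\d$/, "")));
      }
      return dict;
    }

    console.log(loadDict("cmudict.0.7a").get("FOUR")); // -> ["F", "AO", "R"]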

lukas-mertens commented

I am actually already doing that right now, but it doesn't work that well, because the precision is rather bad: sometimes the mouth moves when nothing is spoken, and the other way around. Did you get better results using Google Speech-to-Text? I tried out the API and got very good speech recognition results with my files (only about one wrong word every few sentences). That's why I was thinking about using Google's API to get precise timing for every word. Additionally, I found out that espeak supports converting text to IPA:

cat script.txt | espeak -q -v de --ipa > phonetics.txt

If I had a lookup table from IPA to Rhubarb's mouth shapes, I believe I could get quite precise results. @DanielSWolf I don't know how the integrated phonetic recognizer works, but could this perhaps be combined? Maybe you could use espeak to convert text to IPA for many languages and use Google to at least make the words line up with when the mouth moves.
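
Such an IPA table might start out something like this (a sketch; the assignments are my illustrative guesses, not Rhubarb's mapping, and multi-character symbols like diphthongs would need real tokenization rather than per-character lookup):

    // Illustrative IPA -> mouth-shape guesses; not Rhubarb's actual mapping.
    const ipaToShape: Record<string, string> = {
      "p": "A", "b": "A", "m": "A",
      "i": "B", "ɪ": "B", "s": "B", "t": "B", "k": "B",
      "ɛ": "C", "æ": "C", "ə": "C",
      "ɑ": "D",
      "ɔ": "E", "ɜ": "E", "ʃ": "E", "ʒ": "E",
      "u": "F", "ʊ": "F", "w": "F",
      "f": "G", "v": "G",
      "l": "H", "n": "H", "ŋ": "H", "θ": "H", "ð": "H", "z": "H", "d": "H",
    };

    // Map each known symbol to a shape; unknown symbols become rest ("X").
    // Note: diphthongs (e.g. aɪ) span two code points and need real tokenizing.
    function ipaToShapes(ipa: string): string[] {
      return [...ipa].map(ch => ipaToShape[ch] ?? "X");
    }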


towfiqi commented Jun 15, 2019

You can try IBM Watson Speech to Text; it gives you an exact timestamp for each word, which Google Speech-to-Text doesn't.

And Rhubarb uses PocketSphinx, which is not that great. It may output good results if the audio quality is good and the pronunciation is clear.

For my project, my requirement is actually text-to-mouth-shape animation, which is why I don't have to convert audio to text first. I already have the text.

If you could share your method of converting text to mouth shapes, that would be great! On my attempt, I got lots of closed mouths between words, which made the animation look bad.

Edit: I just checked your link. Looks like Google STT does output timestamps, which I missed when I was researching.
