What are the inputs to the lip & audio models? #1
5 images across 0.2 seconds (25fps video)
Not sure why it says 111 but assuming a typo.
On page 3 of the Chung SyncNet paper you can see an example. It's a mouth crop but is somewhat large. You might want to use dlib landmarks to crop properly.
Inputs are the grayscale mouth crops + MFCC image, I believe.
Here's the Syncnet architecture: In my repo, I load the two branches separately - the audio model and the lips model. As can be seen there, the input to the lips model is 120x120x5 (but I take 112x112x5), i.e. 5 mouth images.
The point is to extract a good feature of mouth movement. So we take 5 consecutive frames (with some mouth movement) and make one feature out of them. Thus, the input to the lips model has 5 channels.
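For concreteness, a minimal sketch of how such a 112x112x5 input could be assembled from 5 pre-cropped mouth images, assuming channels-last Keras ordering; the function name and the /255 normalisation are illustrative, not taken from the repo:

```python
import cv2
import numpy as np

def make_lip_input(frame_paths):
    """Stack 5 consecutive grayscale mouth crops (0.2 s at 25 fps)
    into one 112x112x5 input for the lips branch."""
    assert len(frame_paths) == 5
    crops = []
    for path in frame_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # single-channel mouth crop
        img = cv2.resize(img, (112, 112))              # lips branch expects 112x112
        crops.append(img.astype(np.float32) / 255.0)   # illustrative normalisation
    # Channels-last: (112, 112, 5); add a batch axis for Keras
    return np.expand_dims(np.stack(crops, axis=-1), axis=0)
```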
The procedure to extract the mouth images is described in detail here: https://github.com/voletiv/lipreading-in-the-wild-experiments/tree/master/process-lrw. The coarse procedure is to detect the face, locate the mouth landmarks, and crop a square region around them (a rough sketch is given below). You can see an example of a mouth image in the picture above.
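For illustration, here is a rough dlib-based mouth crop along the lines suggested above; the margin factor and output size are assumptions, and the repo linked above handles this more carefully:

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# 68-point landmark model, downloaded separately from the dlib model zoo
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_mouth(frame, size=112, scale=1.8):
    """Crop a square region centred on the mouth landmarks (points 48-67)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    xs = [shape.part(i).x for i in range(48, 68)]
    ys = [shape.part(i).y for i in range(48, 68)]
    cx, cy = (min(xs) + max(xs)) // 2, (min(ys) + max(ys)) // 2
    half = int(scale * (max(xs) - min(xs)) / 2)   # generous margin around the lips
    crop = gray[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
    return cv2.resize(crop, (size, size))
```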
The relevant audio-feature details are mentioned in the paper.
I have used a library called speechpy to extract MFCC features. The function to extract MFCC features from a .wav file according to those instructions is speechpy's MFCC routine (see the sketch after the next paragraph).
Audio features are computed over a duration of audio. In the paper, it is mentioned that features are computed at 100 Hz, i.e. one frame every 0.010 seconds. Hence, frame_length=0.010 and frame_stride=0.010 (no overlap). According to the paper, audio features and video features are extracted for every 0.2 seconds. Hence, a 112x112x5 matrix is input to the lips model, and a 13x20 matrix is input to the audio model.
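A minimal sketch of that extraction, assuming speechpy's feature.mfcc and a mono .wav file; the chunking into 13x20 blocks simply follows the 0.2 s / 100 Hz arithmetic above and may differ from the exact preprocessing in the repo:

```python
import numpy as np
import speechpy
from scipy.io import wavfile

def extract_mfcc_chunks(wav_path):
    """MFCCs at 100 Hz (10 ms frames, no overlap), 13 coefficients,
    grouped into 0.2 s chunks -> one 13x20 matrix per video chunk."""
    rate, signal = wavfile.read(wav_path)
    if signal.ndim > 1:
        signal = signal[:, 0]                       # keep a single channel
    mfcc = speechpy.feature.mfcc(signal, sampling_frequency=rate,
                                 frame_length=0.010, frame_stride=0.010,
                                 num_cepstral=13)   # shape: (num_frames, 13)
    n_chunks = mfcc.shape[0] // 20
    # Transpose each 20-frame block to 13x20 to match the audio branch input
    return np.array([mfcc[i * 20:(i + 1) * 20].T for i in range(n_chunks)])
```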
@voletiv Thanks for the details.
To add to @edwinter's question, if you did do the
I haven't used Syncnet for audio-video sync; I used it to get good video features for lipreading and other tasks. I'm not sure about the reflection part. It is mentioned in the paper, but the heatmap representation in Figure 1 does not show it, and the input to the network is 13x20, so I'm not sure how they implemented that.

@michiyosony What you should do is calculate the contrastive loss (maybe Euclidean distance is not as good for this task) for synced and un-synced videos in your dataset, and then decide the threshold. As mentioned earlier, Syncnet has been trained on a specific dataset (LRW, i.e. BBC videos), so it might not be a perfect fit for your data. But the underlying principle is the same, so all that changes is the thresholds and the actual encoding values.
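As a small illustration of that suggestion, here is a hypothetical threshold sweep over distances computed on labelled synced/unsynced pairs; pick_threshold and the scoring rule are assumptions, not part of the repo, and a contrastive-loss margin could be swept the same way:

```python
import numpy as np

def pick_threshold(synced_dists, unsynced_dists):
    """Given distances between lip and audio embeddings for known-synced
    and known-unsynced pairs, sweep candidate thresholds and keep the one
    that best separates the two sets."""
    synced = np.asarray(synced_dists)
    unsynced = np.asarray(unsynced_dists)
    candidates = np.linspace(min(synced.min(), unsynced.min()),
                             max(synced.max(), unsynced.max()), 200)
    best_t, best_acc = None, 0.0
    for t in candidates:
        # Balanced accuracy: synced pairs below t, unsynced pairs at or above t
        acc = 0.5 * ((synced < t).mean() + (unsynced >= t).mean())
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```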
@michiyosony can you share your code?
Does that mean you are able to use syncnet to identify the active speaker because you get lower euclidean distances for the active speaker than for other faces in view? I am currently getting values that aren't usable for that. For example, I might get 10.1 on a correct audio/visual pair, and 10.2 on the same video with audio from a second later in the clip (false pair). Or the false pair might score 10.1 and the real pair 10.2. If you're able to get meaningful-enough scores to differentiate speakers from non-speakers, I'll take that as a sign that I'm doing something(s) wrong!
That's a good point. I'm trying to use it on the GRID dataset. I thought of the model as being good at generalizing, since the paper reports good results on foreign-language videos and the Columbia dataset, but testing it on one of the datasets mentioned in the paper is worth a try!
Can you explain this idea a bit more? My understanding (I'm new to machine learning) is that a loss function is used during training to measure how well the model is predicting and to help it improve (in this case, by minimizing the Euclidean distance for genuine pairs and maximizing it for false pairs). It looks like you need to know
@taewookim My MFCC image-generating code looks like this, but the PNGs it creates don't produce meaningful results when fed into syncnet.
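A rough sketch of such a pipeline (not the exact snippet from this comment), assuming speechpy for the MFCC step and matplotlib for writing the PNG:

```python
import matplotlib
matplotlib.use("Agg")                 # write files without a display
import matplotlib.pyplot as plt
import speechpy
from scipy.io import wavfile

def save_mfcc_png(wav_path, png_path):
    """Compute 13 MFCCs per 10 ms frame and dump them to a PNG heatmap.
    Note: saving to an image quantizes the raw coefficient values, which
    may be one reason the downstream results are not meaningful."""
    rate, signal = wavfile.read(wav_path)
    if signal.ndim > 1:
        signal = signal[:, 0]
    mfcc = speechpy.feature.mfcc(signal, sampling_frequency=rate,
                                 frame_length=0.010, frame_stride=0.010,
                                 num_cepstral=13)
    plt.imsave(png_path, mfcc.T, cmap="gray")   # 13 rows x num_frames columns
```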
Note that you need to import a library that provides an MFCC function and call it appropriately.
Thank you @michiyosony. A couple of issues..
Couple of questions
Entire block:
I have a somewhat working version of syncnet in Keras, but the Euclidean distance seems to be completely random (thanks to @voletiv - he patched up what I couldn't figure out): https://github.com/taewookim/syncnet-keras/blob/master/syncnet-runner.ipynb. In the "test" folder I added a whole bunch of unsynced video clips, basically English-dubbed Chinese movies where the lips and audio don't match, and the Euclidean distance seems to be random. Most of the unsynced videos do seem to converge to some scalar threshold value (it seems to be under 50 if unsynced), but there was one video with multiple speakers that scored over 80. If anyone can see where we went wrong, I'd appreciate feedback.
Thanks for this awesome repo @voletiv
Question about the input data shape
What is the 5 for? With a standard cv2.imread, the shape is usually (height, width, channels), where channels would be 3 (B, G, R). So why 5?
Is the input supposed to be the entire image? Or the entire face in the image? Or just the lips (as a region of interest) as determined by, say, dlib landmarks?
I am completely new to audio processing, so I am not even sure where to begin. What exactly am I supposed to pass to the model? I read the syncnet paper, but I'm still a bit confused.
As always, sample code is appreciated.