
What are the inputs to the lip & audio models? #1

Open
taewookim opened this issue Apr 9, 2018 · 10 comments

@taewookim

Thanks for this awesome repo @voletiv

Question about the input data shape:

1. You mentioned:

   SyncNet takes input images of size (112, 112, 5).

   What is the 5 for? With a standard cv2.imread, the shape is usually (height, width, channels), and channels would be 3 (R, G, B). So why 5?

2. Is the image supposed to be the entire frame? The entire face in the frame? Or just the lips (as a region of interest), as determined by, say, dlib landmarks?

3. I am completely new to audio processing, so I am not even sure where to begin. What exactly am I supposed to pass to the audio model? I read the SyncNet paper, but I'm still a bit confused.

As always, sample code is much appreciated.

@edwinter

edwinter commented Apr 9, 2018

What is the 5 for? With a standard cv2.imread, the shape is usually (height, width, channels), and channels would be 3 (R, G, B). So why 5?

5 images across 0.2 seconds (25fps video)
From the Chung paper you can see:

2.2 Visual stream
Representation. The input format to the visual network is a sequence of mouth
regions as grayscale images, as shown in Figure 1. The input dimensions are
111×111×5 (W×H×T) for 5 frames, which corresponds to 0.2-seconds at the
25Hz frame rate.

Not sure why it says 111, but I'm assuming it's a typo.

  2. Is the image supposed to be the entire frame? The entire face in the frame? Or just the lips (as a region of interest), as determined by, say, dlib landmarks?

On page 3 of the Chung SyncNet paper you can see an example. It's a mouth crop but is somewhat large. You might want to use dlib landmarks to crop properly.

  3. I am completely new to audio processing, so I am not even sure where to begin. What exactly am I supposed to pass to the audio model? I read the SyncNet paper, but I'm still a bit confused.

Inputs are the grayscale mouth crops + MFCC image I believe.

@voletiv
Owner

voletiv commented Apr 9, 2018

Here's the Syncnet architecture:

[Screenshot: the SyncNet architecture diagram, showing the audio branch and the lips branch]

In my repo, I load the two branches separately - audio model, and lips model.

As can be seen in the lips model, its input is 120x120x5 (but I take 112x112x5), i.e. 5 mouth images.

  1. What is the 5 for? With a standard cv2.imread, the shape is usually (height, width, channels), and channels would be 3 (R, G, B). So why 5?

The point is to extract a good feature of mouth movement, so we take 5 consecutive frames (spanning some mouth movement) and make one feature out of them. Thus, the input to the lips model has 5 channels.
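
For concreteness, a minimal sketch (assuming OpenCV and NumPy, and a list of 5 consecutive mouth-crop image paths; this is not code from the repo) of stacking the 5 grayscale frames into one (112, 112, 5) input:

import cv2
import numpy as np

def stack_mouth_frames(frame_paths):
    # SyncNet's lips branch expects 5 consecutive frames, i.e. 0.2 s at 25 fps
    assert len(frame_paths) == 5
    frames = []
    for path in frame_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # (H, W), single channel
        img = cv2.resize(img, (112, 112))
        frames.append(img)
    # stack time along the channel axis -> (112, 112, 5)
    return np.stack(frames, axis=-1).astype(np.float32)

With a batch dimension added (np.expand_dims(..., 0)), this is the shape the lips model consumes.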

  2. Is the image supposed to be the entire frame? The entire face in the frame? Or just the lips (as a region of interest), as determined by, say, dlib landmarks?

The procedure to extract the mouth images is described in detail here: https://github.com/voletiv/lipreading-in-the-wild-experiments/tree/master/process-lrw

The coarse procedure is:

1. Extract dlib landmarks.
2. Take the mouth bounding box as the bounding box of the mouth landmarks.
3. Make the bounding box square.
4. Expand the bounding box to ~0.7 times the width of the face.
5. Take the image within the bounding box.
6. Convert it to grayscale.
7. Resize it to 112x112.
8. Save that as the mouth image.

You can see an example of a mouth image in the picture above.
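
A rough sketch of that procedure (assuming dlib's standard 68-point landmark model, where points 48-67 are the mouth, plus OpenCV; this approximates the steps above and is not the exact code from the linked repo):

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')

def extract_mouth_image(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if len(faces) == 0:
        return None
    face = faces[0]
    landmarks = predictor(gray, face)
    # mouth landmarks are points 48-67 in the 68-point model
    mouth = np.array([(landmarks.part(i).x, landmarks.part(i).y) for i in range(48, 68)])
    cx, cy = mouth.mean(axis=0)
    # square box centred on the mouth, ~0.7 times the face width
    side = int(0.7 * face.width())
    x0 = max(int(cx - side / 2), 0)
    y0 = max(int(cy - side / 2), 0)
    crop = gray[y0:y0 + side, x0:x0 + side]
    return cv2.resize(crop, (112, 112))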

  3. I am completely new to audio processing, so I am not even sure where to begin. What exactly am I supposed to pass to the audio model? I read the SyncNet paper, but I'm still a bit confused.

The following is mentioned in the paper:

13 frequency bands are used at each time step. The features are computed at a sampling rate of 100Hz, giving 20 time steps for a 0.2-second input signal

I have used a library called speechpy to extract MFCC features. The speechpy call to extract MFCC features from a .wav file according to the paper's settings is:

speechpy.feature.mfcc(signal, sampling_frequency, frame_length=0.010, frame_stride=0.010, num_cepstral=13)

Audio features are computed over a window of audio. The paper says features are computed at 100 Hz, i.e. one feature vector every 0.010 seconds. Hence frame_length=0.010 and frame_stride=0.010 (no overlap).

According to the paper, audio features and video features are extracted for every 0.2 seconds.
Lip: 0.2 seconds => 0.2 * 25fps = 5 video frames
Audio: 0.2 seconds => 0.2 / 0.01(frame duration) = 20 audio frames

Hence, a 112x112x5 matrix is input to the lips model, and a 13x20 matrix is input to the audio model.
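
Putting the audio side together, a minimal sketch (assuming speechpy and scipy are installed; speechpy returns one row of 13 coefficients per 10 ms frame, so each 0.2-second window is 20 rows, transposed to 13x20):

import numpy as np
import scipy.io.wavfile as wav
import speechpy

def mfcc_windows(wav_file):
    # yields one 13x20 MFCC matrix per 0.2 s of audio,
    # lining up with 5 video frames at 25 fps
    rate, signal = wav.read(wav_file)
    feats = speechpy.feature.mfcc(signal, sampling_frequency=rate,
                                  frame_length=0.010, frame_stride=0.010,
                                  num_cepstral=13)       # shape (num_frames, 13)
    for start in range(0, feats.shape[0] - 19, 20):
        yield feats[start:start + 20].T                  # shape (13, 20)

Each yielded window then lines up with one (112, 112, 5) lip clip taken from the same 0.2 seconds of video.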

@edwinter

edwinter commented Apr 9, 2018

@voletiv Thanks for the details.
For the MFCC input, did you also follow the paper and reflect the top/bottom 3 rows of the image to reduce boundary effects?

@michiyosony

To add to @edwinter's question: if you did do the

The top and bottom three rows of the image are reflected to reduce boundary effects

would that mean swapping rows 0 and 12, swapping rows 1 and 11, and swapping rows 2 and 10? Or is it a horizontal reflection of the values in each of those rows? Or something else entirely?

calculate_euclidian_distance is giving me values between 10 and 11 regardless of whether I feed it aligned audio/mouth crops or unsynced audio/mouth crops. This makes me think I'm creating the MFCC PNG incorrectly. Or I could be creating the mouth crop incorrectly, but that one is easier to verify visually. What values do you get when feeding synced or unsynced data, @voletiv?

@voletiv
Owner

voletiv commented Apr 10, 2018

@edwinter @michiyosony

I haven't used SyncNet for audio-video sync; I used it to get good video features for lipreading and other tasks.

I'm not sure about the reflection part. It is mentioned in the paper, but the heatmap representation in Figure 1 does not show it, and the input to the network is 13x20, so I'm not sure how they implemented it.

@michiyosony What you should do is calculate the contrastive loss (maybe euclidean distance is not as good for this task) for synced and un-synced videos in your dataset, and then decide the threshold.

As mentioned earlier, Syncnet has been trained on a specific dataset (LRW, i.e. BBC videos), so it might not be a perfect fit for your data. But the underlying principle is the same, so all that changes is the thresholds and actual encoding values.
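
To make the thresholding idea concrete, a hedged sketch (the function names here are placeholders, not from the repo): compute the distance between the lip and audio embeddings for many clips you know are synced and many you know are unsynced, then pick a threshold that separates the two distributions on your own data.

import numpy as np

def embedding_distance(lip_feat, audio_feat):
    # Euclidean distance between the lip and audio embeddings of one 0.2 s window
    return float(np.linalg.norm(np.asarray(lip_feat) - np.asarray(audio_feat)))

def pick_threshold(synced_dists, unsynced_dists):
    # crude choice: midpoint of the two means; only meaningful if the
    # two distance distributions are actually separated on your data
    return (np.mean(synced_dists) + np.mean(unsynced_dists)) / 2.0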

@taewookim
Author

@michiyosony can you share your code?

@michiyosony

I used it to get good video features for lipreading and other tasks.

Does that mean you are able to use SyncNet to identify the active speaker because you get lower Euclidean distances for the active speaker than for other faces in view? I am currently getting values that aren't usable for that. For example, I might get 10.1 on a correct audio/visual pair, and 10.2 on the same video with audio taken from a second later in the clip (false pair). Or the false pair might score 10.1 and the real pair 10.2. If you're able to get scores meaningful enough to differentiate speakers from non-speakers, I'll take that as a sign that I'm doing something(s) wrong!

Syncnet has been trained on a specific dataset (LRW, i.e. BBC videos), so it might not be a perfect fit for your data.

That's a good point. I'm trying to use it on the GRID dataset. I thought of the model as generalizing well, since the paper reports good results on foreign-language videos and the Columbia dataset, but testing it on one of the datasets mentioned in the paper is worth a try!

What you should do is calculate the contrastive loss (maybe euclidean distance is not as good for this task)

Can you explain this idea a bit more? My understanding (I'm new to machine learning) is that a loss function is used during training to measure how well the model is predicting and to help it improve (in this case, by minimizing the Euclidean distance for genuine pairs and maximizing it for false pairs).
If we look at the equation for the loss function

E = (1/2N) · Σₙ [ yₙ dₙ² + (1 − yₙ) max(margin − dₙ, 0)² ],  with dₙ = ‖vₙ − aₙ‖₂

it looks like you need to know y (whether the video is synced or not) to be able to use it, which would make sense during training but not when trying to use the model.

@michiyosony

@taewookim My MFCC image-generating code looks like this, but the PNGs it creates don't produce meaningful results when fed into SyncNet:

#import [a mfcc library like speechpy or python_speech_features]
import scipy.io.wavfile as wav
from PIL import Image
from file_util import createDirIfNotExist


# max number of frames in each output
# each output should contain 0.2sec worth of mfcc
EACH_MFCC_OUTPUT_FRAME_SIZE = 20


def extract_mfcc_series(wav_file, target_dir):
    createDirIfNotExist(target_dir)
    (rate, sig) = wav.read(wav_file)

    try:
        mfcc_feat = mfcc(sig, rate, [other arguments go here])
    except IndexError:
        print("index error occurred while extracting mfcc")
        return
    print('sample_rate: {}, mfcc_feat length: {}, mfcc_feat[0] length: {}'.format(rate, len(mfcc_feat), len(mfcc_feat[0])))
    num_output = len(mfcc_feat) / EACH_MFCC_OUTPUT_FRAME_SIZE
    num_output += 1 if (len(mfcc_feat) % EACH_MFCC_OUTPUT_FRAME_SIZE > 0) else 0
    for index in xrange(num_output):
        img = Image.new('RGB', (20, 13), "black")
        pixels = img.load()
        for i in range(img.size[0]):
            for j in range(img.size[1]):
                frame_index = index * EACH_MFCC_OUTPUT_FRAME_SIZE + i
                try:
                    if mfcc_feat[frame_index][j] < 0:
                        red_amount = min(255, 255 * (mfcc_feat[frame_index][j] / -20))
                        pixels[i, j] = (int(red_amount), 0, 0)
                    elif (mfcc_feat[frame_index][j] > 0):
                        blue_amount = min(255, 255 * (mfcc_feat[frame_index][j] / 20))
                        pixels[i, j] = (0, 0, int(blue_amount))
                except IndexError:
                    print("index error occurred while extracting mfcc @ " + str(frame_index) + "," + str(j))
                    break
        img.save("{}/mfcc_{:03d}.png".format(target_dir, index), 'PNG')

Note that you need to import a library that provides an mfcc function and call it appropriately.
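
As an aside, a small sketch (assuming NumPy and PIL) of how one of these PNGs could be read back into a 13x20 float matrix before feeding the audio model; note that PIL's size argument (20, 13) is (width, height), so the decoded array already has 13 rows and 20 columns:

import numpy as np
from PIL import Image

def mfcc_png_to_array(png_path):
    # invert the red-for-negative / blue-for-positive encoding used above;
    # values were scaled by 20 and clipped when written, so this is approximate
    rgb = np.asarray(Image.open(png_path).convert('RGB'), dtype=np.float32)  # (13, 20, 3)
    red, blue = rgb[..., 0], rgb[..., 2]
    return (blue - red) / 255.0 * 20.0                                       # (13, 20)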

@taewookim
Author

taewookim commented Apr 15, 2018

thank you @michiyosony

A couple of issues. I am using speechpy's MFCC feature extraction:

mfcc_feat = speechpy.feature.mfcc(signal, sampling_frequency=fs, frame_length=0.010, frame_stride=0.01)

A couple of questions:

  1. You're using (20, 13), but according to @voletiv, you need (13, 20)...?
  2. You're catching and ignoring the IndexError... won't that lose information in the input data to the audio model?

Entire block:

def extract_mfcc_series(wav_file, target_dir):
	(rate, sig) = wav.read(wav_file)

	try:
		mfcc_feat = speechpy.feature.mfcc(sig, sampling_frequency=rate, frame_length=0.010, frame_stride=0.01)
	except IndexError:
		print("index error occurred while extracting mfcc")
		return
	print('sample_rate: {}, mfcc_feat length: {}, mfcc_feat[0] length: {}'.format(rate, len(mfcc_feat), len(mfcc_feat[0])))
	num_output = len(mfcc_feat) / EACH_MFCC_OUTPUT_FRAME_SIZE
	num_output += 1 if (len(mfcc_feat) % EACH_MFCC_OUTPUT_FRAME_SIZE > 0) else 0
	
	# print(mfcc_feat.shape)
	# input(int(num_output))
	for index in range(int(num_output)):
		img = Image.new('RGB', (20, 13), "black")
		pixels = img.load()
		for i in range(img.size[0]):
			for j in range(img.size[1]):
				frame_index = index * EACH_MFCC_OUTPUT_FRAME_SIZE + i
				# print(frame_index)
				try:
					if mfcc_feat[frame_index][j] < 0:
						red_amount = min(255, 255 * (mfcc_feat[frame_index][j] / -20))
						pixels[i, j] = (int(red_amount), 0, 0)
					elif (mfcc_feat[frame_index][j] > 0):
						blue_amount = min(255, 255 * (mfcc_feat[frame_index][j] / 20))
						pixels[i, j] = (0, 0, int(blue_amount))
				except IndexError:
					print("index error occurred while extracting mfcc @ " + str(frame_index) + "," + str(j))
					break
		img.save("{}/mfcc_{:03d}.png".format(target_dir, index), 'PNG')


@taewookim
Author

I have a somewhat working version of SyncNet in Keras, but the Euclidean distance seems to be completely random (thanks to @voletiv, who patched things up where I couldn't figure them out):

https://github.com/taewookim/syncnet-keras/blob/master/syncnet-runner.ipynb

In the "test" folder, I added a whole bunch of unsynced video clips: basically English-dubbed Chinese movies, where the lips and audio don't match.

The Euclidean distance seems to be random. Most of the (unsynced) videos do seem to stay under some scalar threshold value (it seems to be under 50 if unsynced), but there was one video with multiple speakers that scored over 80.

If anyone can see where we went wrong, I would appreciate feedback.
