
SVD features processing #9

Open
simonefrancia opened this issue Apr 24, 2020 · 7 comments

Comments

@simonefrancia

simonefrancia commented Apr 24, 2020

Hi @kyungyunlee ,
thanks for your repo and your ideas.
I am new to the field of audio processing, and I would like to know how the features are treated in the preprocessing phase, in the specific case of CNN training for SVD. I would also like to know how the predictions can be mapped back to the input features, so that I can apply a mask (0 if no voice, 1 if voice) and reconstruct an audio file of the same length, where you hear the voice wherever the prediction is VOICE and silence wherever it is NO_VOICE.

SR = 22050
FRAME_LEN = 1024
HOP_LENGTH = 315
CNN_INPUT_SIZE = 115 # 1.6 sec
CNN_OVERLAP = 5 # Hopsize of 5 for training, 1 for inference
N_MELS = 80
CUTOFF = 8000 # fmax = 8kHz

Step 1: Load an audio file as a floating-point time series.
y, _ = librosa.load("audio.mp3", sr=22050)

From this I get 4280832 samples.

Step 2: Mel spectrogram
x = log_melgram(y, SR, FRAME_LEN, HOP_LENGTH, N_MELS, 27.5, CUTOFF)  # fmin = 27.5 Hz, fmax = 8 kHz

The resulting x has shape (80, 13590).
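As a sanity check on that shape (this assumes librosa's default center=True padding, which is not stated in the thread), the frame count follows directly from the hop length:

```python
# Hypothetical check of the (80, 13590) shape: with librosa's default
# center=True padding, the number of frames is 1 + n_samples // hop_length.
n_samples = 4280832   # sample count reported above for the 22050 Hz file
hop_length = 315
n_frames = 1 + n_samples // hop_length
print(n_frames)  # 13590, matching the second axis of x
```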

Step 3: Segmentation
total_x = []
for i in range(0, x.shape[1] - CNN_INPUT_SIZE, 1):
    x_segment = x[:, i: i + CNN_INPUT_SIZE]
    # pick the center frame label
    total_x.append(x_segment)

# Normalization (mean and std are the training-set statistics)
X = (np.array(total_x) - mean) / std
X = np.expand_dims(X, axis=3)

After this step X has shape (13475, 80, 115, 1).

So these are the main steps in order to get X that is fed to the network.
What is not clear to me is the transition between Step 2 and Step 3: why does the (80, 13590) array become (13475, 80, 115), and what does that mean?
I think that is also the key point for understanding how to go back to the audio, so that I can apply the network's SVD prediction and build an audio file of the same original length.
The VAD prediction has shape (13475, 1).

Thank you very much

@simonefrancia simonefrancia changed the title SVD SVD features processing Apr 24, 2020
@kyungyunlee
Owner

Hi :)
The idea behind this singing voice detection system is to determine whether there is a singing voice or not in each input segment (1.6 seconds of audio). This means that if we want to analyze a longer audio file, we need to divide it into 1.6-second segments. Step 3 is where I do this: I just loop through the time dimension of the mel spectrogram and cut it into 1.6-second segments. At the end, there are 13475 segments of size (80, 115).
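The loop in Step 3 can be sketched on a toy array (the toy shapes are illustrative; only the 13590, 115, and 13475 numbers come from this thread):

```python
import numpy as np

# Toy "mel spectrogram": 80 mel bins x 20 frames, window width 5
# (the real run above uses 13590 frames and a width of 115).
x = np.arange(80 * 20, dtype=float).reshape(80, 20)
width = 5

# Slide the window one frame at a time, as the segmentation loop does.
segments = [x[:, i:i + width] for i in range(0, x.shape[1] - width, 1)]
X = np.stack(segments)
print(X.shape)      # (15, 80, 5): n_frames - width segments

# The same formula with the thread's numbers:
print(13590 - 115)  # 13475 segments of shape (80, 115)
```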

After running the prediction, you will be able to identify the frames where the prediction is 0 or 1. From there, you need to convert frames to seconds, using a function like librosa.frames_to_time.
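A minimal NumPy-only sketch of that conversion and of the masking the asker described (the per-segment predictions and the waveform here are stand-ins, and the center-frame offset is an assumption; `librosa.frames_to_time` computes the same `frame * hop_length / sr` mapping):

```python
import numpy as np

SR, HOP_LENGTH, CNN_INPUT_SIZE = 22050, 315, 115

# Hypothetical per-segment predictions (0 = no voice, 1 = voice); with a
# one-frame hop at inference, prediction i is centered on frame i + 115 // 2.
preds = np.array([0, 0, 1, 1, 1, 0])
center_frames = np.arange(len(preds)) + CNN_INPUT_SIZE // 2
times = center_frames * HOP_LENGTH / SR  # what frames_to_time would return

# Upsample the frame labels to a sample-level mask and silence the
# no-voice regions of a (stand-in) waveform of matching length.
mask = np.repeat(preds, HOP_LENGTH).astype(float)
y = np.ones(len(mask))  # stand-in audio of the same length as the mask
y_masked = y * mask     # voiced samples kept, unvoiced samples zeroed
```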

I hope this helps. Let me know if there are more questions!

@simonefrancia
Author

Hi @kyungyunlee, thanks for your response.
So is one label assigned for every 1.6 seconds?
What I don't understand is why, if I have an audio file with a duration of 194 seconds (the duration of the example above), the preprocessing produces 13475 segments of 1.6 seconds each.
Thanks

@kyungyunlee
Owner

@simonefrancia Hi, yes, it's a single binary label for each 1.6 seconds, but there is overlap during training, so there will be more than 194/1.6 segments, for instance.
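Worked out with the numbers from this thread (the 194 s duration is the asker's example file):

```python
SR, HOP_LENGTH, CNN_INPUT_SIZE = 22050, 315, 115

seg_sec = CNN_INPUT_SIZE * HOP_LENGTH / SR
print(round(seg_sec, 3))  # ~1.643 s per segment

# Tiling 194 s without any overlap would give only ~118 segments...
print(194 / seg_sec)

# ...but sliding the 115-frame window one frame at a time yields one
# segment per starting position: 13590 - 115 = 13475.
print(13590 - 115)
```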

@simonefrancia
Author

Ok. But is it possible to be more precise, for example to get a prediction every 100 ms?
Is it possible to do this by only changing some training config?
Thank you

@kyungyunlee
Owner

@simonefrancia Sure, but I think 100 ms is way too short to determine whether the input contains singing voice or not. The main characteristic to detect is vibrato, and 100 ms doesn't seem long enough for that. Typically the input is around 1 second, which makes sense from a human's perspective as well. Feel free to try :)

@simonefrancia
Author

simonefrancia commented May 4, 2020

@kyungyunlee in your opinion, what is the smallest segment duration we can use for training?
Thanks

@kyungyunlee
Owner

I am not sure, since I haven't tried using shorter inputs. I think at this point you have to define it in terms of your task goal.
