
real-time example #2 (Open)
alirezag opened this issue Oct 19, 2018 · 6 comments

@alirezag commented Oct 19, 2018

Hi, I'm new to WORLD. It is obviously awesome software, but I was wondering how I can use it in real time, since that is the main point of the original paper. I'm doing something like this now, but the result is very choppy:

import numpy as np

def apply(vocoder, fs, x_int16):
    # normalize int16 samples to [-1, 1]
    x = x_int16 / (2 ** 15 - 1)

    # analysis with f0_method='harvest'
    # (is_requiem=True would enable requiem analysis and synthesis)
    dat = vocoder.encode(fs, x, f0_method='harvest')

    # shift pitch up by a factor of 1.5
    dat = vocoder.scale_pitch(dat, 1.5)

    # synthesis; scale back to int16 with 2**15 - 1 to match the
    # normalization above and avoid int16 overflow
    dat = vocoder.decode(dat)
    return (dat['out'] * (2 ** 15 - 1)).astype(np.int16)

I'm applying the function to 1024 bytes of the stream that I get. Any ideas how I can improve this?

@tuanad121 (Owner) commented Oct 19, 2018

Hi Alireza, thanks for the question. It's interesting. ^^
Would you mind elaborating on how your result sounds? Honestly, I've never tried my version in real time, so I'm not sure how it behaves. ^^
My guess is that it's because WORLD uses pitch-synchronous windows: when the pitch is low, the corresponding window is long, and I'm not sure what happens when the window length exceeds the input length.
Another thing is that Python is not as fast as C in for loops, so my Python version is slower than the original C version. The Harvest module for F0 extraction is the slowest one, while the other modules are quite fast. You could use a faster F0 extraction method instead (e.g. set f0_method='dio'). I haven't thought about real-time processing yet ^^. I will take a look at the original work to see what they did. ^^
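
If you want to see where the time goes, you can time encode() per F0 method on the same clip. A rough sketch (wavread stands for whatever WAV loader you use, and the package layout is the one from this repo):

import time

from world import main  # package layout as in the traceback later in this thread

fs, x_int16 = wavread('test-mwm.wav')  # any short mono int16 clip
x = x_int16 / (2 ** 15 - 1)

vocoder = main.World()
for method in ('dio', 'harvest'):
    t0 = time.perf_counter()
    vocoder.encode(fs, x, f0_method=method)
    print(f"{method}: {time.perf_counter() - t0:.2f} s")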

@alirezag (Author)

Hmm, thanks for the comment. It looks like when I switch to dio it's fast enough for real time, but the quality of the reconstruction is not good. I don't really know much about acoustics and voice, so if you have ideas about how to improve it, that would be great. My guess is that running the algorithm on chunks of length 2048 and then piecing them together doesn't work very well, probably because the residuals at the boundary of each chunk don't match up nicely with the succeeding chunk.

Here is the new code:


import numpy as np

def apply(vocoder, fs, x_int16):
    # normalize int16 samples to [-1, 1]
    x = x_int16 / (2 ** 15 - 1)

    # analysis with f0_method='dio', which is faster than 'harvest'
    # (is_requiem=True would enable requiem analysis and synthesis)
    dat = vocoder.encode(fs, x, f0_method='dio')

    if 1:  # global pitch scaling (factor 1, i.e. identity, to test resynthesis)
        dat = vocoder.scale_pitch(dat, 1)
    if 0:  # global duration scaling
        dat = vocoder.scale_duration(dat, 2)
    if 0:  # fine-grained duration modification
        vocoder.modify_duration(dat, [1, 1.5], [0, 1, 3, -1])  # TODO: look into this

    # synthesis
    dat = vocoder.decode(dat)
    return (dat['out'] * (2 ** 15 - 1)).astype(np.int16)

Here is how I loop over the audio:

import numpy as np

while data:
    # unpack the raw bytes into int16 samples
    x_int16 = np.frombuffer(data, dtype=np.int16)

    # apply the conversion
    rx_int16 = apply(vocoder, fs, x_int16)

    # pack the reconstructed samples back into raw bytes
    rdata = rx_int16.tobytes()

    # write to the audio output stream
    stream.write(rdata)

    # also write to the output file
    fw.writeframes(rdata)

    # read the next chunk
    data = f.readframes(chunk)

Note that I'm just running the pitch shifter without actually changing the pitch; I'm only trying to see whether this chunk-by-chunk approach works at all. I really need to apply this in real time.
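
If the chunk boundaries are the problem, maybe I could feed in chunks that overlap and crossfade the overlap region when stitching the outputs back together. A minimal generic overlap-add sketch (the overlap length is a guess on my part, and nothing here is WORLD-specific):

import numpy as np

OVERLAP = 256  # samples shared by consecutive chunks; tune by ear

def stitch(prev_tail, chunk, overlap=OVERLAP):
    # Crossfade the first `overlap` samples of `chunk` against the held-back
    # tail of the previous chunk; return (samples to emit, new tail).
    chunk = chunk.astype(np.float64)
    if prev_tail is not None:
        fade = np.linspace(0.0, 1.0, overlap)
        chunk[:overlap] = prev_tail * (1.0 - fade) + chunk[:overlap] * fade
    return chunk[:-overlap], chunk[-overlap:]

Each new chunk read from the stream would then start OVERLAP samples before the previous one ended, and the held-back tail only gets emitted once it has been crossfaded with the next chunk's head.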

Here is the original audio:

test-mwm.zip

Here is the reconstructed audio:

test-mwm-resyn.zip


@tuanad121 (Owner) commented Oct 23, 2018

> Hmm, thanks for the comment. It looks like when I switch to dio it's fast enough for real time, but the quality of the reconstruction is not good.

You're right that DIO doesn't work as well as the Harvest algorithm. The reason is that DIO sometimes misclassifies voiced/unvoiced (V/UV) frames; in unvoiced frames, F0 is 0 and the excitation signal is set to noise. Would you mind trying f0_method='swipe'? I realized Harvest is slow and DIO is not as good, so I added support for another algorithm called SWIPE. Hopefully it helps.
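
To see how much the V/UV decisions differ between the extractors, you could count voiced frames per method. A rough sketch, assuming the encoded dict stores the F0 track under a 'f0' key:

import numpy as np

from world import main

vocoder = main.World()
# fs, x: a normalized mono clip, as in the snippets above

for method in ('dio', 'harvest', 'swipe'):
    dat = vocoder.encode(fs, x, f0_method=method)
    f0 = np.nan_to_num(np.asarray(dat['f0'], dtype=float))  # treat NaN as unvoiced
    print(f"{method}: {np.count_nonzero(f0 > 0)}/{f0.size} voiced frames")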

> But I don't really know much about acoustics and voice, so if you have ideas about how to improve it, that would be great. My guess is that running the algorithm on chunks of length 2048 and then piecing them together doesn't work very well, probably because the residuals at the boundary of each chunk don't match up nicely with the succeeding chunk.

Mmm, I will take a look at the program and come back soon.

@alirezag (Author)

Thanks @tuanad121. This is the error I get when I switch to swipe:

  File "c:\github-temmp\Python-WORLD\world\cheaptrick.py", line 88, in calculate_windowed_waveform
    half_window_length = int(1.5 * fs / f0 + 0.5)
ValueError: cannot convert float NaN to integer

I looked into it; f0 is coming up NaN. The swipe() function returns an array of NaNs for the f0 field of its return value. The input file is the test-mwm.zip I included above.

Here is how I'm calling the vocoder:

from pathlib import Path

from world import main  # package layout as in the traceback above

wav_path = Path('get-samples/test-mwm.wav')
fs, x_int16 = wavread(wav_path)  # wavread: the WAV loader used in the snippets above
x = x_int16 / (2 ** 15 - 1)

vocoder = main.World()

# analysis (is_requiem=False: requiem analysis and synthesis disabled)
dat = vocoder.encode(fs, x, f0_method='swipe', is_requiem=False)

@tuanad121 (Owner)
> I looked into it; f0 is coming up NaN. The swipe() function returns an array of NaNs for the f0 field of its return value. The input file is the test-mwm.zip I included above.

Thanks @alirezag for pointing it out. I have fixed the problem in SWIPE. Basically, SWIPE uses NaN to mark unvoiced frames, while WORLD uses zeros to mark them; I had failed to map the NaNs to zero in SWIPE's output.
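
The fix boils down to zeroing the NaNs before the rest of the pipeline sees the F0 track, something like this (a sketch; the variable names in the actual code differ):

import numpy as np

def swipe_f0_to_world(f0):
    # Map SWIPE's NaN markers for unvoiced frames to the 0 that WORLD expects.
    return np.where(np.isnan(f0), 0.0, np.asarray(f0, dtype=float))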
