Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion [ICML 2024]

Project page | Arxiv | Text-Based Space

This repository contains the official code release for Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion.

Requirements

python -m pip install -r requirements.txt

Usage Example

Supported models are AudioLDM, TANGO, and AudioLDM2. For unsupervised editing, Stable Diffusion is also supported.

Text-Based Editing

CUDA_VISIBLE_DEVICES=<gpu_num> python main_run.py --cfg_tar <target_cfg_strength> --cfg_src <source_cfg_strength> --init_aud <input_audio_path> --target_prompt <description of the wanted edited signal> --tstart <edit from timestep> --model_id <model_name> --results_path <path to dump results>

You can supply a source prompt that describes the original audio by using --source_prompt.
Use python main_run.py --help for all options.

Use --mode ddim to run DDIM inversion and editing. In this mode, --tstart must equal num_diffusion_steps (200 by default).
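
When scripting many edits, it can help to assemble the command in Python. The following is a minimal sketch and not part of this repository: the helper name build_edit_command and every argument value are hypothetical, and only the flags themselves come from the usage line above. It also checks the --mode ddim constraint before anything is launched.

```python
import shlex


def build_edit_command(init_aud, target_prompt, model_id, results_path,
                       tstart, cfg_tar, cfg_src, source_prompt=None,
                       mode=None, num_diffusion_steps=200):
    """Assemble the argv list for main_run.py (hypothetical helper)."""
    # DDIM mode requires --tstart to equal num_diffusion_steps.
    if mode == "ddim" and tstart != num_diffusion_steps:
        raise ValueError(
            "--mode ddim requires --tstart to equal num_diffusion_steps"
        )
    cmd = [
        "python", "main_run.py",
        "--cfg_tar", str(cfg_tar),
        "--cfg_src", str(cfg_src),
        "--init_aud", init_aud,
        "--target_prompt", target_prompt,
        "--tstart", str(tstart),
        "--model_id", model_id,
        "--results_path", results_path,
    ]
    # Optional flags are added only when the caller asks for them.
    if source_prompt is not None:
        cmd += ["--source_prompt", source_prompt]
    if mode is not None:
        cmd += ["--mode", mode]
    return cmd


# All values below are placeholders chosen for illustration only.
cmd = build_edit_command(
    init_aud="input.wav",
    target_prompt="a jazz trio playing softly",
    model_id="<model_name>",
    results_path="results",
    tstart=100,
    cfg_tar=12,
    cfg_src=3,
)
print(shlex.join(cmd))
```

The returned list can be handed to subprocess.run(cmd, check=True), with CUDA_VISIBLE_DEVICES set in the environment passed to it. Run python main_run.py --help to confirm which values each flag accepts.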

Unsupervised Editing

First, extract the principal components (PCs) for the timesteps you want:

CUDA_VISIBLE_DEVICES=<gpu_num> python main_pc_extract_inv.py --init_aud <input_audio_path> --model_id <model_name> --results_path <path to dump results> --drift_start <start extraction timestep> --drift_end <end extraction timestep> --n_evs <amount of evs to extract>

You can supply a source prompt that describes the original audio by using --source_prompt.

Then apply the PCs:

CUDA_VISIBLE_DEVICES=<gpu_num> python main_pc_apply_drift.py --extraction_path <path to extracted .pt file> --drift_start <timestep to start apply> --drift_end <timestep to end apply> --amount <edit strength> --evs <ev nums to apply>

Use --use_specific_ts_pc <timestep num> to take the PCs from a specific timestep $t$ that differs from the timestep $t'$ at which they are applied.
Add --combine_evs to apply all the given PCs together.
Setting --evals_pt to an empty value makes the script read the eigenvalues from the extraction path. This only works if the timesteps being applied were also run during extraction.

Use python main_pc_extract_inv.py --help and python main_pc_apply_drift.py --help for all options.

To recreate the random vectors baseline, use --rand_v. Image samples can be recreated using images_pc_extract_inv.py and images_pc_apply_drift.py.
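
The two steps above (extract, then apply) can be chained from a script. The following is a sketch under stated assumptions and not part of this repository: both helper names are hypothetical, all values are placeholders, and --evs is assumed to accept one or more space-separated PC numbers. The path to the extracted .pt file is passed explicitly because its name is not specified here.

```python
def build_extract_command(init_aud, model_id, results_path,
                          drift_start, drift_end, n_evs, source_prompt=None):
    """Step 1: argv for main_pc_extract_inv.py (hypothetical helper)."""
    cmd = [
        "python", "main_pc_extract_inv.py",
        "--init_aud", init_aud,
        "--model_id", model_id,
        "--results_path", results_path,
        "--drift_start", str(drift_start),
        "--drift_end", str(drift_end),
        "--n_evs", str(n_evs),
    ]
    if source_prompt is not None:
        cmd += ["--source_prompt", source_prompt]
    return cmd


def build_apply_command(extraction_path, drift_start, drift_end, amount, evs,
                        combine_evs=False):
    """Step 2: argv for main_pc_apply_drift.py (hypothetical helper)."""
    cmd = [
        "python", "main_pc_apply_drift.py",
        "--extraction_path", extraction_path,
        "--drift_start", str(drift_start),
        "--drift_end", str(drift_end),
        "--amount", str(amount),
        # Assumption: --evs takes one or more space-separated PC numbers.
        "--evs", *[str(ev) for ev in evs],
    ]
    if combine_evs:
        cmd.append("--combine_evs")
    return cmd


# Placeholder values for illustration; they are not recommended settings.
extract_cmd = build_extract_command(
    init_aud="input.wav", model_id="<model_name>", results_path="results",
    drift_start=100, drift_end=50, n_evs=3,
)
apply_cmd = build_apply_command(
    extraction_path="<path to extracted .pt file>",
    drift_start=100, drift_end=50, amount=40, evs=[1, 2], combine_evs=True,
)
print(extract_cmd)
print(apply_cmd)
```

Run the extraction command to completion before the apply command, for example with subprocess.run(extract_cmd, check=True) followed by subprocess.run(apply_cmd, check=True).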

SDEdit

SDEdit can be run similarly with:

CUDA_VISIBLE_DEVICES=<gpu_num> python main_run_sdedit.py --cfg_tar <target_cfg_strength> --init_aud <input_audio_path> --target_prompt <description of the wanted edited signal> --tstart <edit from timestep> --model_id <model_name> --results_path <path to dump results>

Use python main_run_sdedit.py --help for all options.

Image samples can be recreated using images_run_sdedit.py.

Evaluation

We provide the code we used to run the LPAPS-, CLAP-, and FAD-based evaluations. It is adapted from several repositories:

FAD is from microsoft/fadtk.
LPAPS is adapted from richzhang/PerceptualSimilarity.
CLAP is adapted from facebookresearch/audiocraft.

The full code is written for our directory structure and is provided as a usage example.
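
For orientation, a CLAP-based score is commonly computed as the cosine similarity between a text embedding and an audio embedding. The sketch below shows only that final step on made-up vectors. It does not call the CLAP model, the function name is hypothetical, and it is not the evaluation code in this repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors; real CLAP embeddings are much larger.
text_embedding = [1.0, 0.0, 1.0]
audio_embedding = [1.0, 1.0, 0.0]
score = cosine_similarity(text_embedding, audio_embedding)
print(score)
```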

MedleyMDPrompts

The MedleyMDPrompts dataset contains manually labeled prompts for the MusicDelta subset of the MedleyDB dataset (Bittner et al., 2014). The MusicDelta subset comprises 34 musical excerpts in varying styles, with lengths ranging from 20 seconds to 5 minutes.
The prompts dataset includes 3-4 source prompts for each signal, and 3-12 editing target prompts for each source prompt, totalling 107 source prompts and 696 target prompts.
In captions_targets.csv, the can_be_used_without_source column indicates whether a target prompt was written to stand alone rather than to complement a source prompt, and should therefore provide enough information to edit a signal on its own. This is only a guideline; you may find that for your application all target prompts are sufficient on their own.
The source_caption_index column gives the 1-based position, in order, of the source prompt for the same signal that this target prompt relates to. It can be used together with can_be_used_without_source.
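
The two columns described above can be combined to pair each target prompt with its source prompt. The following is a minimal sketch on an in-memory stand-in for captions_targets.csv. Only can_be_used_without_source and source_caption_index are documented column names; the signal and target_caption columns, the sample rows, the True/False encoding, and the layout of the source prompts are all assumptions made for illustration.

```python
import csv
import io

# Stand-in for captions_targets.csv (hypothetical rows and extra columns).
SAMPLE_TARGETS = """\
signal,source_caption_index,target_caption,can_be_used_without_source
track_a,1,hypothetical target prompt one,True
track_a,1,hypothetical target prompt two,False
track_a,2,hypothetical target prompt three,True
"""

# Hypothetical source prompts per signal, kept in their original order.
SOURCES = {
    "track_a": [
        "hypothetical source prompt one",
        "hypothetical source prompt two",
    ],
}


def pair_targets_with_sources(targets_file, sources, standalone_only=False):
    """Return (source_prompt, target_prompt) pairs from a targets CSV."""
    pairs = []
    for row in csv.DictReader(targets_file):
        # Assumption: the flag is stored as the text "True" or "False".
        standalone = row["can_be_used_without_source"].strip().lower() == "true"
        if standalone_only and not standalone:
            continue
        # source_caption_index starts from 1, so shift to a 0-based index.
        idx = int(row["source_caption_index"]) - 1
        pairs.append((sources[row["signal"]][idx], row["target_caption"]))
    return pairs


pairs = pair_targets_with_sources(io.StringIO(SAMPLE_TARGETS), SOURCES)
standalone = pair_targets_with_sources(
    io.StringIO(SAMPLE_TARGETS), SOURCES, standalone_only=True
)
print(len(pairs), len(standalone))
```

To run this on the real file, replace the io.StringIO object with open(path, newline="") and adjust the assumed column names and flag encoding to match the actual header.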

Citation

If you use this code or the MedleyMDPrompts dataset for your research, please cite our paper:

@article{manor2024zeroshot,
    title={Zero-Shot Unsupervised and Text-Based Audio Editing Using {DDPM} Inversion},
    author={Manor, Hila and Michaeli, Tomer},
    journal={arXiv preprint arXiv:2402.10009},
    year={2024},
}

Acknowledgements

Parts of this code are heavily based on DDPM Inversion and on Gaussian Denoising Posterior.

AudioLDM2 is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Therefore, use of the AudioLDM2 weights (the default) and of code originating in the code/audioldm folder falls under the same license (e.g., utils.py:load_audio uses code from code/audioldm).
The rest of the code (inversion, PCs computation) is licensed under an MIT license.

The evaluation code adapts code from differently licensed repos:

FAD is from microsoft/fadtk, under MIT License.
LPAPS is adapted from richzhang/PerceptualSimilarity, under BSD-2-Clause License.
CLAP's weights are from LAION-AI/CLAP, under CC0-1.0 License.
CLAP's processing code is adapted from facebookresearch/audiocraft, under MIT License.

Our MedleyMDPrompts dataset is licensed under CC-BY-4.0 License.
