EDMSound

[Code Base] [arxiv paper]

Abstract Diffusion models have showcased their capabilities in audio synthesis ranging over a variety of sounds. Exsiting models often operate on the latent domain with cascaded phase recovery modules to reconstruct waveform. It potentially introduces challenges in generating high-fidelity audio. In this project, we propose EDMSound, a diffusion-based generative model in spectrogram domain under the framework of elucidated diffusion models (EDM). Combining with efficient deterministic sampler, we achieved similar Fréchet audio distance (FAD) score as top-ranked baseline with only 10 steps and reached state-of-the-art performance with 50 steps on the DCASE2023 foley sound generation benchmark. We also revealed a potential concern regarding diffusion based audio generation models that they tend to generate samples with high perceptual similarity to the data from training data.

Section 1: Audio Generation with EDMSound

Unconditional Generation Results on SC09*

Model	Zero	Three	Five	Six	Seven	Eight
Training Sample
EDMSound

* Note: we generate the above samples using EDMSound unconditionally and manually label them for the table.

Conditional Generation Results on DCASE2023 Task-7

Category	Training Sample	EDMSound
Dog Bark
Footstep
Gunshot
Keyboard
Moving Motor Vehicle
Rain
Sneeze, Cough

Section 2: Replication Detectoin

Model	Generated Sample	Top-1 Matched Training Sample
EDMSound


Scheibler et al


Yi et al


Jung et al

References

[1] R. Scheibler, T. Hasumi, Y. Fujita, T. Komatsu, R. Yamamoto, and K. Tachibana. "Class-conditioned latent diffusion model for dcase 2023 foley sound synthesis challenge." Technical report, Tech. Rep., June, 2023.

[2] Y. Yuan, H. Liu, X. Liu, X. Kang, M. D. Plumbley, and W. Wang. "Latent diffusion model based foley sound generation system for dcase challenge 2023 task 7." arXiv preprint arXiv:2305.15905, 2023.

[3] H. C. Chung, Y. Lee, and J. H. Jung. "Foley sound synthesis based on gan using contrastive learning without label information." Technical report, Tech. Rep., June, 2023.

[4] Warden, Pete. "Speech commands: A dataset for limited-vocabulary speech recognition." arXiv preprint arXiv:1804.03209 (2018).

[5] Choi, Keunwoo, Jaekwon Im, Laurie Heller, Brian McFee, Keisuke Imoto, Yuki Okamoto, Mathieu Lagrange, and Shinosuke Takamichi. "Foley sound synthesis at the dcase 2023 challenge." arXiv preprint arXiv:2304.12521 (2023).