A Feature-Enhanced Diffusion Model for Text-Guided Sound Effect Generation


DOI: 10.12208/j.aics.20250004
Journal: Advances in International Computer Science
Year, volume (issue): 2025, 5(1)
Author: Miao Xiangyang (苗向阳)
Affiliation: Hunan University of Technology and Business, Changsha, Hunan

Abstract
Sound effects play an important role in games, film, and virtual reality: they convey events through sound and deepen the listener's sense of immersion. With advances in deep learning and the emergence of large language models, sound effect generation has seen major advances, particularly text-guided generation, which automatically produces scene-appropriate sound effects from a text description. However, existing generative models and methods still suffer from limited audio realism and weak text-audio relevance. To address these problems, this paper proposes a Feature Enhanced Diffusion Model (FEDM) that (1) uses the Haar wavelet transform for downsampling, which effectively preserves high-frequency feature information, and (2) introduces a multi-scale feature extraction module that captures multi-level features with convolution kernels of different sizes. Experiments on the AudioCaps dataset show that the proposed method improves the FAD and KL metrics by 33.3% and 18.1%, respectively, over the baseline model.
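The two components named in the abstract can be illustrated with a minimal NumPy sketch. The abstract gives no implementation details, so everything here is an assumption made for illustration: the function names, the choice to stack the four Haar sub-bands along the channel axis, the kernel sizes 1, 3 and 5, and the random weights. It is not the FEDM architecture itself, only an instance of Haar-based downsampling followed by parallel convolutions of different sizes.

```python
import numpy as np


def haar_downsample(x: np.ndarray) -> np.ndarray:
    """One-level 2D Haar transform used as a 2x downsampling step.

    x has shape (C, H, W) with even H and W. The result has shape
    (4*C, H/2, W/2): one low-pass band plus three detail bands, stacked
    along the channel axis so that high-frequency content is kept.
    """
    a = x[:, 0::2, 0::2]  # top-left sample of each 2x2 block
    b = x[:, 0::2, 1::2]  # top-right
    c = x[:, 1::2, 0::2]  # bottom-left
    d = x[:, 1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0     # block average (coarse content)
    det_w = (a - b + c - d) / 2.0   # difference along the width axis
    det_h = (a + b - c - d) / 2.0   # difference along the height axis
    det_hw = (a - b - c + d) / 2.0  # difference along both axes
    return np.concatenate([low, det_w, det_h, det_hw], axis=0)


def haar_upsample(y: np.ndarray) -> np.ndarray:
    """Exact inverse of haar_downsample."""
    c4, h, w = y.shape
    n = c4 // 4
    low, det_w, det_h, det_hw = y[:n], y[n:2 * n], y[2 * n:3 * n], y[3 * n:]
    x = np.empty((n, 2 * h, 2 * w), dtype=y.dtype)
    x[:, 0::2, 0::2] = (low + det_w + det_h + det_hw) / 2.0
    x[:, 0::2, 1::2] = (low - det_w + det_h - det_hw) / 2.0
    x[:, 1::2, 0::2] = (low + det_w - det_h - det_hw) / 2.0
    x[:, 1::2, 1::2] = (low - det_w - det_h + det_hw) / 2.0
    return x


def conv2d_same(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Zero-padded 'same' cross-correlation.

    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k.
    """
    c_out, _, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    h, wd = x.shape[1:]
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            # Mix input channels for kernel offset (i, j), then accumulate.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out


def multi_scale_features(x: np.ndarray, weights: list) -> np.ndarray:
    """Run parallel convolutions with different kernel sizes and concatenate."""
    return np.concatenate([conv2d_same(x, w) for w in weights], axis=0)


# Hypothetical usage: a 2-channel 8x8 feature map (e.g. a spectrogram latent).
rng = np.random.default_rng(0)
feat = rng.standard_normal((2, 8, 8))
down = haar_downsample(feat)  # shape (8, 4, 4)
kernels = [rng.standard_normal((4, 8, k, k)) for k in (1, 3, 5)]  # assumed sizes
multi = multi_scale_features(down, kernels)  # shape (12, 4, 4)
```

Because the Haar step is invertible, `haar_upsample(haar_downsample(x))` recovers `x`. Strided convolution or pooling would instead discard part of the signal, which is the motivation the abstract gives for wavelet-based downsampling.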
Keywords
Sound effect generation; text guidance; diffusion model; wavelet transform; multi-scale extraction
Pages: 18-22
Cite this article

Miao Xiangyang (苗向阳). A Feature-Enhanced Diffusion Model for Text-Guided Sound Effect Generation [J]. Advances in International Computer Science, 2025, 5(1): 18-22.
