Adaptive hybrid transformers for controllable audio synthesis via representation alignment and dynamic modality weighting
DOI:
https://doi.org/10.18372/2073-4751.85.21098
Keywords:
controllable audio synthesis, Foley generation, multimodal diffusion transformers, adaptive hybrid transformers, gated cross-attention (GCA), dynamic attention fusion (DAF), entropy-based modality weighting, representation alignment (REPA/iREPA), parameter-efficient fine-tuning, LoRA, MoE-LoRA, AuditEval-ssl
Abstract
This article proposes an adaptive hybrid transformer framework for controllable audio (Foley) synthesis that addresses the persistent “control gap” between user-intended perceptual attributes (e.g., pitch and intensity) and the characteristics realized in diffusion-based generative latent spaces. The method integrates three complementary mechanisms: Gated Cross-Attention (GCA), which stabilizes multimodal fusion and suppresses irrelevant visual tokens, mitigating attention collapse and attention-sink behavior; Dynamic Attention Fusion (DAF), which assigns context-dependent modality weights using normalized Shannon entropy as a differentiable reliability proxy, improving robustness under modality degradation (e.g., visual noise or vague prompts); and improved Representation Alignment (iREPA), which distills structural knowledge from frozen teacher encoders to accelerate convergence while preserving the spatial and temporal structure relevant to synchronization. For parameter-efficient controllability, the framework employs LoRA/MoE-LoRA adapters as functional control bases, enabling fine-grained manipulation of acoustic attributes with minimal additional parameters. Quantitative evaluation uses controllability-specific metrics (CSS/COI) and automated validation via AuditEval-ssl, demonstrating strong correlation with expert ratings and improved robustness in combined-noise scenarios.
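To make the gating mechanism concrete, the following is a minimal PyTorch sketch of gated cross-attention in the spirit of GCA: audio tokens query visual tokens, and a learned per-token sigmoid gate scales the cross-modal context before the residual connection, letting the model drive the contribution of irrelevant visual tokens toward zero. The layer sizes, gate parameterization, and residual placement are illustrative assumptions, not the implementation reported in the article.

import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Per-token gate in [0, 1]: suppresses cross-modal context that is
        # irrelevant to the query, mitigating attention-sink behavior.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_tokens, visual_tokens):
        # audio_tokens: (B, T_a, D) queries; visual_tokens: (B, T_v, D) keys/values
        ctx, _ = self.attn(audio_tokens, visual_tokens, visual_tokens)
        g = self.gate(audio_tokens)                 # (B, T_a, D)
        return self.norm(audio_tokens + g * ctx)   # gated residual fusion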
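The entropy-based weighting in DAF can be illustrated as follows: the normalized Shannon entropy of each modality's attention distribution serves as a differentiable reliability proxy, and sharper (lower-entropy) attention yields a larger fusion weight. The softmax temperature and the exact entropy-to-weight mapping below are assumptions made for the sketch.

import torch
import torch.nn.functional as F

def entropy_fusion_weights(attn_maps, tau: float = 1.0):
    # attn_maps: list of (B, T_q, T_kv) attention distributions, one per modality.
    scores = []
    for a in attn_maps:
        p = a.clamp_min(1e-9)
        h = -(p * p.log()).sum(-1)                               # (B, T_q) entropy
        h_norm = h / torch.log(torch.tensor(float(a.size(-1))))  # normalize to [0, 1]
        scores.append(-h_norm.mean(-1))                          # (B,): higher = more reliable
    # (B, M) modality weights; degraded modalities receive smaller weights.
    return F.softmax(torch.stack(scores, dim=-1) / tau, dim=-1)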
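The iREPA component can be sketched as a REPA-style alignment term that pulls projected diffusion-transformer hidden states toward features from a frozen teacher encoder via cosine similarity. The projection head and loss form follow the published REPA recipe in spirit and are not claimed to match the article's exact iREPA variant.

import torch.nn as nn
import torch.nn.functional as F

def repa_alignment_loss(hidden, teacher_feats, proj: nn.Module):
    # hidden: (B, N, D_model) diffusion transformer states.
    # teacher_feats: (B, N, D_teacher) features from a frozen teacher (no grad).
    z = proj(hidden)  # project student states into the teacher feature space
    return 1.0 - F.cosine_similarity(z, teacher_feats.detach(), dim=-1).mean()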
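Finally, the use of LoRA adapters as functional control bases can be illustrated with a standard low-rank adapter whose update is scaled by a user-facing knob, so an adapter trained for one acoustic attribute (e.g., intensity) can be dialed up or down at inference. The rank, initialization, and scalar-control interface are illustrative assumptions.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base                   # frozen pretrained projection
        self.base.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = 1.0                   # control knob for one acoustic attribute

    def forward(self, x):
        # Frozen projection plus a scaled low-rank update (the "control basis").
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)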
References
Wang J. Audio Palette: A Diffusion Transformer with Multi-Signal Conditioning for Controllable Foley Synthesis. arXiv preprint arXiv:2510.12175. 2025. URL: https://arxiv.org/abs/2510.12175.
Jia Y., Wang H., Nie X., Guo Y., Gao L., Qin Y. Towards Automatic Evaluation and High-Quality Pseudo-Parallel Dataset Construction for Audio Editing: A Human-in-the-Loop Method. arXiv preprint arXiv:2508.11966. 2025. URL: https://arxiv.org/abs/2508.11966.
Mai S., Zeng Y., Zheng S., Hu H. Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal Sentiment Analysis. IEEE Transactions on Affective Computing. 2021. Vol. 14. P. 2276–2289. URL: https://zhenglab.sjtu.edu.cn/uploadfile/ueditor/file/202406/17175674613c804a.pdf.
Wu Y. et al. LAION-AI/CLAP: Contrastive Language-Audio Pretraining. GitHub repository. 2023. URL: https://github.com/LAION-AI/CLAP.
Dinkel H., Yan Z., Wang T. et al. GLAP: General contrastive audio-text pretraining across domains and languages. arXiv preprint arXiv:2506.11350. 2025. URL: https://arxiv.org/abs/2506.11350.
Gated Cross-Attention in Neural Networks. Emergent Mind. 2025. URL: https://www.emergentmind.com/topics/gated-cross-attention.
Abdulhalim S., Albaghdadi M., Farazi M. Multi-Modal Sentiment Analysis with Dynamic Attention Fusion. arXiv preprint arXiv:2509.22729. 2025. URL: https://arxiv.org/abs/2509.22729.
Yu S., Kwak S., Jang H. et al. Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think. International Conference on Learning Representations (ICLR). 2025. URL: https://huggingface.co/papers/2410.06940.
Wang Y., He J., Wang D., Wang Q. Multimodal transformer with adaptive modality weighting for multimodal sentiment analysis. Neurocomputing. 2023. Vol. 572. URL: https://www.researchgate.net/publication/376895013_Multimodal_transformer_with_adaptive_modality_weighting_for_multimodal_sentiment_analysis.
Siriwardhana S., Kaluarachchi T., Billinghurst M., Nanayakkara S. Adaptive weighting in a transformer framework for multimodal emotion recognition. ResearchGate preprint. 2025. URL: https://www.researchgate.net/publication/397920846_Adaptive_weighting_in_a_transformer_framework_for_multimodal_emotion_recognition.
Yu S., Kwak S., Jang H. et al. What matters for Representation Alignment: Global Information or Spatial Structure? arXiv preprint arXiv:2512.10794. 2025. URL: https://arxiv.org/abs/2512.10794.
Wu G., Zhang S., Shi R. et al. Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think. arXiv preprint arXiv:2507.01467. 2025. URL: https://arxiv.org/abs/2507.01467.
Huan M., Shun J. Fine-Tuning Transformers Efficiently: A Survey on LoRA and Its Impact. Preprints.org. 2025. URL: https://www.preprints.org/manuscript/202502.1637.
Laakkonen J., Kukanov I., Hautamäki V. Mixture of Low-Rank Adapter Experts in Generalizable Audio Deepfake Detection. arXiv preprint arXiv:2509.13878. 2025. URL: https://arxiv.org/abs/2509.13878.
The Nam. Phi-4-multimodal - Mixture of LoRAs. Medium. 2025. URL: https://medium.com/@namnguyenthe/phi-4-multimodal-mixture-of-loras-85f640592b39.
Liu H., Wang J., Huang R. et al. FlashAudio: Rectified Flows for Fast and High-fidelity Text-to-Audio Generation. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL). 2025. P. 13694–13710. URL: https://aclanthology.org/2025.acl-long.673.pdf.
Liu H., Wang J., Luo K. et al. ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing. arXiv preprint arXiv:2506.21448. 2025. URL: https://arxiv.org/abs/2506.21448.
Fréchet Audio Distance (FAD). Emergent Mind. 2025. URL: https://www.emergentmind.com/topics/frechet-audio-distance-fad.
Shan S., Li Q., Cui Y. et al. HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation. arXiv preprint arXiv:2508.16930. 2025. URL: https://arxiv.org/abs/2508.16930.
Takahashi A., Takahashi S., Mitsufuji Y. MMAudioSep: Taming Video-to-Audio Generative Model towards Video/Text-Queried Sound Separation. arXiv preprint arXiv:2510.09065. 2025. URL: https://arxiv.org/abs/2510.09065.
Cheng H. K., Ishii M., Hayakawa A. et al. MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis. arXiv preprint arXiv:2412.15322. 2024. URL: https://arxiv.org/abs/2412.15322.
Dinkel H., Li G., Liu J. et al. MiDashengLM: Efficient Audio Understanding with General Audio Captions. arXiv preprint arXiv:2508.03983. 2025. URL: https://arxiv.org/abs/2508.03983.
Language-Based Audio Retrieval. DCASE Challenge. 2025. URL: https://dcase.community/challenge2025/task-language-based-audio-retrieval.
Yu J., Zhu L., Chi Y. et al. Technical Approach for the EMI Challenge in the 8th Affective Behavior Analysis in-the-Wild Competition. arXiv preprint arXiv:2503.10603. 2025. URL: https://arxiv.org/abs/2503.10603.
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
The scientific journal adheres to the principles of Open Access and provides free, immediate, and permanent access to all published materials without financial, technical, or legal barriers for readers.
All articles are published in Open Access under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Copyright
Authors who publish their works in the journal:
- retain the copyright to their publications;
- grant the journal the right of first publication of the article;
- agree to the distribution of their materials under the CC BY 4.0 license;
- have the right to reuse, archive, and distribute their works (including in institutional and subject repositories), provided that proper reference is made to the original publication in the journal.