Adaptive hybrid transformers for controllable audio synthesis via representation alignment and dynamic modality weighting
DOI:
https://doi.org/10.18372/2073-4751.85.21098
Keywords:
controllable audio synthesis, Foley generation, multimodal diffusion transformers, adaptive hybrid transformers, gated cross-attention (GCA), dynamic attention fusion (DAF), entropy-based modality weighting, representation alignment (REPA/iREPA), parameter-efficient fine-tuning, LoRA, MoE-LoRA, AuditEval-ssl
Abstract
This article proposes an adaptive hybrid transformer framework for controllable audio (Foley) synthesis that addresses the persistent “control gap” between user-intended perceptual attributes (e.g., pitch and intensity) and the characteristics realized in diffusion-based generative latent spaces. The method integrates three complementary mechanisms: Gated Cross-Attention (GCA), which stabilizes multimodal fusion and suppresses irrelevant visual tokens, mitigating attention collapse and attention-sink behavior; Dynamic Attention Fusion (DAF), which assigns context-dependent modality weights using normalized Shannon entropy as a differentiable reliability proxy, improving robustness under modality degradation (e.g., visual noise or vague prompts); and improved Representation Alignment (iREPA), which distills structural knowledge from frozen teacher encoders to accelerate convergence while preserving the spatial and temporal structure relevant to synchronization. For parameter-efficient controllability, the framework employs LoRA/MoE-LoRA adapters as functional control bases, enabling fine-grained manipulation of acoustic attributes with minimal additional parameters. Quantitative evaluation uses controllability-specific metrics (CSS/COI) and automated validation via AuditEval-ssl, demonstrating strong correlation with expert ratings and improved robustness in combined-noise scenarios.
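To make the gating mechanism concrete, the following is a minimal PyTorch sketch of gated cross-attention in the spirit of GCA: audio tokens query visual tokens, and a learned per-token sigmoid gate scales the cross-modal context before the residual connection, letting the model drive the contribution of irrelevant visual tokens toward zero. The layer sizes, gate parameterization, and residual placement are illustrative assumptions, not the implementation reported in the article.

import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Per-token gate in [0, 1]: suppresses cross-modal context that is
        # irrelevant to the query, mitigating attention-sink behavior.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_tokens, visual_tokens):
        # audio_tokens: (B, T_a, D) queries; visual_tokens: (B, T_v, D) keys/values
        ctx, _ = self.attn(audio_tokens, visual_tokens, visual_tokens)
        g = self.gate(audio_tokens)                 # (B, T_a, D)
        return self.norm(audio_tokens + g * ctx)   # gated residual fusion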
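The entropy-based weighting in DAF can be illustrated as follows: the normalized Shannon entropy of each modality's attention distribution serves as a differentiable reliability proxy, and sharper (lower-entropy) attention yields a larger fusion weight. The softmax temperature and the exact entropy-to-weight mapping below are assumptions made for the sketch.

import torch
import torch.nn.functional as F

def entropy_fusion_weights(attn_maps, tau: float = 1.0):
    # attn_maps: list of (B, T_q, T_kv) attention distributions, one per modality.
    scores = []
    for a in attn_maps:
        p = a.clamp_min(1e-9)
        h = -(p * p.log()).sum(-1)                               # (B, T_q) entropy
        h_norm = h / torch.log(torch.tensor(float(a.size(-1))))  # normalize to [0, 1]
        scores.append(-h_norm.mean(-1))                          # (B,): higher = more reliable
    # (B, M) modality weights; degraded modalities receive smaller weights.
    return F.softmax(torch.stack(scores, dim=-1) / tau, dim=-1)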
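The iREPA component can be sketched as a REPA-style alignment term that pulls projected diffusion-transformer hidden states toward features from a frozen teacher encoder via cosine similarity. The projection head and loss form follow the published REPA recipe in spirit and are not claimed to match the article's exact iREPA variant.

import torch.nn as nn
import torch.nn.functional as F

def repa_alignment_loss(hidden, teacher_feats, proj: nn.Module):
    # hidden: (B, N, D_model) diffusion transformer states.
    # teacher_feats: (B, N, D_teacher) features from a frozen teacher (no grad).
    z = proj(hidden)  # project student states into the teacher feature space
    return 1.0 - F.cosine_similarity(z, teacher_feats.detach(), dim=-1).mean()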
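Finally, the use of LoRA adapters as functional control bases can be illustrated with a standard low-rank adapter whose update is scaled by a user-facing knob, so an adapter trained for one acoustic attribute (e.g., intensity) can be dialed up or down at inference. The rank, initialization, and scalar-control interface are illustrative assumptions.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base                   # frozen pretrained projection
        self.base.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = 1.0                   # control knob for one acoustic attribute

    def forward(self, x):
        # Frozen projection plus a scaled low-rank update (the "control basis").
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)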
References
Wang J. Audio Palette: A Diffusion Transformer with Multi-Signal Conditioning for Controllable Foley Synthesis. arXiv preprint arXiv:2510.12175. 2025. URL: https://arxiv.org/abs/2510.12175.
Jia Y., Wang H., Nie X., Guo Y., Gao L., Qin Y. Towards Automatic Evaluation and High-Quality Pseudo-Parallel Dataset Construction for Audio Editing: A Human-in-the-Loop Method. arXiv preprint arXiv:2508.11966. 2025. URL: https://arxiv.org/abs/2508.11966.
Mai S., Zeng Y., Zheng S., Hu H. Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal Sentiment Analysis. IEEE Transactions on Affective Computing. 2021. Vol. 14. P. 2276–2289. URL: https://zhenglab.sjtu.edu.cn/uploadfile/ueditor/file/202406/17175674613c804a.pdf.
Wu Y. et al. LAION-AI/CLAP: Contrastive Language-Audio Pretraining. GitHub repository. 2023. URL: https://github.com/LAION-AI/CLAP.
Dinkel H., Yan Z., Wang T. et al. GLAP: General contrastive audio-text pretraining across domains and languages. arXiv preprint arXiv:2506.11350. 2025. URL: https://arxiv.org/abs/2506.11350.
Gated Cross-Attention in Neural Networks. Emergent Mind. 2025. URL: https://www.emergentmind.com/topics/gated-cross-attention.
Abdulhalim S., Albaghdadi M., Farazi M. Multi-Modal Sentiment Analysis with Dynamic Attention Fusion. arXiv preprint arXiv:2509.22729. 2025. URL: https://arxiv.org/abs/2509.22729.
Yu S., Kwak S., Jang H. et al. Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think. International Conference on Learning Representations (ICLR). 2025. URL: https://huggingface.co/papers/2410.06940.
Wang Y., He J., Wang D., Wang Q. Multimodal transformer with adaptive modality weighting for multimodal sentiment analysis. Neurocomputing. 2023. Vol. 572. URL: https://www.researchgate.net/publication/376895013_Multimodal_transformer_with_adaptive_modality_weighting_for_multimodal_sentiment_analysis.
Siriwardhana S., Kaluarachchi T., Billinghurst M., Nanayakkara S. Adaptive weighting in a transformer framework for multimodal emotion recognition. ResearchGate preprint. 2025. URL: https://www.researchgate.net/publication/397920846_Adaptive_weighting_in_a_transformer_framework_for_multimodal_emotion_recognition.
Yu S., Kwak S., Jang H. et al. What matters for Representation Alignment: Global Information or Spatial Structure? arXiv preprint arXiv:2512.10794. 2025. URL: https://arxiv.org/abs/2512.10794.
Wu G., Zhang S., Shi R. et al. Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think. arXiv preprint arXiv:2507.01467. 2025. URL: https://arxiv.org/abs/2507.01467.
Huan M., Shun J. Fine-Tuning Transformers Efficiently: A Survey on LoRA and Its Impact. Preprints.org. 2025. URL: https://www.preprints.org/manuscript/202502.1637.
Laakkonen J., Kukanov I., Hautamäki V. Mixture of Low-Rank Adapter Experts in Generalizable Audio Deepfake Detection. arXiv preprint arXiv:2509.13878. 2025. URL: https://arxiv.org/abs/2509.13878.
The Nam. Phi-4-multimodal - Mixture of LoRAs. Medium. 2025. URL: https://medium.com/@namnguyenthe/phi-4-multimodal-mixture-of-loras-85f640592b39.
Liu H., Wang J., Huang R. et al. FlashAudio: Rectified Flows for Fast and High-fidelity Text-to-Audio Generation. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL). 2025. P. 13694–13710. URL: https://aclanthology.org/2025.acl-long.673.pdf.
Liu H., Wang J., Luo K. et al. ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing. arXiv preprint arXiv:2506.21448. 2025. URL: https://arxiv.org/abs/2506.21448.
Fréchet Audio Distance (FAD). Emergent Mind. 2025. URL: https://www.emergentmind.com/topics/frechet-audio-distance-fad.
Shan S., Li Q., Cui Y. et al. HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation. arXiv preprint arXiv:2508.16930. 2025. URL: https://arxiv.org/abs/2508.16930.
Takahashi A., Takahashi S., Mitsufuji Y. MMAudioSep: Taming Video-to-Audio Generative Model towards Video/Text-Queried Sound Separation. arXiv preprint arXiv:2510.09065. 2025. URL: https://arxiv.org/abs/2510.09065.
Cheng H. K., Ishii M., Hayakawa A. et al. MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis. arXiv preprint arXiv:2412.15322. 2024. URL: https://arxiv.org/abs/2412.15322.
Dinkel H., Li G., Liu J. et al. MiDashengLM: Efficient Audio Understanding with General Audio Captions. arXiv preprint arXiv:2508.03983. 2025. URL: https://arxiv.org/abs/2508.03983.
Language-Based Audio Retrieval. DCASE Challenge. 2025. URL: https://dcase.community/challenge2025/task-language-based-audio-retrieval.
Yu J., Zhu L., Chi Y. et al. Technical Approach for the EMI Challenge in the 8th Affective Behavior Analysis in-the-Wild Competition. arXiv preprint arXiv:2503.10603. 2025. URL: https://arxiv.org/abs/2503.10603.
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
The scientific journal adheres to the principles of Open Access and provides free, immediate, and permanent access to all published materials without financial, technical, or legal barriers for readers.
All articles are published in Open Access under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Copyright
Authors who publish their works in the journal:
- retain the copyright to their publications;
- grant the journal the right of first publication of the article;
- agree to the distribution of their materials under the CC BY 4.0 license;
- have the right to reuse, archive, and distribute their works (including in institutional and subject repositories), provided that proper reference is made to the original publication in the journal.