Generating Structured Radiological Reports from Brain MRI Data Based on SigLIP2 Frozen Embeddings
DOI: https://doi.org/10.18372/1990-5548.88.20960
Keywords: medical imaging, report generation, brain MRI, SigLIP2, GPT-2, transfer learning, few-shot learning
Abstract
Automatic generation of clinical reports from medical images can reduce radiologists' workload and standardize documentation. In this paper, we investigate an approach to generating structured reports from brain MRI data using a pre-trained multimodal SigLIP2 model as a feature extractor. We propose an architecture in which visual embeddings obtained from a frozen SigLIP2 encoder are projected into the representation space of the GPT-2 language model for subsequent text generation. Experiments were conducted on the open-access BIOSE MRI dataset, which contains 34 "MRI image + clinical report" pairs. We show that the proposed approach generates semantically meaningful reports, achieving quality comparable to that of more complex architectures at substantially lower computational cost. Additionally, we investigate how pre-training SigLIP2 on a classification task (the Brain3-Anomaly-SigLIP2 variant) affects generation quality. The results demonstrate the potential of frozen vision encoders for medical generative tasks under data-scarce conditions.
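The core architectural idea described in the abstract — keeping the vision encoder frozen and training only a lightweight projection into the language model's embedding space — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding dimensions (1152 for SigLIP2, 768 for GPT-2), the prefix length, and the `VisualPrefixProjector` module name are all assumptions chosen for the example.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: SigLIP2 pooled image embedding (1152-d)
# and GPT-2 hidden size (768-d); prefix length is illustrative.
SIGLIP_DIM, GPT2_DIM, PREFIX_LEN = 1152, 768, 4

class VisualPrefixProjector(nn.Module):
    """Maps a frozen vision embedding to a sequence of prefix vectors
    in the language model's representation space. Only this module is
    trained; the vision encoder stays frozen."""
    def __init__(self, vis_dim: int = SIGLIP_DIM, txt_dim: int = GPT2_DIM,
                 prefix_len: int = PREFIX_LEN):
        super().__init__()
        self.prefix_len = prefix_len
        self.txt_dim = txt_dim
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, txt_dim * prefix_len),
            nn.Tanh(),
        )

    def forward(self, vis_emb: torch.Tensor) -> torch.Tensor:
        # vis_emb: (batch, vis_dim) pooled embedding from the frozen encoder
        out = self.proj(vis_emb)
        return out.view(-1, self.prefix_len, self.txt_dim)

# Stand-in for a frozen SigLIP2 pooled embedding (no gradients to the encoder).
image_embedding = torch.randn(2, SIGLIP_DIM)
projector = VisualPrefixProjector()
prefix = projector(image_embedding)
print(prefix.shape)  # torch.Size([2, 4, 768])
```

In practice, the resulting prefix tensor would be concatenated with the report's token embeddings and passed to GPT-2 (e.g. via `inputs_embeds`), so that the language model conditions its generation on the image while the encoder's weights remain untouched.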
References
T. Noor Rahman, T. Paul, T. Zarin Tasnim, et al. “BIOSE MRI: A Multimodal Brain MRI Dataset with Clinical Findings for Neuroimaging Research,” Mendeley Data, vol. 2, 2025. https://doi.org/10.17632/9mcp5pbtbr.2
Z. Wang, Z. Wu, D. Agarwal, and J. Sun, “MedCLIP: Contrastive learning from unpaired medical images and text,” In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 3876–3887. https://doi.org/10.18653/v1/2022.emnlp-main.256
A. Radford, J. W. Kim, C. Hallacy, et al., “Learning transferable visual models from natural language supervision,” In International Conference on Machine Learning, 2021, pp. 8748–8763. PMLR.
Y. Zhang, H. Jiang, Y. Miura, C. D. Manning, and C. P. Langlotz, “Contrastive Learning of Medical Visual Representations from Paired Images and Text,” In Machine Learning for Healthcare (MLHC), pp. 123–138, 2023.
S. C. Huang, L. Shen, M. P. Lungren, and S. Yeung, “GLoRIA: A multimodal global-local representation learning framework for label-efficient medical image recognition,” In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 3942–3951. https://doi.org/10.1109/ICCV48922.2021.00391
X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. https://doi.org/10.1109/ICCV51070.2023.01100
Google Research, “SigLIP2: Improved Vision-Language Pretraining with Dense Features,” Technical Report, 2024.
Hugging Face, “Brain3-Anomaly-SigLIP2: Fine-tuned classification model for brain anomalies,” 2025. https://huggingface.co/models
K. You, J. Gu, J. Ham, et al., “CXR-CLIP: Toward large scale chest x-ray language-image pre-training,” In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2023, pp. 101–111. Springer. https://doi.org/10.1007/978-3-031-43895-0_10
C. Zhang, et al., “MEDBind: Unifying Language and Multimodal Medical Data Embeddings,” In Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. Springer, 2024.
“MedVAG: Medical Visual Answer Generation,” Technical Report, 2024.
“AIM-X: Attention-based Interpretable Medical Report Generation,” Technical Report, 2024.
“AutoRG-Brain: Automated Report Generation for Brain MRI,” Technical Report, 2024.
I. Lopez, F. N. Haredasht, K. Caoili, et al., “Embedding-Driven Diversity Sampling to Improve Few-Shot Synthetic Data Generation,” arXiv preprint arXiv:2501.11199, 2025.
E. Frayling, J. Lever, and G. McDonald, “Zero-shot and Few-shot Generation Strategies for Artificial Clinical Records,” arXiv preprint arXiv:2403.08664, 2024.
A. E. Johnson, T. J. Pollard, N. R. Greenbaum, et al., “MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs,” arXiv preprint arXiv:1901.07042, 2019. https://doi.org/10.1038/s41597-019-0322-0
A. Radford, J. Wu, R. Child, et al., “Language models are unsupervised multitask learners,” OpenAI Blog, 2019.
I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” In International Conference on Learning Representations (ICLR), 2019.
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
The scientific journal “Electronics and control systems” adheres to the principles of Open Access and provides readers with free, immediate, and permanent access to all published materials, without financial, technical, or legal barriers.
All articles are published in Open Access under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Copyright
Authors who publish their works in the journal “Electronics and control systems”:
- retain the copyright to their publications;
- grant the journal the right of first publication of the article;
- agree to the distribution of their materials under the CC BY 4.0 license;
- have the right to reuse, archive, and distribute their works (including in institutional and subject repositories), provided that proper reference is made to the original publication in the journal.