Generating Structured Radiological Reports from Brain MRI Data Based on SigLIP2 Frozen Embeddings
DOI: https://doi.org/10.18372/1990-5548.88.20960
Keywords: medical imaging, report generation, brain MRI, SigLIP2, GPT-2, transfer learning, few-shot learning
Abstract
Automatic generation of clinical reports from medical images can reduce radiologists' workload and standardize documentation. In this paper, we investigate an approach to generating structured reports from brain MRI data using a pre-trained multimodal SigLIP2 model as a feature extractor. We propose an architecture in which visual embeddings obtained from a frozen SigLIP2 encoder are projected into the representation space of the GPT-2 language model for subsequent text generation. Experiments were conducted on the open-access BIOSE MRI dataset, which contains 34 "MRI image + clinical report" pairs. We show that the proposed approach generates semantically meaningful reports, achieving quality comparable to that of more complex architectures at substantially lower computational cost. Additionally, we investigate how pre-training SigLIP2 on a classification task (the Brain3-Anomaly-SigLIP2 variant) affects generation quality. The results demonstrate the potential of frozen vision encoders for medical generative tasks under data-scarce conditions.
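The core architectural idea described in the abstract — keeping the vision encoder frozen and training only a lightweight projection into the language model's embedding space — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding dimensions (1152 for SigLIP2, 768 for GPT-2), the prefix length, and the `VisualPrefixProjector` module name are all assumptions chosen for the example.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: SigLIP2 pooled image embedding (1152-d)
# and GPT-2 hidden size (768-d); prefix length is illustrative.
SIGLIP_DIM, GPT2_DIM, PREFIX_LEN = 1152, 768, 4

class VisualPrefixProjector(nn.Module):
    """Maps a frozen vision embedding to a sequence of prefix vectors
    in the language model's representation space. Only this module is
    trained; the vision encoder stays frozen."""
    def __init__(self, vis_dim: int = SIGLIP_DIM, txt_dim: int = GPT2_DIM,
                 prefix_len: int = PREFIX_LEN):
        super().__init__()
        self.prefix_len = prefix_len
        self.txt_dim = txt_dim
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, txt_dim * prefix_len),
            nn.Tanh(),
        )

    def forward(self, vis_emb: torch.Tensor) -> torch.Tensor:
        # vis_emb: (batch, vis_dim) pooled embedding from the frozen encoder
        out = self.proj(vis_emb)
        return out.view(-1, self.prefix_len, self.txt_dim)

# Stand-in for a frozen SigLIP2 pooled embedding (no gradients to the encoder).
image_embedding = torch.randn(2, SIGLIP_DIM)
projector = VisualPrefixProjector()
prefix = projector(image_embedding)
print(prefix.shape)  # torch.Size([2, 4, 768])
```

In practice, the resulting prefix tensor would be concatenated with the report's token embeddings and passed to GPT-2 (e.g. via `inputs_embeds`), so that the language model conditions its generation on the image while the encoder's weights remain untouched.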
References
T. Noor Rahman, T. Paul, T. Zarin Tasnim, et al. “BIOSE MRI: A Multimodal Brain MRI Dataset with Clinical Findings for Neuroimaging Research,” Mendeley Data, vol. 2, 2025. https://doi.org/10.17632/9mcp5pbtbr.2
Z. Wang, Z. Wu, D. Agarwal, and J. Sun, “MedCLIP: Contrastive learning from unpaired medical images and text,” In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 3876–3887. https://doi.org/10.18653/v1/2022.emnlp-main.256
A. Radford, J. W. Kim, C. Hallacy, et al., “Learning transferable visual models from natural language supervision,” In International Conference on Machine Learning, 2021, pp. 8748–8763. PMLR.
Y. Zhang, H. Jiang, Y. Miura, C. D. Manning, and C. P. Langlotz, “Contrastive Learning of Medical Visual Representations from Paired Images and Text,” In Machine Learning for Healthcare (MLHC), pp. 123–138, 2023.
S. C. Huang, L. Shen, M. P. Lungren, and S. Yeung, “GLoRIA: A multimodal global-local representation learning framework for label-efficient medical image recognition,” In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 3942–3951. https://doi.org/10.1109/ICCV48922.2021.00391
X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. https://doi.org/10.1109/ICCV51070.2023.01100
Google Research, “SigLIP2: Improved Vision-Language Pretraining with Dense Features,” Technical Report, 2024.
Hugging Face, “Brain3-Anomaly-SigLIP2: Fine-tuned classification model for brain anomalies,” 2025. https://huggingface.co/models
K. You, J. Gu, J. Ham, et al., “CXR-CLIP: Toward large scale chest x-ray language-image pre-training,” In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2023, pp. 101–111. Springer. https://doi.org/10.1007/978-3-031-43895-0_10
C. Zhang, et al., “MEDBind: Unifying Language and Multimodal Medical Data Embeddings,” In Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. Springer, 2024.
“MedVAG: Medical Visual Answer Generation,” Technical Report, 2024.
“AIM-X: Attention-based Interpretable Medical Report Generation,” Technical Report, 2024.
“AutoRG-Brain: Automated Report Generation for Brain MRI,” Technical Report, 2024.
I. Lopez, F. N. Haredasht, K. Caoili, et al., “Embedding-Driven Diversity Sampling to Improve Few-Shot Synthetic Data Generation,” arXiv preprint arXiv:2501.11199, 2025.
E. Frayling, J. Lever, and G. McDonald, “Zero-shot and Few-shot Generation Strategies for Artificial Clinical Records,” arXiv preprint arXiv:2403.08664, 2024.
A. E. Johnson, T. J. Pollard, N. R. Greenbaum, et al., “MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs,” arXiv preprint arXiv:1901.07042, 2019. https://doi.org/10.1038/s41597-019-0322-0
A. Radford, J. Wu, R. Child, et al., “Language models are unsupervised multitask learners,” OpenAI Blog, 2019.
I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” In International Conference on Learning Representations (ICLR), 2019.
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
The scientific journal “Electronics and control systems” adheres to the principles of Open Access and provides readers with free, immediate, and permanent access to all published materials, without financial, technical, or legal barriers.
All articles are published in Open Access under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Copyright
Authors who publish their works in the journal “Electronics and control systems”:
- retain the copyright to their publications;
- grant the journal the right of first publication of the article;
- agree to the distribution of their materials under the CC BY 4.0 license;
- have the right to reuse, archive, and distribute their works (including in institutional and subject repositories), provided that proper reference is made to the original publication in the journal.