
Federated Vision-Language Models for Privacy-Preserving Medical Image Analysis

Abstract

Deep learning has advanced medical image analysis, but privacy constraints and inter-institutional variation limit its large-scale clinical deployment. We present FedVLM, a federated vision–language model tailored to privacy-preserving multimodal medical image analysis. Unlike conventional federated designs, which handle only single-modality image data, FedVLM jointly processes paired radiological images and clinical reports and achieves strong zero-shot and few-shot diagnostic performance. The framework combines secure aggregation, differential privacy, and proximal optimization to protect patient data and reduce variability across sites. Extensive experiments on the NIH ChestX-ray14, MIMIC-CXR, and BraTS datasets show that FedVLM is accurate and interpretable, approaching the performance of centralized vision–language models without compromising privacy. Building on prior work such as FACMIC, BioViL, and FAA-CLIP, FedVLM introduces privacy-aware optimization, proximal regularization for heterogeneous data, and multimodal contrastive alignment, yielding a unified federated framework for transparent and secure medical image analysis. Although FedVLM shows promising performance, this work remains at the research stage and is not ready for clinical use; validation in future multi-institutional clinical studies is still required. A rough sketch of how the described components could fit together in one client round is given below.
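To make the abstract's ingredients concrete, the following PyTorch sketch shows how a CLIP-style contrastive loss over paired image and report embeddings, a FedProx-style proximal term, and noise-perturbed clipped updates could be combined in a single client round. This is a minimal, hypothetical sketch, not the paper's implementation: the model interface (`model(images, texts)` returning paired embeddings), the data loader format, and the parameters `mu`, `clip_norm`, and `noise_std` are illustrative assumptions, and the secure aggregation of the returned updates on the server side is not shown.

```python
# Hypothetical sketch of one FedVLM-style client round: contrastive
# image-text alignment + FedProx proximal term + clipped, noised update.
# Names and interfaces are illustrative, not the paper's actual API.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def local_update(model, global_model, loader, mu=0.01, lr=1e-4,
                 clip_norm=1.0, noise_std=0.0, epochs=1):
    """One client's training round; returns a clipped, noised model delta."""
    global_params = [p.detach().clone() for p in global_model.parameters()]
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, texts in loader:                       # assumed (image, report) batches
            img_emb, txt_emb = model(images, texts)        # assumed paired embeddings
            loss = clip_contrastive_loss(img_emb, txt_emb)
            # FedProx-style proximal term keeps local weights near the global
            # model, mitigating drift under non-IID client data.
            prox = sum(((p - g) ** 2).sum()
                       for p, g in zip(model.parameters(), global_params))
            (loss + 0.5 * mu * prox).backward()
            opt.step()
            opt.zero_grad()
    # Clipping plus Gaussian noise on the model delta is a simplified stand-in
    # for the differential-privacy mechanism described in the abstract.
    delta = [p.detach() - g for p, g in zip(model.parameters(), global_params)]
    total_norm = torch.sqrt(sum((d ** 2).sum() for d in delta))
    scale = torch.clamp(clip_norm / (total_norm + 1e-12), max=1.0)
    return [d * scale + noise_std * torch.randn_like(d) for d in delta]
```

In a full system, the server would combine these noised client deltas through a secure aggregation protocol (in the spirit of Bonawitz et al.) and apply the aggregate to the global model before the next communication round.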

Keywords

Medical Image Analysis, Federated Learning, Vision–Language Models, Privacy-Preserving AI, Clinical Decision Support


References

  1. G. Litjens et al., “A survey on deep learning in medical image analysis,” Medical Image Analysis, vol. 42, pp. 60–88, 2017.
  2. M. J. Sheller, B. Edwards, G. A. Reina, J. Martin, S. Pati, A. Kotrotsou, M. Milchenko, W. Xu, D. Marcus, R. R. Colen, and S. Bakas, “Multi-institutional deep learning modeling without sharing patient data: a feasibility study on brain tumor segmentation,” in International MICCAI Brainlesion Workshop. Springer, 2018, pp. 92–104.
  3. N. Rieke et al., “The future of digital health with federated learning,” NPJ Digital Medicine, vol. 3, no. 1, p. 119, 2020.
  4. Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-iid data,” arXiv preprint arXiv:1806.00582, 2018.
  5. T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, 2020.
  6. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
  7. J. Li, D. Li, C. Xiong, and S. C. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International Conference on Machine Learning. PMLR, 2022, pp. 12888–12900.
  8. Y. Zhang et al., “Contrastive learning of medical visual representations from paired images and text,” arXiv preprint arXiv:2010.00747, 2022.
  9. B. Boecking et al., “Making the most of text semantics to improve biomedical vision–language processing,” arXiv preprint arXiv:2204.09817, 2022.
  10. X. Wu et al., “FACMIC: Federated adaptive CLIP for medical image classification,” in MICCAI, 2024.
  11. X. Wu et al., “FAA-CLIP: Federated adaptive attention for medical vision–language models,” IEEE JBHI, 2025.
  12. B. Boecking, Y. Zhang et al., “Making the most of text semantics to improve biomedical vision–language processing,” arXiv preprint arXiv:2204.09817, 2022.
  13. I. Dayan, H. R. Roth, A. Zhong et al., “Federated learning for predicting clinical outcomes in patients with covid-19,” Nature Medicine, vol. 27, no. 10, pp. 1735–1743, 2021.
  14. S. Pati et al., “Federated learning enables big data for rare cancer boundary detection,” in MICCAI, 2021, pp. 702–713.
  15. S. Pati et al., “The federated tumor segmentation (FeTS) tool: an open-source solution for multi-institutional collaboration,” NeuroImage, vol. 258, p. 119308, 2022.
  16. X. Luo et al., “Influence of inter-site distribution shifts on federated learning in medical imaging,” Radiology: Artificial Intelligence, vol. 5, no. 4, p. e220268, 2023.
  17. J. Manthe et al., “Federated learning benchmark for multi-site radiology segmentation,” Medical Image Analysis, vol. 91, p. 103950, 2024.
  18. M. Rehman et al., “Federated learning for medical image analysis: A review,” Artificial Intelligence in Medicine, vol. 143, p. 102611, 2023.
  19. N. Teo et al., “Systematic review of federated learning in medical imaging,” Computerized Medical Imaging and Graphics, vol. 112, p. 102168, 2024.
  20. Y. Zhang et al., “Contrastive learning of medical visual representations from paired images and text,” in NeurIPS, 2020.
  21. X. Wang et al., “Medclip: Contrastive learning of medical visual representations from paired images and text,” in EMNLP, 2022.
  22. E. Tiu et al., “Expert-level detection of pathologies from unannotated chest x-ray reports using self-supervised learning,” Nature Biomedical Engineering, vol. 6, pp. 1399–1406, 2022.
  23. Z. Lin et al., “Pmc-clip: Contrastive learning on biomedical literature and images,” Bioinformatics, vol. 39, no. 2, p. btad012, 2023.
  24. Y. Zhang et al., “Knowledge-enhanced vision–language pretraining for radiology,” Nature Communications, vol. 14, no. 1, p. 5674, 2023.
  25. J. Hartsock et al., “Vision–language models for radiology report generation and retrieval,” Medical Image Analysis, vol. 93, p. 103995, 2024.
  26. S. Ryu et al., “A systematic review of vision–language models in medical imaging,” IEEE Transactions on Medical Imaging, 2025.
  27. Y. Chen et al., “Vision–language foundation models for medicine: a survey,” arXiv preprint arXiv:2309.00000, 2023.
  28. H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in AISTATS, 2017, pp. 1273–1282.
  29. T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” in MLSys, 2020.
  30. G. A. Kaissis, M. R. Makowski, D. Rückert, and R. F. Braren, “Secure, privacy-preserving and federated machine learning in medical imaging,” Nature Machine Intelligence, vol. 2, no. 6, pp. 305–311, 2020.
  31. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2016, pp. 770–778.
  32. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations (ICLR), 2021.
  33. J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, “Biobert: a pre-trained biomedical language representation model for biomedical text mining,” Bioinformatics, vol. 36, no. 4, pp. 1234–1240, 2020.
  34. E. Alsentzer, J. Murphy, W. Boag, W.-H. Weng, D. Jindi, T. Naumann, and M. McDermott, “Publicly available clinical bert embeddings,” in Proceedings of the 2nd Clinical Natural Language Processing Workshop. ACL, 2019, pp. 72–78.
  35. K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth, “Practical secure aggregation for privacy-preserving machine learning,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017, pp. 1175–1191.
  36. X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers, “Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2097–2106.
  37. A. E. W. Johnson, T. J. Pollard, N. R. Greenbaum, M. P. Lungren, C.-y. Deng, Y. Peng, Z. Lu, R. G. Mark, S. J. Berkowitz, and S. Horng, “Mimic-cxr: A large publicly available database of labeled chest radiographs,” arXiv preprint arXiv:1901.07042, 2019.
  38. B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest et al., “The multimodal brain tumor image segmentation benchmark (brats),” IEEE Transactions on Medical Imaging, vol. 34, no. 10, pp. 1993–2024, 2015.
  39. S. Bakas, M. Reyes, A. Jakab, S. Bauer, M. Rempfler, A. Crimi, R. T. Shinohara, C. Berger, S. H. Ha, M. Rozycki et al., “Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the brats challenge,” Medical Image Analysis, vol. 55, pp. 254–268, 2019.
  40. A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 8026–8037.
  41. R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 618–626.

