
A Hybrid LLM–Knowledge Graph Framework for Accurate Biomedical Question Answering

Abstract

Biomedical question answering requires accurate and interpretable systems; however, existing approaches often suffer from language model hallucinations or from the limited reasoning capacity of standalone knowledge graphs. To address these limitations, this study proposes a hybrid framework that integrates the LLaMA-3B language model with a Neo4j-based drug–disease–symptom knowledge graph. The system translates natural language questions into executable Cypher queries, operates on an iBKH-derived graph comprising over 65,000 entities and 3 million relationships, and returns answers with supporting evidence through a transparent interface. Experiments on 60 biomedical questions across three levels of difficulty demonstrate the robustness of the approach: 96% exact match for simple queries, 95% for medium queries, and 86.7% for complex queries. Across all queries, the system achieves Precision@5 of 96.1%, Recall@5 of 89.0%, F1@5 of 91.0%, Hits@k of 96.1%, and an MRR of 94.4%, while maintaining an average response time of 6.07 seconds. These results indicate that the system retrieves nearly all relevant answers, ranks them correctly, and delivers them with latency low enough for interactive use. Moreover, unlike cloud-based APIs such as ChatGPT, which require internet connectivity and external data transmission, the proposed framework operates fully offline, ensuring privacy, reproducibility, and compliance with biomedical data governance. Overall, this pipeline provides an accurate, efficient, and privacy-preserving solution for biomedical question answering, making it a practical alternative to cloud-dependent approaches in sensitive healthcare contexts.
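To make the question-to-Cypher step concrete, the sketch below maps a natural language question to a parameterized Cypher query over a drug–disease–symptom graph. This is a minimal, hypothetical template-based illustration: the paper's system generates Cypher with LLaMA-3B, and the relationship labels (`TREATS`, `HAS_SYMPTOM`) and node labels are assumptions, not necessarily the iBKH schema.

```python
# Hypothetical templates over an assumed Drug/Disease/Symptom schema;
# the actual system generates Cypher dynamically with LLaMA-3B.
CYPHER_TEMPLATES = {
    "drugs_for_disease": (
        "MATCH (d:Drug)-[:TREATS]->(dis:Disease {name: $disease}) "
        "RETURN d.name AS drug LIMIT 5"
    ),
    "symptoms_of_disease": (
        "MATCH (dis:Disease {name: $disease})-[:HAS_SYMPTOM]->(s:Symptom) "
        "RETURN s.name AS symptom LIMIT 5"
    ),
}

def question_to_cypher(question: str):
    """Map a question to a (Cypher, parameters) pair, or (None, {})."""
    q = question.lower().rstrip("?").strip()
    # Crude intent/entity extraction for illustration only; the real
    # pipeline delegates this to the language model.
    if q.startswith("what drugs treat "):
        disease = q[len("what drugs treat "):]
        return CYPHER_TEMPLATES["drugs_for_disease"], {"disease": disease}
    if q.startswith("what are the symptoms of "):
        disease = q[len("what are the symptoms of "):]
        return CYPHER_TEMPLATES["symptoms_of_disease"], {"disease": disease}
    return None, {}
```

Passing parameters (`$disease`) separately, rather than splicing entity names into the query string, lets Neo4j cache query plans and avoids Cypher injection; the query and parameters would then be sent to the database via a driver session.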

Keywords

Knowledge Graph, LLM, Question Answering, Neo4j, Biomedical Informatics, Healthcare AI, LLaMA 3

