Methods in Pharmacoepidemiology / Analytical Methods

3D-06 - Benchmarking Large Language Models and Optical Character Recognition to Enhance Network Meta-Analyses

Monday, August 25, 2025

9:30 AM - 9:45 AM ET

Location: Ballroom AB

Presenting Author(s)

Andy Dang

Eli Lilly and Company, United States

Co-Author(s)

Ameeta Agrawal

Portland State University, United States

Mingyang Shan

Eli Lilly and Company, United States

Background: Network meta-analyses (NMA) are routinely used to compare the relative efficacy and safety of multiple treatments to guide market access and reimbursement of treatments. Curating the data for NMA from systematic literature review (SLR) has historically been an extremely time-consuming process, but recent advancements in artificial intelligence (AI) and large language models (LLMs) have the potential to replace manual SLR and provide automated estimates of pairwise treatment effects.

Objectives: The objective of this research is to develop and evaluate a multi-stage framework to enhance the automation of data curation and downstream network meta-analysis.

Methods: The framework first applied state-of-the-art LLMs (e.g., OpenAI GPT-4o, Google Gemini 2.0 Pro, Claude 3.5 Sonnet) to facilitate the SLR and extract key study PICOS (population, intervention, comparator, outcomes, study design) information. Next, embedding-based search was applied to select the desired subset of studies to include in the comparison. LLMs (including multimodal LLMs) and optical character recognition (OCR) were then used to extract data elements from text, tables, and figures and to standardize data formats as inputs to the NMA for estimating pairwise treatment effects. The framework was evaluated on the data extraction accuracy of LLMs and OCR at multiple stages, and on the operating characteristics of the resulting NMA estimates compared with a manual SLR and NMA.
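The embedding-based selection step can be illustrated with a minimal sketch. It assumes each study and the inclusion query have already been turned into numeric vectors by some embedding model; the abstract does not say which model, similarity measure, or cutoff was used, so the cosine similarity, the 0.8 threshold, the function names, and the toy 3-dimensional vectors below are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_studies(query_vec, study_vecs, threshold=0.8):
    """Return (study_id, score) pairs scoring at or above the threshold, best first.

    The threshold is an assumed illustrative cutoff, not a value from the abstract.
    """
    scored = [(sid, cosine_similarity(query_vec, vec)) for sid, vec in study_vecs.items()]
    kept = [(sid, score) for sid, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; real embeddings have hundreds or thousands of dimensions.
query = [1.0, 0.0, 1.0]
studies = {
    "study_A": [0.9, 0.1, 0.8],
    "study_B": [0.0, 1.0, 0.0],
    "study_C": [0.5, 0.5, 0.5],
}

selected = select_studies(query, studies, threshold=0.8)
print([sid for sid, _ in selected])
```

In this toy example, study_B is orthogonal to the query and is dropped, while study_A and study_C pass the cutoff. In practice the threshold would need to be tuned against a manually screened reference set.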

Results: The accuracy of the LLMs in extracting PICOS information during the SLR for different anti-obesity medications ranged from 87.5% to 95.5%, with Claude 3.5 Sonnet having the highest content agreement. The remaining extraction errors led to downstream bias in the selected set of studies and in the NMA model. After filtering the set of studies, LLMs combined with OCR achieved 95% to 99% accuracy in extracting and formatting the inputs needed to conduct the NMA.
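To illustrate how a single extraction error can carry through to a pairwise estimate, the sketch below uses the simplest anchored indirect comparison (the Bucher method), in which treatments A and B are each compared with a common comparator C. This is a swapped-in simplification and not the NMA model used in this work, which the abstract does not specify; all effect sizes and standard errors are made-up illustrative numbers.

```python
import math


def indirect_comparison(d_ac, se_ac, d_bc, se_bc, z=1.96):
    """Bucher anchored indirect comparison of A vs B through common comparator C.

    d_ac, d_bc: treatment effects of A vs C and B vs C on an additive scale
                (e.g. mean difference or log odds ratio).
    se_ac, se_bc: their standard errors, assumed to come from independent trials.
    Returns the indirect A vs B effect, its standard error, and a z-based interval.
    """
    d_ab = d_ac - d_bc
    se_ab = math.sqrt(se_ac ** 2 + se_bc ** 2)  # variances add for independent estimates
    ci = (d_ab - z * se_ab, d_ab + z * se_ab)
    return d_ab, se_ab, ci


# Hypothetical inputs: mean differences versus placebo (C), not data from the study.
correct = indirect_comparison(d_ac=-5.0, se_ac=0.3, d_bc=-3.0, se_bc=0.4)

# Same inputs, except the B vs C effect is mis-extracted as -3.5 instead of -3.0.
mis_extracted = indirect_comparison(d_ac=-5.0, se_ac=0.3, d_bc=-3.5, se_bc=0.4)

bias = mis_extracted[0] - correct[0]
print(f"correct A vs B: {correct[0]:.2f}, with extraction error: {mis_extracted[0]:.2f}, bias: {bias:.2f}")
```

With these made-up inputs, an error of 0.5 in one extracted effect shifts the indirect A vs B estimate by the same amount, which is why extraction accuracy at each stage matters for the final comparison.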

Conclusions: Our results demonstrate the feasibility and promising performance of a stagewise implementation of AI to significantly improve the efficiency of conducting SLR and NMA while maintaining robust evidence synthesis fit for decision-making.