Chief Scientific Officer, TriNetX, Needham, Massachusetts, United States
Background: Clinical trials often enroll patient populations that are not representative of the target clinical population. This results in a lack of knowledge about who would benefit from the drug if approved.
Objectives: To evaluate a process that uses multiple real-world data (RWD) sources to generate clinical trial diversity targets to help improve representativeness of trial populations in multiple disease areas.
Methods: Define the Condition of Interest: We first created a clinical phenotype using the ICD-10 diagnosis code(s) that best represent the disease of interest. Conditions were chosen based on known race and ethnicity (R/E) underrepresentation in clinical trials and included breast cancer, lupus, and Alzheimer’s disease, among many others. Explore Data Sources: We then generated patient counts, stratified by R/E, for each phenotype across data sources, including healthcare encounter data (TriNetX and IQVIA Ambulatory Electronic Medical Record Database), government surveillance data (CDC’s National Health Interview Survey [NHIS]; Surveillance, Epidemiology, and End Results Program [SEER]), and published literature. For each data source and each condition, annual prevalence counts and representation by age, sex, race, and ethnicity were presented. For each condition, the share of each R/E category was compared across data sources. No imputation for missing R/E values was performed.
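The comparison of R/E shares across data sources can be illustrated with a minimal Python sketch. It is not the authors' implementation: the source names, category labels, and counts are hypothetical placeholders, and the function name `re_shares` is introduced here for illustration only. The sketch assumes that patients with missing R/E are left out of the share denominator (consistent with no imputation) and that the fraction missing is reported separately; the abstract does not state which denominator was used.

```python
from collections import defaultdict

# Hypothetical, made-up counts for illustration only; NOT results from the study.
# Each record: (data_source, race_ethnicity_category, patient_count).
# A category of None marks patients whose R/E value is missing.
COUNTS = [
    ("source_a", "Group 1", 600),
    ("source_a", "Group 2", 300),
    ("source_a", None, 100),
    ("source_b", "Group 1", 50),
    ("source_b", "Group 2", 50),
]


def re_shares(records):
    """Return per-source R/E shares among patients with a known R/E value.

    Missing R/E values are not imputed. They are excluded from the share
    denominator (an assumption of this sketch) and reported separately as
    "missing_fraction", the fraction of all patients with missing R/E.
    """
    known = defaultdict(lambda: defaultdict(int))
    missing = defaultdict(int)
    for source, category, count in records:
        if category is None:
            missing[source] += count
        else:
            known[source][category] += count

    result = {}
    for source in sorted(set(known) | set(missing)):
        known_total = sum(known[source].values())
        total = known_total + missing[source]
        shares = {}
        if known_total:
            shares = {cat: n / known_total for cat, n in known[source].items()}
        result[source] = {
            "shares": shares,
            "missing_fraction": missing[source] / total if total else 0.0,
        }
    return result


if __name__ == "__main__":
    # Print each source's shares side by side for comparison.
    for source, summary in re_shares(COUNTS).items():
        print(source, summary)
```

Reporting the missing fraction next to the shares keeps the incompleteness of R/E data visible when sources are compared, rather than hiding it inside the denominator.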
Results: Four cancers (colon, lung, ovarian, and pancreatic) were available in all four RWD sources, and four chronic conditions (asthma, chronic obstructive pulmonary disease, diabetes, and hypertension) were available in the NHIS, both EMRs, and the literature. Each data source had different strengths and weaknesses. NHIS contained complete R/E information but was limited to a subset of more common conditions: twelve of the 27 (44.4%) conditions explored were reported in the NHIS. EMR data had large sample sizes and high specificity of disease definition, but incomplete R/E data. Less common and rare conditions appeared in the EMR data, but Hispanic or Latino patients were underreported in these encounter-based data. Across all sources, the EMRs were the most likely to yield sufficient patient counts; the most robust estimates for cancer-specific conditions in the literature were derived from SEER data.
Conclusions: Based on our experience using this process to support clinical trial diversity plans, we found important differences in R/E distributions by data source, underscoring the importance of exploring multiple data sources for each condition. This presentation will demonstrate several indication-specific examples evaluating diversity within each real-world data source.