Background: In real-world data (RWD) studies, characterizing smoking behavior is at times necessary because smoking is a strong determinant of health outcomes, yet many data sources lack structured smoking data. Haque et al. (2024) reviewed smoking characterization in RWD studies and concluded that characterization methods based on unstructured data were more valid than those based on structured, coded data.
Objectives: In an electronic medical record (EMR) database without structured smoking data, we characterized smoking status, pack-years, and related behaviors from encounter-level problem text strings.
Methods: We established a cohort of cancer-free patients aged 50–80 years from the IQVIA Ambulatory EMR - U.S. database with relevant index encounters in 2014-2019. Inclusion required healthcare engagement in the prior 365 days; patients with a history of cancer or cancer therapy were excluded. To characterize smoking behavior, we searched ‘problem text strings’ for mentions of smoking, restricted to records that were patient-specific and not tagged as erroneous. Records were then labeled as nonsmoker, former smoker, or current smoker based on additional string terms such as “no/not”, “previously”, “quit”, “history”, and “stop%”, and on past versus present tense. To categorize status on the index date, when a patient had both current and former records, the more recent record was prioritized. Smoking records were prioritized over nonsmoking records, and nonsmoking records over unknown-status records. Patients with no mention of smoking at any time were categorized as missing. Among smokers, we characterized pack-years with similar restrictions (non-erroneous, patient-specific records). Through similar steps we characterized tobacco use, nicotine dependence, and second-hand smoke exposure. We assessed counts, frequencies, and measures of association using the characterized data.
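The labeling and prioritization steps described in the Methods can be sketched as follows. This is a minimal illustration, not the study's implementation: the regular expressions, the `label_record` and `status_at_index` names, the "unknown" term, and the restriction to records dated on or before the index date are all assumptions, since the abstract lists only a few of the terms used and does not specify a lookback window.

```python
import re
from datetime import date

# Hypothetical term lists; the abstract names only some of the actual terms.
SMOKING = re.compile(r"smok|cigarette", re.I)
UNKNOWN = re.compile(r"\bunknown\b", re.I)
NEGATION = re.compile(r"\b(no|not|never)\b|\bnon-?smok", re.I)
FORMER = re.compile(r"\b(previously|quit|history|former|stop\w*)\b", re.I)


def label_record(text):
    """Label one problem text string; None means no smoking mention."""
    if not SMOKING.search(text):
        return None
    if UNKNOWN.search(text):
        return "unknown"
    if NEGATION.search(text):
        return "non"
    if FORMER.search(text):
        return "former"
    return "current"


def status_at_index(records, index_date):
    """Assign a status from (record_date, text) pairs.

    Assumption: only records dated on or before index_date are considered.
    """
    labeled = [(d, label_record(t)) for d, t in records if d <= index_date]
    labeled = [(d, lab) for d, lab in labeled if lab is not None]

    # Smoking records outrank nonsmoking and unknown records;
    # between current and former, the more recent record wins.
    smoking = [(d, lab) for d, lab in labeled if lab in ("current", "former")]
    if smoking:
        return max(smoking, key=lambda rec: rec[0])[1]

    labels = {lab for _, lab in labeled}
    if "non" in labels:
        return "non"
    if "unknown" in labels:
        return "unknown"
    return "missing"  # no mention of smoking


# Invented example records, for illustration only.
records = [
    (date(2015, 1, 1), "Current every day smoker"),
    (date(2016, 1, 1), "Nonsmoker"),
    (date(2017, 6, 1), "Quit smoking"),
]
print(status_at_index(records, date(2018, 1, 1)))  # former
```

A rule-based pass like this is easy to audit, but simple negation matching misreads strings such as "has not quit smoking"; a real implementation would need a reviewed term list and manual validation of a sample.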
Results: Our cohort consisted of 8.152 million patients (female: 58%; White: 75%; median age: 62 years; nonsmokers: 37%; current smokers: 12%; former smokers: 16%). A lung cancer diagnosis within 3 years of the index encounter date occurred in 0.2% of patients (n=15,185). Compared with nonsmokers, lung cancer was associated with current smoking (OR=7.3; 95% CI: 6.9-7.8) and former smoking (OR=4.8; 95% CI: 4.5-5.1). Among all smokers, we could characterize pack-years in 0.34% (n=8,000). Greater pack-years were significantly associated with lung cancer: 21-30 pack-years (OR=4.7; 95% CI: 2.1-10.4) and 31+ pack-years (OR=3.7; 95% CI: 2.1-6.8) versus missing pack-years.
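For readers less familiar with the measure reported above, an odds ratio and its 95% confidence interval can be computed from a 2x2 table as sketched below. The abstract does not state whether the reported ORs were adjusted or how the intervals were derived, so the unadjusted estimate with a Wald interval shown here is an assumption, the `odds_ratio_ci` name is hypothetical, and the counts are invented for illustration and are not study data.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    a: exposed cases      b: exposed non-cases
    c: unexposed cases    d: unexposed non-cases
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Invented counts, for illustration only (not from the study).
or_, lower, upper = odds_ratio_ci(a=20, b=80, c=10, d=90)
print(f"OR={or_:.2f}; 95% CI: {lower:.2f}-{upper:.2f}")
```

With small cell counts, as in the pack-years subgroups, the Wald interval becomes wide, which is consistent with the broader intervals reported for those categories.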
Conclusions: We leveraged unstructured data to characterize smoking status and related characteristics. Our approach assigned a smoking status to nearly 65% of all patients in our cohort. The observed associations with lung cancer suggest that the method characterizes smoking in a scientifically meaningful way from encounter-level records when structured smoking data are absent.