Background: Tree-based scan statistics are a class of models that use hierarchical classification systems to identify increased risk of outcomes in an exposed group at multiple levels of resolution. They are increasingly used in pharmacovigilance, but because they are sensitive to the tree-structured input variable and rely on resampling-based control of multiple testing, their statistical properties are difficult to characterize.
Objectives: To establish a flexible, open-source pipeline that supports collaboration on, and evaluation of, tree-based scan statistics for use in pharmacovigilance.
Methods: The pipeline takes as input a dataframe of the study population and diagnoses, a hierarchical tree, a target outcome to simulate with its effect size, and several parameters for the power calculation. It implements a plasmode simulation procedure that injects the effect of interest while preserving the empirical associations among the variables. The tree-based scan statistic is calculated using existing open-source software, and power is estimated as the proportion of simulated datasets in which the simulated outcome is detected. The pipeline can be adapted to the user’s computing environment to mitigate the long processing times associated with large trees and large numbers of replicates.
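The power calculation described in the Methods can be illustrated with a minimal Python sketch. It is not the pipeline's code. The function names (plasmode_replicate, toy_detector, estimate_power), the column names ("exposed", "dx_leaf"), and the way the effect is injected are all assumptions made for illustration. In particular, toy_detector is a simple one-sided two-proportion z-test on a single outcome, used only as a stand-in for the tree-based scan statistic software, which would instead scan every node of the tree and control for multiple testing.

```python
import math
import random


def plasmode_replicate(rows, outcome, relative_risk, rng):
    """Build one simulated dataset (hypothetical helper, RR >= 1 only)."""
    # Resample whole rows with replacement so the empirical associations
    # among the observed variables carry over into the replicate.
    sample = [dict(rng.choice(rows)) for _ in rows]
    exposed = [r for r in sample if r["exposed"]]
    if not exposed:
        return sample
    p0 = sum(r[outcome] for r in exposed) / len(exposed)  # observed risk
    p1 = min(1.0, relative_risk * p0)                     # target risk
    # Probability of converting an exposed non-case into a case.
    flip = (p1 - p0) / (1.0 - p0) if p0 < 1.0 else 0.0
    for r in exposed:
        if r[outcome] == 0 and rng.random() < flip:
            r[outcome] = 1
    return sample


def toy_detector(sample, outcome, z_crit=1.645):
    """Placeholder for the scan statistic: one-sided two-proportion z-test."""
    exp = [r[outcome] for r in sample if r["exposed"]]
    unexp = [r[outcome] for r in sample if not r["exposed"]]
    if not exp or not unexp:
        return False
    pooled = (sum(exp) + sum(unexp)) / (len(exp) + len(unexp))
    se = math.sqrt(pooled * (1 - pooled) * (1 / len(exp) + 1 / len(unexp)))
    diff = sum(exp) / len(exp) - sum(unexp) / len(unexp)
    return se > 0 and diff / se > z_crit


def estimate_power(rows, outcome, relative_risk, n_replicates, detector, seed=0):
    """Power = proportion of replicates in which the outcome is detected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_replicates):
        replicate = plasmode_replicate(rows, outcome, relative_risk, rng)
        hits += bool(detector(replicate, outcome))
    return hits / n_replicates


# Toy population: 400 people, half exposed, 10% baseline risk in each arm.
rows = [{"exposed": i % 2, "dx_leaf": 1 if i % 20 < 2 else 0} for i in range(400)]

power_null = estimate_power(rows, "dx_leaf", 1.0, 50, toy_detector)
power_alt = estimate_power(rows, "dx_leaf", 3.0, 50, toy_detector)
print(f"power at RR=1.0: {power_null:.2f}, power at RR=3.0: {power_alt:.2f}")
```

Passing the detector as an argument keeps the simulation loop independent of the detection method, so the placeholder test could in principle be replaced by a call to scan statistic software without changing the power calculation. Replicates are independent of one another, which is what makes the computation straightforward to distribute across processors or cluster nodes.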
Results: The pipeline, along with instructional vignettes, will be made available via an open-source platform. It will enable a phased exploration of the power of tree-based scan statistics in pharmacovigilance, using trees curated for infant and maternal outcomes as well as a general ICD-10-CM tree published by Sentinel.
Conclusions: The pipeline streamlines the evaluation of the power of tree-based scan statistics, a key question in determining their utility for detecting safety signals associated with drugs, devices, and vaccines. It will facilitate resource sharing, and researchers will be able to use it to determine appropriate sample sizes for tree-based scan statistic studies across a variety of hierarchical trees and target effect sizes.