Background
The adoption of electronic health records (EHR) and the digitalization of healthcare are generating a deluge of clinical data. These data contain a wealth of information and insights waiting to be mined, and we owe it to our patients to learn from them. Clinical researchers and data scientists can apply a wide range of machine learning algorithms and data science methodologies to these rich data to improve diagnosis, prognosis, and therapeutic decision-making, with the ultimate goal of transforming patient care. However, the promise of ‘big data’ and data science in healthcare is thwarted by significant technical hurdles. One of the major impediments is the lack of sufficiently large, curated datasets. This is particularly true in pediatric critical care. Although we capture and record large amounts of clinical data from our most vulnerable patients, these data are not widely available to researchers in an aggregated and curated format, limiting our ability to learn from them and improve our understanding of critical illness. Furthermore, any single pediatric institution sees only a relatively small number of patients with any given disease or critical care condition, severely limiting the statistical power of single-center studies.
To address the problem, we are building the PICU Data Collaborative (PDC) where:
- Members contribute and share pediatric critical care EHR data across multiple institutions.
- Data undergo rigorous quality assurance and harmonization to facilitate data mining, algorithm development, and model benchmarking.
- Members gather to collaborate, share ideas, discuss novel analyses and findings, develop models, and design data-driven clinical decision support tools.
Technical Overview
The PDC comprises institutions that contribute anonymized pediatric critical care EHR data to a shared data platform, which resides in a private cloud-computing environment. Each PDC site principal investigator is responsible for collecting and aggregating data from their site’s various sources based on the PDC common data model and minimum variable set. All sites apply a common anonymization methodology and algorithm to ensure compliance. In the current Phase One of the PDC, a minimum EHR-based dataset from multiple sites is being aggregated, including demographics, diagnoses, vital signs, laboratory results, medications, and interventions. Future plans include expanding to additional sites and incorporating additional data sources, including clinical notes, bedside monitor waveforms, images, and multi-omic data. To facilitate research and rapid iteration of novel projects, the data platform is being designed to perform data quality assurance and harmonization centrally and to provide machine learning workflows and access to data science workspaces that allow researchers to focus on what really matters: the science.
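As a rough illustration of what a minimum variable set check could look like, the sketch below tests whether a site's extract contains each Phase One data category listed above. The category keys, the function name, and the record layout are all hypothetical; they are assumptions made for illustration and do not describe the PDC's actual common data model or tooling.

```python
# Hypothetical sketch: the category names mirror the Phase One list above,
# but the keys, function, and record layout are assumptions, not the PDC schema.
REQUIRED_CATEGORIES = {
    "demographics",
    "diagnoses",
    "vital_signs",
    "laboratory_results",
    "medications",
    "interventions",
}


def missing_categories(site_extract: dict) -> list:
    """Return the required categories that are absent or empty in an extract."""
    return sorted(
        name for name in REQUIRED_CATEGORIES
        if not site_extract.get(name)  # a missing key or an empty list both count
    )


# Made-up example extract with anonymized placeholder records.
example_extract = {
    "demographics": [{"patient_id": "anon-001", "age_months": 14}],
    "vital_signs": [{"patient_id": "anon-001", "heart_rate": 128}],
    "medications": [],
}

print(missing_categories(example_extract))
```

A check of this kind would most naturally run centrally, before harmonization, so that incomplete submissions are flagged early.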

Governance and Scientific Oversight
The PDC has both an Executive Committee and a Scientific Committee, each formed by members of the participating sites. The chair and vice chairs of the committees are elected by the members and serve two-year terms. The Executive Committee oversees the policies, procedures, and other technical aspects of the PDC and the cloud-based data platform. The Scientific Committee oversees the data quality assurance and harmonization pipeline, the prioritization of scientific projects, and the data governance of the PDC.
By the Numbers

