Investigators and consortium administrators face many challenges when seeking to co-analyze harmonized data. For example, ethical, legal, and consent-related restrictions often limit the transfer of data to external users. The data infrastructure used to support co-analysis necessarily depends on such restrictions.

There are three general approaches to analyzing harmonized data across collaborating studies: pooled data analysis, summary data meta-analysis, and federated data analysis. The first two approaches, pooling individual-level data in a central location and meta-analyzing summary data from participating studies, are commonly used in multi-centre research projects. In addition, Maelstrom Research and its partners are proposing a new method for co-analyzing harmonized data across multiple studies: performing federated analysis of geographically dispersed datasets.

Pooled data analysis

When analysis is to be undertaken using data from several sources, efficiency and flexibility are often best served by working directly with pooled individual-participant data rather than by meta-analyzing summarized results. Using this model, individual-participant data are physically transferred to a central server where they are harmonized. The data are then analyzed as if they came from a single study, with study heterogeneity terms incorporated into the statistical models if required.
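To make the model concrete, the R sketch below pools hypothetical study datasets and fits a single regression with a study term to absorb between-study heterogeneity; the data frames, variable names, and model are invented for illustration and are not Maelstrom code.

    # Hypothetical pooled analysis: each study's harmonized data frame
    # shares the same variables, including a study identifier column.
    pooled <- rbind(study1, study2, study3)

    # Analyze the pooled data as a single dataset, with a fixed study
    # term incorporated to capture between-study heterogeneity.
    fit <- glm(outcome ~ bmi + age + factor(study),
               data = pooled, family = binomial)
    summary(fit)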

The figure to the left shows the procedures and actions necessary to undertake pooled data analysis both locally (on the computers within the host institutions of the participating studies) and centrally (on a central computer).

While the pooled data analysis approach provides flexibility in conducting statistical analyses since data from all studies are centrally stored, managed and analyzed, there are major governance, ethical and legal challenges to physically pooling data. For example, ethico-legal constraints such as the wording of consent forms and privacy legislation often prohibit or discourage the sharing of individual-level data, particularly across national or other jurisdictional boundaries.


Summary data meta-analysis

Summary data meta-analysis is a popular method for combining results across multiple studies, whether to identify patterns or sources of disagreement between studies or to increase statistical power and thereby better detect and quantify the effects of risk factors on outcomes. Using this approach, investigators begin by harmonizing data across participating studies to ensure that compatible constructs are being compared. Statistical analyses of individual participant data are then carried out separately for each study (i.e. on local computers) to produce study-level estimates (these estimates can also be obtained from existing publications). The study-level estimates are then pooled using conventional meta-analysis to obtain a weighted average that combines the individual study results.
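As an illustration, the weighted average at the heart of this approach can be computed in a few lines of R; the study-level estimates below are hypothetical, and a simple fixed-effect, inverse-variance scheme is assumed (real analyses would typically use a dedicated meta-analysis package and might allow for random effects).

    # Hypothetical study-level results: effect estimates and standard
    # errors as they might be reported by three participating studies.
    beta <- c(0.12, 0.08, 0.15)
    se   <- c(0.04, 0.05, 0.06)

    # Fixed-effect, inverse-variance weighting: each estimate is
    # weighted by the reciprocal of its variance.
    w           <- 1 / se^2
    pooled_beta <- sum(w * beta) / sum(w)
    pooled_se   <- sqrt(1 / sum(w))

    pooled_beta                                # weighted average
    pooled_beta + c(-1.96, 1.96) * pooled_se   # 95% confidence interval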

The figure to the left shows the procedures and actions necessary to undertake summary data meta-analysis both locally (on the computers within the host institutions of the participating studies) and centrally (on a central computer).

Summary data meta-analysis requires only limited ethics review and data access procedures, since what is shared between each study and the investigator undertaking the analysis is summary statistics rather than potentially disclosive individual-level data. However, the analyses available to the researcher are limited to the summary statistics each study produces. If new questions arise or additional parameters are needed, each study must produce new summary statistics.


Federated data analysis

The federated data analysis model is equivalent to pooled individual-participant data analysis, except that the individual participant data remain on local servers while a federated IT system enables their co-analysis.

The figure to the left shows the procedures and actions necessary to undertake federated data analysis both locally (on the computers within the host institutions of the participating studies) and centrally (on a central computer).

Our team has developed open-source tools to support groups of studies interested in undertaking federated data analysis. DataSHIELD (www.datashield.org)1, the methodology developed to achieve this, coordinates the parallelized, simultaneous analysis of individual-level data hosted on geographically dispersed servers. To do so, a secure internet connection (HTTPS) is set up between a central analysis computer and the servers hosting the harmonized individual participant data. Through this connection, the central computer sends blocks of code to each data server, requesting that it undertake a particular analysis and return non-disclosive summary statistics. Throughout the process, individual participant data from contributing studies are held securely on geographically dispersed servers. Analyses are performed locally, so all data stay at source, within the governance structure and control of the originating study.
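A typical session might look like the R sketch below, assuming the DSI, DSOpal and dsBaseClient packages used with DataSHIELD; the server URLs, credentials, and table names are hypothetical.

    # Hypothetical DataSHIELD session; URLs, credentials and table
    # names are invented for illustration.
    library(DSI)
    library(DSOpal)
    library(dsBaseClient)

    builder <- DSI::newDSLoginBuilder()
    builder$append(server = "study1", url = "https://opal.study1.example.org",
                   user = "analyst", password = "******",
                   table = "project.harmonized", driver = "OpalDriver")
    builder$append(server = "study2", url = "https://opal.study2.example.org",
                   user = "analyst", password = "******",
                   table = "project.harmonized", driver = "OpalDriver")

    # Connect over HTTPS and assign each study's harmonized table to the
    # symbol D on that study's own server; individual-level data never move.
    conns <- DSI::datashield.login(logins = builder$build(),
                                   assign = TRUE, symbol = "D")

    # Each server computes locally and returns only non-disclosive summaries.
    ds.mean("D$bmi", datasources = conns)

    DSI::datashield.logout(conns)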

The federated data analysis approach enables collaborating studies to participate in combined analyses in a secure, scalable and sustainable manner. Unlike data sharing initiatives based on central data deposition, the federated approach allows studies to remain in complete control of their data. And unlike meta-analysis of study-level estimates, it allows investigators to safely and remotely analyze data at their convenience and in real time, avoiding the significant delays involved in waiting for each individual study to produce and provide the required summary statistics.


1. DataSHIELD acts as an interface module between Maelstrom’s Opal software and the R statistical environment. Learn more about DataSHIELD and the Maelstrom Research resources in the Software section.