Why integrate or compare data across studies?

The quality and breadth of data and samples collected by studies around the world provide invaluable opportunities to advance knowledge. However, few individual studies provide the large sample sizes and accompanying statistical power required to investigate relatively rare diseases or to address the complex interactions between lifestyle, behaviors, genetic factors, and the social and physical environment. Further, heterogeneity between studies limits our ability to explore similarities and differences across countries or geographic areas. The quest for larger sample sizes, the need for valid cross-study comparisons, and the necessity to make optimal use of available data have led to an increasing interest in co-analyzing data across studies. But to permit valid comparison or integration, data items from individual studies must be “harmonized”.

What is data harmonization?

Harmonization involves achieving or improving comparability of similar measures collected by separate studies or databases for different individuals. Some research programs foster prospective implementation of harmonized measures to collect data across studies, while others turn their efforts to retrospective harmonization and co-analysis of existing datasets.

Why harmonize data?

Studies conceived at different times to meet different needs usually differ in their design and methodologies. Sources of heterogeneity between studies include criteria for recruiting participants, tools for collecting data, and variable formats and annotation schemes. To be meaningfully integrated or compared, the data collected by different studies must be processed to provide the same meanings (i.e., measure inferentially equivalent facts or concepts) and formats (e.g., same categories and coding), and thus be compatible.
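As an illustration only, the sketch below shows what processing two datasets to a common meaning and format might look like for a single variable. Everything in it is hypothetical: the two studies, their field names (`smoking`, `cigs_per_day`), the target coding (0 = not currently smoking, 1 = currently smoking), and the mapping rules are assumptions made for this example, not a description of any actual study or of Maelstrom methods.

```python
# Hypothetical target variable: current smoking status,
# coded 0 = not currently smoking, 1 = currently smoking, None = missing.

# Hypothetical Study A records smoking status as text categories.
STUDY_A_MAP = {"never": 0, "former": 0, "current": 1}


def harmonize_study_a(status):
    """Map Study A text categories to the target coding (None if unmappable)."""
    return STUDY_A_MAP.get(status)


def harmonize_study_b(cigs_per_day):
    """Map Study B's cigarettes-per-day count to the target coding."""
    if cigs_per_day is None:
        return None
    return 1 if cigs_per_day > 0 else 0


# Made-up records standing in for the two source datasets.
study_a = [
    {"id": "A1", "smoking": "former"},
    {"id": "A2", "smoking": "current"},
]
study_b = [
    {"id": "B1", "cigs_per_day": 0},
    {"id": "B2", "cigs_per_day": 12},
    {"id": "B3", "cigs_per_day": None},
]

# Build one pooled dataset in which the variable has the same meaning and format.
pooled = [
    {"id": r["id"], "study": "A", "current_smoker": harmonize_study_a(r["smoking"])}
    for r in study_a
] + [
    {"id": r["id"], "study": "B", "current_smoker": harmonize_study_b(r["cigs_per_day"])}
    for r in study_b
]

for row in pooled:
    print(row)
```

Whether two source measures are close enough to be treated as inferentially equivalent is a scientific judgment; the code only applies a rule once that decision has been made. Values that cannot be mapped are kept as missing rather than guessed.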

To ensure compatibility, investigators can foster prospective harmonization (i.e., implement standard procedures across studies prior to data collection). This renders data integration relatively straightforward, since compatible protocols and data collection tools are employed across studies, resulting in uniform datasets. However, it is not always relevant or possible to implement common protocols across studies. Given the costs and complexity associated with prospective data harmonization, investigators in many fields of research are increasingly opting for retrospective harmonization to support integration or comparison of data collected across pre-existing studies. Since the datasets have already been collected, retrospective harmonization also makes efficient use of existing research data. However, these datasets are considerably more heterogeneous than prospectively harmonized data, which represents a challenge.