Although data from multiple studies are widely compared and co-analysed, data harmonization and the related decision-making process are often not structured and documented systematically enough to be efficient, replicable and easily communicated. The step-by-step guidelines outlined in this section were developed by the Maelstrom Research team to fill this gap. They aim to give investigators a means of ensuring the quality, reproducibility and transparency of the results of multi-centre research.

For more information, please see: Fortier, Isabel, et al. "Maelstrom Research guidelines for rigorous retrospective data harmonization." International Journal of Epidemiology (2016): dyw075.

  • Step 0. Define the research question(s), objectives and protocol
    Before the harmonization process begins, it is necessary to develop a protocol reflecting the potential and limitations of the collaborative research project and to clearly define the research question(s) and objectives.
  • Step 1. Assemble information and select studies
    As a starting point for the harmonization process, investigators must acquire appropriate knowledge and understanding of each study. This includes documenting all relevant characteristics of each study, such as its design and the type and format of its data. Participating studies should also be selected based on rigorous criteria.
  • Step 2. Define variables and evaluate harmonization potential
    To ensure content equivalence of the new (harmonized) variables to be created across studies, these variables must be defined. The group of core variables targeted for harmonization is called the DataSchema. It is important to thoroughly evaluate the potential for each study to construct each DataSchema variable.

    Retrospective data harmonization requires finding a satisfactory (scientifically valid) balance between two approaches: accepting only precisely uniform variables (e.g. the exact same question or standard operating procedure), which makes pooling straightforward but limits the potential to integrate multiple studies; and accepting a certain level of heterogeneity across participating studies, which provide similar but not necessarily identical data. It is important to note that definitions of DataSchema variables and harmonization potential are context-specific and vary according to a project’s scientific objectives and the level of precision needed for the planned analyses. That is, information considered ‘compatible’ in one project may not be ‘compatible’ in another.

  • Step 3. Process data
    To enable data processing, it is essential to ensure that all the study-specific data items required to generate the DataSchema variables are available and that the collected information is valid. The approach used to process data into a common format will vary depending on the variables to be harmonized, the data collected by each study and whether the data can be pooled.
  • Step 4. Estimate quality of the harmonized dataset(s) generated
    Quality control procedures must be implemented to ensure that statistical analyses are conducted on data of acceptable quality. Evaluating the quality of the harmonized dataset is essential to the validity of the analysis results.
  • Step 5. Disseminate and preserve the final harmonization products
    As a final step, it is essential to adequately preserve and disseminate the harmonized data and provide access to all documentation needed to understand the quality and content of the harmonized dataset.
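
As an illustration only, Steps 2 to 4 above can be sketched in a few lines of Python. This is a minimal sketch and not the Maelstrom Research tooling: the DataSchema variables (height_cm, current_smoker), the study names, the study-specific item names and the helper functions are all hypothetical and were chosen for the example. It shows a small DataSchema, a per-study check of harmonization potential, the processing of study records into the common format, and a basic completeness indicator.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ProcessingRule:
    """Hypothetical rule: how one study builds one DataSchema variable."""
    source_item: str                 # name of the study-specific data item
    transform: Callable[[Any], Any]  # maps the collected value to the common format


# Hypothetical DataSchema (Step 2): target variable -> definition.
DATASCHEMA = {
    "height_cm": "Standing height in centimetres",
    "current_smoker": "Currently smokes cigarettes (True/False)",
}

# Hypothetical study-specific rules. A missing entry means the study
# cannot construct that DataSchema variable.
RULES = {
    "study_a": {
        "height_cm": ProcessingRule("height_m", lambda v: round(v * 100, 1)),
        "current_smoker": ProcessingRule("smoke_status", lambda v: v == "current"),
    },
    "study_b": {
        "height_cm": ProcessingRule("ht_cm", float),
        # No smoking item was collected, so no rule is defined here.
    },
}


def harmonization_potential(rules, dataschema):
    """Step 2: record, per study, which DataSchema variables can be built."""
    return {
        study: {var: var in study_rules for var in dataschema}
        for study, study_rules in rules.items()
    }


def process_record(study, record, rules, dataschema):
    """Step 3: generate the DataSchema variables from one study record."""
    harmonized = {}
    for var in dataschema:
        rule = rules[study].get(var)
        value = record.get(rule.source_item) if rule else None
        # Missing rule or missing source value -> missing harmonized value.
        harmonized[var] = None if value is None else rule.transform(value)
    return harmonized


def completeness(records, dataschema):
    """Step 4: share of non-missing values for each DataSchema variable."""
    return {
        var: sum(r[var] is not None for r in records) / len(records)
        for var in dataschema
    }


harmonized = [
    process_record("study_a", {"height_m": 1.72, "smoke_status": "current"},
                   RULES, DATASCHEMA),
    process_record("study_b", {"ht_cm": "165"}, RULES, DATASCHEMA),
]
print(harmonization_potential(RULES, DATASCHEMA))
print(harmonized)
print(completeness(harmonized, DATASCHEMA))
```

In a real project the rules would encode the context-specific decisions described under Step 2, and the quality evaluation would go well beyond completeness. The sketch only shows how the documented decisions, the processing and the quality checks can be kept explicit and reproducible.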

