STEP 6: Data validation
Data validation in a research project refers to the process of ensuring the quality and accuracy of the data collected during the study (e.g., Breck et al., 2019 for machine learning projects). This step begins as early as study design (Step 2) and should be revisited continuously throughout data collection.
Data quality. In this context, quality control refers to the continuous process of evaluating the data and procedures such as SOPs for completeness, accuracy, and consistency, and identifying and removing any errors (Freire, 2021). This may include checks for missing data, incorrect data entry, or other issues that could impact the validity of the study and the subsequent interpretation of the results, but also ensuring your data are FAIR (“Findable, Accessible, Interoperable, and Reusable”; Wilkinson et al., 2016; ➜ FAIR data or ➜ RDMkit).
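Such checks can often be automated and rerun every time new data arrive. The sketch below is one possible illustration in Python: it assumes a hypothetical tabular dataset with columns participant_id, rt_ms, and condition, and the plausibility thresholds and condition labels are assumptions, not prescriptions. It flags missing values, implausible values, and labels not defined in the study documentation.

```python
# Minimal sketch of automated quality checks, assuming a hypothetical
# tabular dataset with columns "participant_id", "rt_ms", and "condition".
import pandas as pd

def check_quality(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable quality issues found in the data."""
    issues = []

    # Completeness: flag columns with missing values.
    missing = df.isna().sum()
    for column, n_missing in missing[missing > 0].items():
        issues.append(f"{column}: {n_missing} missing value(s)")

    # Accuracy: flag values outside a plausible range (thresholds are assumptions).
    if "rt_ms" in df.columns:
        implausible = df[(df["rt_ms"] < 100) | (df["rt_ms"] > 5000)]
        if not implausible.empty:
            issues.append(f"rt_ms: {len(implausible)} value(s) outside 100-5000 ms")

    # Consistency: flag condition labels not defined in the SOP (labels are assumptions).
    if "condition" in df.columns:
        unexpected = set(df["condition"].dropna()) - {"control", "treatment"}
        if unexpected:
            issues.append(f"condition: unexpected label(s) {unexpected}")

    return issues

# Example usage with a small in-memory dataset.
df = pd.DataFrame({
    "participant_id": [1, 2, 3],
    "rt_ms": [350, 12000, None],
    "condition": ["control", "treatment", "ctrl"],
})
for issue in check_quality(df):
    print(issue)
```

Running such checks as a script (rather than by eye) makes it easier to document what was checked and to repeat the checks on every new batch of data.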
Data accuracy. Data wrangling, also known as data munging, is the process of transforming and mapping data from one “raw” format into another, with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics (see glossary; Endel & Piringer, 2015; Kandel et al., 2011). The ultimate goal of this step is to clean, organize, document, and preserve the data for future use. This may include creating detailed metadata, documenting the data collection and cleaning process, and storing the raw and processed data in a secure and accessible format (which might mean that the software and version used to gather and process the data have to be stored as well). Beyond these tasks, aspects like data quality, merging data from different sources, creating reproducible processes, and data provenance are equally important. Regarding preprocessing of data, many fields already offer established standards (e.g., Loenneker et al., 2024 for reaction time data).
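As one possible illustration of a documented, reproducible wrangling step, the Python sketch below assumes hypothetical file names (raw_data.csv, processed_data.csv) and a deliberately simple transformation. It leaves the raw file untouched, writes the processed data to a new file, and stores a small metadata record capturing what was done, when, and with which software versions, so the step can be traced and repeated later.

```python
# Minimal sketch of a documented wrangling step, assuming hypothetical file
# names ("raw_data.csv", "processed_data.csv") and a simple transformation.
import json
import platform
from datetime import datetime, timezone

import pandas as pd

RAW_PATH = "raw_data.csv"
PROCESSED_PATH = "processed_data.csv"
METADATA_PATH = "processed_data.metadata.json"

# Load the raw data; the raw file itself stays untouched on disk.
raw = pd.read_csv(RAW_PATH)

# Example transformation: harmonize column names and drop exact duplicates.
processed = raw.rename(columns=str.lower).drop_duplicates()
processed.to_csv(PROCESSED_PATH, index=False)

# Record provenance alongside the processed file: what was done, when,
# and with which software versions, so the step can be reproduced later.
metadata = {
    "source_file": RAW_PATH,
    "output_file": PROCESSED_PATH,
    "created_utc": datetime.now(timezone.utc).isoformat(),
    "python_version": platform.python_version(),
    "pandas_version": pd.__version__,
    "transformations": ["lowercased column names", "dropped duplicate rows"],
    "n_rows_raw": len(raw),
    "n_rows_processed": len(processed),
}
with open(METADATA_PATH, "w") as f:
    json.dump(metadata, f, indent=2)
```

Keeping such a record next to each processed file is one lightweight way to preserve provenance; field-specific standards or dedicated workflow tools can serve the same purpose.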
In sum, this step contributes substantially to the replicability of the study’s findings and to the ability to build upon the research in future studies. It can be started as soon as the first (pilot) data are collected, leading to the next step, data analysis.
Key questions in this step
How can we ensure the quality and accuracy of the data?*
How can we store the data reproducibly?*
Questions marked with an asterisk should ideally be explored before starting a project.