STEP 6: Data validation

Data validation in a research project refers to the process of ensuring the quality and accuracy of the data collected during the study (see, e.g., Breck et al., 2019, for machine learning projects, and our tool ARIADNE for specific resources). Relatedly, quality control refers to the continuous process of evaluating the data, or procedures such as SOPs, for completeness, accuracy, and consistency, and of identifying and removing errors or outliers (Freire, 2021). This may include checks for missing data, incorrect data entry, or other issues that could compromise the validity of the study and the subsequent interpretation of the results. It also includes ensuring that your data are FAIR (➜ FAIR data or ➜ RDMkit), which will facilitate the later publication of the data along with the paper (Step 9).

Data wrangling, also known as data munging, is the process of transforming and mapping data from one “raw” form into another format that is more appropriate and valuable for downstream purposes such as analytics (see Table 1; Endel & Piringer, 2015; Kandel et al., 2011).

The ultimate goal of this step is to clean, organize, document, and preserve the data for future use. This may include creating detailed metadata, documenting the data collection and cleaning process, and storing the raw and processed data in a secure and accessible format (which might mean that the software [version] used to gather and process the data has to be stored as well). Aspects such as data quality, merging data from different sources, creating reproducible processes, and data provenance are equally important. Importantly, these validation practices should be implemented throughout data collection, not only after it has finished.

In sum, this step contributes essentially to the robustness of the study’s findings and to the ability to replicate or build upon the research in future studies. It can be started as soon as the first data are collected, leading directly to the next step, data analysis.
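As a minimal sketch of such quality-control checks, the following Python snippet screens a tabular dataset for missing values, implausible entries, and duplicate records. The file name, column names, and plausibility thresholds are illustrative assumptions, not fixed recommendations; adapt them to your own data.

```python
import pandas as pd

# Hypothetical raw dataset; file and column names are assumptions for illustration.
df = pd.read_csv("raw/participants.csv")

issues = []

# Missing values in columns that must always be present.
for col in ["participant_id", "age", "reaction_time_ms"]:
    n_missing = df[col].isna().sum()
    if n_missing:
        issues.append(f"{col}: {n_missing} missing values")

# Values outside a plausible range: a simple screen for entry errors and
# outliers (the thresholds here are assumed, not universal).
implausible = df[(df["reaction_time_ms"] < 100) | (df["reaction_time_ms"] > 5000)]
if not implausible.empty:
    issues.append(f"reaction_time_ms: {len(implausible)} implausible values")

# Duplicate records, e.g., from double data entry.
n_dup = df.duplicated(subset="participant_id").sum()
if n_dup:
    issues.append(f"participant_id: {n_dup} duplicate entries")

# Report rather than silently fix, so that every correction is documented.
for issue in issues:
    print("VALIDATION:", issue)
```

Running such a script routinely during data collection, rather than once at the end, makes it possible to catch and document problems while they can still be corrected at the source.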
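A data-wrangling step might then look like the following sketch, which derives a cleaned copy of the table while leaving the raw file untouched. Again, the column names and transformations are hypothetical examples under the same assumed dataset as above.

```python
from pathlib import Path

import pandas as pd

raw = pd.read_csv("raw/participants.csv")

tidy = (
    raw.rename(columns=str.lower)            # consistent, lower-case column names
    .assign(age=lambda d: pd.to_numeric(d["age"], errors="coerce"))  # typed-in strings -> numbers
    .dropna(subset=["participant_id"])       # drop rows without an identifier
    .drop_duplicates(subset="participant_id")
)

# Keep the raw file untouched; write the cleaned copy to a separate folder.
Path("derivatives").mkdir(exist_ok=True)
tidy.to_csv("derivatives/participants_clean.csv", index=False)
```

Separating raw and derived data in this way supports reproducibility: the raw files remain the unaltered source of record, and every cleaned version can be regenerated by rerunning the script.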
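Finally, documenting provenance can be as simple as writing a small metadata file next to the processed data that records the software versions and processing steps used. The fields below are an assumed minimal set for illustration, not a standard schema.

```python
import json
import platform
from datetime import datetime, timezone

import pandas as pd

# Illustrative provenance record stored next to the processed file;
# field names and contents are assumptions, not a fixed standard.
provenance = {
    "source_file": "raw/participants.csv",
    "output_file": "derivatives/participants_clean.csv",
    "processed_on": datetime.now(timezone.utc).isoformat(),
    "python_version": platform.python_version(),
    "pandas_version": pd.__version__,
    "processing_steps": [
        "lower-cased column names",
        "coerced age to numeric",
        "removed rows without participant_id",
        "dropped duplicate participant_id entries",
    ],
}

with open("derivatives/participants_clean.json", "w") as f:
    json.dump(provenance, f, indent=2)
```

Storing such a record alongside each derived file captures which software (and version) produced it, which addresses the point above that the tools used to gather and process the data may need to be preserved as well.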