In our last post, we saw a theoretical framework for classifying arguments around data quality. It helped us get a rough overview of the different perspectives at play in the discussion. Now that we can look at data quality from different angles, we turn to the aspects of data quality that are most relevant in practice and to what we can do to achieve them.
The Empirical Approach
Richard Wang and Diane Strong conducted a very interesting piece of research in the 1990s. In a first step, they asked data consumers to list all attributes that came to mind when thinking about data quality. In a second step, these attributes were ranked by importance. A factor analysis then consolidated the initial 179 attributes into a smaller set of data quality dimensions, grouped into four major categories.
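To make the consolidation step concrete, here is a minimal sketch of how such a dimension reduction could look in Python. The ratings data, the number of factors, and the printed attribute indices are purely illustrative assumptions; Wang and Strong's actual analysis was considerably more elaborate.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical ratings: 200 data consumers rate 179 candidate
# attributes by importance (1-7). Purely illustrative random data.
rng = np.random.default_rng(42)
ratings = rng.integers(1, 8, size=(200, 179)).astype(float)

# Consolidate the 179 attributes into a few latent dimensions.
fa = FactorAnalysis(n_components=4, random_state=42)
fa.fit(ratings)

# The loadings show which attributes cluster on which dimension.
loadings = fa.components_  # shape: (4, 179)
for i, factor in enumerate(loadings):
    top = np.argsort(np.abs(factor))[::-1][:5]
    print(f"Factor {i + 1}: strongest attributes {top.tolist()}")
```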
Intrinsic Data Quality
Intrinsic Data Quality includes “Accuracy” and “Objectivity”, meaning the data needs to be correct and free of partiality. While these two dimensions seem pretty self-explanatory, “Believability” and “Reputation” are less obvious. Interestingly, they are not about the data itself but refer to the source of the data, either the respondents or the fieldwork provider: respondents need to be real and authentic, while the fieldwork provider should be trustworthy and reputable.
Contextual Data Quality
Contextual Data Quality means that some aspects of data quality can only be assessed in light of the task at hand. As this context can vary a lot, attaining high contextual data quality is not always easy. Most of the contextual dimensions (Value-added, Relevancy, Timeliness, Completeness, Appropriate amount of data) require thorough planning before setting up and conducting the research. Conversely, it is really hard to improve contextual data quality once the data has been collected (e.g. by sending reminders to improve completeness).
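Completeness, at least, can be monitored while fieldwork is still running. Here is a minimal sketch, assuming the responses sit in a pandas DataFrame with one column per question; the column names, data, and threshold are made-up examples:

```python
import pandas as pd

def completeness_report(responses: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Share of non-missing answers per question, flagging questions
    that fall below a target threshold while reminders can still help."""
    share = responses.notna().mean()
    return pd.DataFrame({
        "completeness": share,
        "needs_action": share < threshold,
    }).sort_values("completeness")

# Hypothetical example: two respondents skipped the income question.
df = pd.DataFrame({
    "age": [34, 51, 29, 43],
    "income": [None, 52000, None, 61000],
    "brand_awareness": [1, 0, 1, 1],
})
print(completeness_report(df))
```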
Representational Data Quality
Representational Data Quality refers to the way data is formatted (concisely and consistently) and the degree to which meaning can be derived from it (interpretability and ease of understanding). Simply imagine the data validation routines of an online survey. When asking for the respondents’ age, for example, you would make sure everyone (consistently) enters their age in whole years (concisely), or even picks from the age groups you’re particularly interested in (ease of understanding). In any case, the respondent is prevented from submitting erroneous or extreme values (interpretability).
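As a minimal sketch of such a validation routine (the age limits and error messages are illustrative assumptions, not a fixed standard):

```python
def validate_age(raw_value: str, min_age: int = 18, max_age: int = 99) -> int:
    """Validate an age entered in an online survey.

    Enforces whole years (consistent, concise representation) and a
    plausible range (interpretability). Limits are assumptions.
    """
    try:
        age = int(raw_value.strip())
    except ValueError:
        raise ValueError("Please enter your age in whole years, e.g. 42.")
    if not min_age <= age <= max_age:
        raise ValueError(f"Please enter an age between {min_age} and {max_age}.")
    return age

# The survey front end would run this before accepting the answer:
print(validate_age("42"))    # -> 42
# validate_age("forty-two")  # -> ValueError with a helpful message
```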
Accessibility Data Quality
The two dimensions within this category can be at odds with each other and therefore require a good balance. Accessibility is about how easily and effortlessly data can be retrieved, while Access Security is about how access can be limited and controlled. These aspects have received increasing attention in recent years – think of online dashboards or data warehouses.
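The trade-off can be illustrated with a toy sketch; the roles, resources, and permission mapping are made-up examples, not a recommendation for a real authorization scheme:

```python
# Made-up role-to-resource mapping, for illustration only.
PERMISSIONS = {
    "analyst": {"raw_data", "dashboard"},
    "client": {"dashboard"},
}

def fetch(resource: str, role: str) -> str:
    """Accessibility: one simple call retrieves the data.
    Access Security: the role check limits who sees what."""
    if resource not in PERMISSIONS.get(role, set()):
        raise PermissionError(f"Role '{role}' may not access '{resource}'.")
    return f"Contents of {resource}"

print(fetch("dashboard", "client"))  # allowed
# fetch("raw_data", "client")        # -> PermissionError
```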
Towards an Excellent Data Quality
As you can see, “Intrinsic Data Quality” mainly depends on selecting the right data source, “Contextual Data Quality” on planning the study thoroughly, “Representational Data Quality” on collecting the data in the right way, and “Accessibility Data Quality” on reporting the data correctly. Or, more generally: at each stage of the research process we have to deal with different tasks and challenges in order to achieve the best possible outcome.
In our last article, we discussed how different perspectives on data quality can sometimes compete. While it remains valid that the requirements of all stakeholders need to be addressed, it is possibly even more important that every link in the value chain contributes to the overall quality when the data is collected and processed. As research has become a complex process with divided responsibilities, we have to make sure that quality standards are met throughout the whole process.