Cheat sheet for boards and C-suite execs: What does ‘good data quality’ mean?
Elin Hauge. Photo: Kristoffer Sandven
We often hear that data quality is an essential success factor for AI and a long list of other digital technologies. But what does that mean? I'm pretty sure you are not alone in asking yourself this question while not wanting to ask it out loud, for fear of coming across as the most stupid person in the room. Don't worry, you are not. In fact, the question of data quality should be on the board's table more often.
The short answer: good data quality means that your data are suitable, representative, and trustworthy for their intended purposes.
Key elements are outlined below. This article was first published as part of a five-article series on Beyond the obvious, a collaborative project by Eirik Norman Hansen and me.
Accuracy
To what extent do your data accurately represent the real-world entities, events, and processes they describe? If you are a building authority and you want to use aerial town photos to train an AI model to automate the application procedure for garages, you need to make sure that the aerial photos include garages and that you are able to identify which of the buildings in the photos are actually garages and not, say, greenhouses.
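If your data team wants a concrete starting point, a minimal sketch in Python (using pandas; the building IDs and labels are made up for illustration) is to compare the labels your model or register assigns against a small, manually verified sample:

```python
import pandas as pd

# Hypothetical labels assigned from aerial photos, compared against
# a manually verified sample of the same buildings.
labels = pd.DataFrame({
    "building_id": [101, 102, 103, 104],
    "label": ["garage", "garage", "greenhouse", "garage"],
})
verified = pd.DataFrame({
    "building_id": [101, 102, 103, 104],
    "true_label": ["garage", "greenhouse", "greenhouse", "garage"],
})

# Join on the building and measure how often the assigned label
# matches the verified ground truth.
merged = labels.merge(verified, on="building_id")
accuracy = (merged["label"] == merged["true_label"]).mean()
print(f"Label accuracy on verified sample: {accuracy:.0%}")
```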
Completeness
Are all required data elements present? Are there missing values or incomplete records that could impact analysis or decision-making? One such example is the Dutch SyRI case, where the Tax and Customs Administration automated a fraud detection system for social benefits. The fundamental data quality problem was that their data mainly represented urban areas with a high fraction of non-western immigrants and low average income. Naturally, their fraud detection algorithm indicated that fraud would happen in low-income families with a non-western immigration background. Surprise! The Dutch government resigned over this scandal. Do not become the Dutch government.
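A first completeness check is often as simple as counting the gaps. The sketch below (Python with pandas, hypothetical column names) shows missing values per field, incomplete records, and whether the data actually cover the groups you will make decisions about:

```python
import pandas as pd

# Hypothetical application records with gaps.
records = pd.DataFrame({
    "applicant_id": [1, 2, 3, 4],
    "income": [32000, None, 41000, None],
    "household_size": [2, 3, None, 1],
    "region": ["urban", "urban", None, "rural"],
})

# Share of missing values per column.
print(records.isna().mean().sort_values(ascending=False))

# Records that cannot be used without follow-up or imputation.
incomplete = records[records.isna().any(axis=1)]
print(f"{len(incomplete)} of {len(records)} records are incomplete")

# Completeness is also about coverage: which groups are represented at all?
print(records["region"].value_counts(normalize=True, dropna=False))
```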
Consistency
Are your data uniform and coherent across systems, data sets, and time periods? Or perhaps different teams at the construction site have different procedures for how they report logistics and resource utilisation? If so, your data may not tell you the correct story to support strategic resource allocation decisions.
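A simple way to surface this is to put the reports side by side and look for conflicting spellings and units. A rough Python sketch, with made-up team reports and column names, might look like this:

```python
import pandas as pd

# Hypothetical logistics reports from two teams on the same site,
# using different conventions for the same materials.
team_a = pd.DataFrame({
    "material": ["concrete", "steel"],
    "quantity": [12.5, 3.0],
    "unit": ["m3", "tonne"],
})
team_b = pd.DataFrame({
    "material": ["Concrete", "steel"],
    "quantity": [12500, 3000],
    "unit": ["litre", "kg"],
})

combined = pd.concat(
    [team_a.assign(team="A"), team_b.assign(team="B")], ignore_index=True
)
combined["material_key"] = combined["material"].str.lower()

# More than one spelling or unit per material signals inconsistent reporting.
print(combined.groupby("material_key")[["material", "unit"]].nunique())
```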
Timeliness
Are your data up to date? Or are they sooooo-last-year? Keep in mind the potential consequences of automating decisions based on outdated data… even if regulators let it pass, your customers may very well shred you to pieces.
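One pragmatic check is to flag every record that is older than a freshness threshold your business can defend. A minimal sketch, assuming a hypothetical last_updated timestamp column:

```python
import pandas as pd

# Hypothetical customer records with a last-updated timestamp.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "last_updated": pd.to_datetime(["2024-11-02", "2022-03-15", "2023-07-30"]),
})

# Flag records older than the freshness threshold (here: 12 months).
threshold = pd.Timestamp.now() - pd.Timedelta(days=365)
stale = customers[customers["last_updated"] < threshold]
print(f"{len(stale)} of {len(customers)} records are older than 12 months")
```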
Validity
Are your data in a format that conforms to your business standard? Do you have the necessary consents from your customers or employees to use the data for the intended purposes? Keep in mind that using personal data without the necessary consent is a data privacy breach.
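Both format and consent can be checked automatically. The sketch below assumes a hypothetical four-digit postcode standard and an explicit marketing-consent flag; adjust both to your own business rules and legal bases:

```python
import pandas as pd

# Hypothetical customer records.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "postcode": ["0150", "015O", "1234", "12"],
    "marketing_consent": [True, False, True, None],
})

# Format check: flag postcodes that do not match the agreed standard.
invalid_format = ~customers["postcode"].str.fullmatch(r"\d{4}", na=False)
print(customers.loc[invalid_format, ["customer_id", "postcode"]])

# Consent check: only records with an explicit, recorded consent may be
# used for consent-based purposes such as marketing.
with_consent = customers[customers["marketing_consent"].eq(True)]
print(f"{len(with_consent)} of {len(customers)} records have recorded consent")
```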
Uniqueness
Have you filled your databases with duplicates, triplicates, multiplicates? If so, large volumes of data may be completely redundant. Redundant data is waste and a cyber security risk. Unique data, on the other hand, provide useful insights into the variety of entities, events, and processes of your business operations.
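Duplicates are usually easy to find once you agree on what identifies a unique record. A minimal sketch, using email address as a hypothetical identity key:

```python
import pandas as pd

# Hypothetical customer table with duplicate entries.
customers = pd.DataFrame({
    "email": ["anna@example.com", "anna@example.com", "bo@example.com"],
    "name": ["Anna Berg", "Anna Berg", "Bo Lund"],
})

# Rows that share the chosen identity key with another row.
dupes = customers[customers.duplicated(subset=["email"], keep=False)]
print(f"{len(dupes)} rows share an email with another row")

# One way to resolve: keep the first occurrence per key.
deduplicated = customers.drop_duplicates(subset=["email"], keep="first")
print(f"{len(deduplicated)} unique customers remain")
```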
Reliability
Are your data credible and not tampered with? If you automate processes based on your data, are you confident that the data tells you the right story?
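One common safeguard is to record a checksum when data is exported and verify it before the data is used, so silent changes are detected. The sketch below uses Python's standard hashlib; the file path is a placeholder:

```python
import hashlib

def file_checksum(path: str) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Recompute the checksum of a data export and compare it with the value
# recorded when the export was produced; a mismatch means the file has
# been altered since then.
print(file_checksum("exports/resource_report.csv"))
```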
Relevance
Are the data useful for the specific context of your organisation and business line? If not, you may not even be allowed to keep them.
To summarize, data quality may be compared to cooking: the data are your ingredients, and the dish is the purpose for which you process them. If you have the wrong ingredients in the wrong volumes, your tomatoes are rotten, and somebody threw peanuts into the butter, the end result is highly likely to be disastrous. If the customer has a peanut allergy and you are the owner of the restaurant, you may even end up with a lawsuit for not keeping your kitchen in order.
Do you need help? Feel free to reach out.