Decision-making and business strategies increasingly rely on applications that integrate heterogeneous data coming from internal and external sources. The machine learning (ML) and artificial intelligence (AI) algorithms used in these applications are only as good as the quality of the data used during their implementation.
High-quality models require high-quality data. However, calls for high quality around ML and AI often lack a specification of what high quality means. If the quality of the data coming from these sources is unknown, the decisions and strategies based on them may entail unpleasant surprises: incomplete and/or biased data can lead to erroneous outcomes that may even contravene regulations. Data quality is therefore becoming a cornerstone for the success of applications that use big and/or open data, for which sheer volume is often mistaken for high quality. However, it has proven difficult to arrive at a widely accepted definition of data quality. Various definitions can be found in the literature, ranging from lists of data quality dimensions (such as accuracy, completeness, timeliness, usability, relevance, and reliability) to more comprehensive definitions. Some dimensions, e.g., accuracy, may be classified as objective, while others, e.g., relevance, are subjective. The same holds for the comprehensive definitions of data quality.
For example, an objective comprehensive definition is that data should adequately represent (parts of) the real world, while a subjective definition is that data should be fit for use in an application. Although the definitions of data quality may look different, the comprehensive definitions subsume the core elements of the more specific ones.
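The distinction between objective and subjective dimensions can be made concrete: objective dimensions such as completeness and accuracy can be computed mechanically from the data, without reference to a particular use. The sketch below is a minimal, hypothetical illustration; the record fields, the plausible-age range, and the function names are assumptions chosen for the example and do not come from any specific data quality standard.

```python
# Hypothetical illustration of two objective data quality dimensions,
# measured over a tiny in-memory record set. All field names and
# thresholds are assumptions for the example.

records = [
    {"name": "Alice", "age": 34, "email": "alice@example.com"},
    {"name": "Bob", "age": None, "email": "bob@example.com"},
    {"name": "Carol", "age": -5, "email": None},  # -5: implausible age
]

def completeness(recs, fields):
    """Fraction of field values that are present (not None)."""
    total = len(recs) * len(fields)
    present = sum(1 for r in recs for f in fields if r.get(f) is not None)
    return present / total

def age_accuracy(recs):
    """Fraction of non-missing ages that fall in a plausible range."""
    ages = [r["age"] for r in recs if r["age"] is not None]
    valid = sum(1 for a in ages if 0 <= a <= 130)
    return valid / len(ages)

print(round(completeness(records, ["name", "age", "email"]), 2))  # 0.78
print(round(age_accuracy(records), 2))  # 0.5
```

A subjective dimension such as relevance, by contrast, cannot be scored this way: whether the `email` field matters depends entirely on the application consuming the data, which is exactly why "fitness for use" definitions resist a single mechanical metric.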