It’s widely known that the consequences of poor data quality are growing every day, so why is data quality still an issue? In this Q&A, Ataccama’s VP of platform strategy Marek Ovcacek discusses why enterprises haven’t made more progress on data quality and how a data quality fabric can help.
Upside: Enterprises know that poor data results in poor decisions, but even though data quality has become part of every enterprise’s data strategy, it remains elusive. Why haven’t enterprises made greater progress in upping the quality of their data?
Marek Ovcacek: Modern organizations’ data landscapes have become exceedingly complex. Between data creation and data consumption sits a web of different processes, transformations, and data pipelines. Data quality (DQ) needs to be tracked along that entire journey because DQ issues can arise at every point.
For example, there could be process issues as data moves through an organization, such as poor integration or technical accidents. Data could be old and outdated. Different data points could be mistaken for each other. In many cases, even tracing data lineage through the organization is a difficult task.
Given the sheer scale of the data quality task, it has to be solved through automation via metadata-driven or (ideally) AI-assisted approaches. This leads many organizations to dump the problem onto IT. Unfortunately, IT cannot solve the problem alone; data quality must be a business priority that IT and the business address collaboratively.
What best practices can you recommend an enterprise follow to improve its data quality?
The first step is to properly catalog your data and keep that metadata fresh. To do this, you need a process that automatically examines data at its source so you can better interpret and understand it. This includes using a data profiler that examines various statistics, identifies data domains and data patterns, and infers dependencies and relationships. Ultimately, it provides an overview of where information of interest is located and identifies inconsistencies.
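To make the profiling step concrete, here is a minimal Python sketch of the kind of pass a data profiler performs over a source. The orders.csv file and its columns are hypothetical, and a production catalog would run far richer checks (domain detection, relationship inference) than this.

```python
import re
import pandas as pd

# Hypothetical source file; a real profiler would scan each cataloged source.
df = pd.read_csv("orders.csv")

def profile_column(series: pd.Series) -> dict:
    """Collect basic statistics and a rough format pattern for one column."""
    non_null = series.dropna().astype(str)
    # Reduce each value to a pattern: digits -> 9, letters -> A, keep punctuation.
    patterns = non_null.map(
        lambda v: re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v))
    )
    return {
        "dtype": str(series.dtype),
        "null_ratio": round(series.isna().mean(), 3),
        "distinct": series.nunique(),
        "top_patterns": patterns.value_counts().head(3).to_dict(),
    }

# Profile every column and print an overview for the catalog.
profile = {col: profile_column(df[col]) for col in df.columns}
for col, stats in profile.items():
    print(col, stats)
```

Keeping a summary like this refreshed on a schedule is what keeps the catalog's metadata from going stale.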
The next step is to identify and monitor the quality of the data, not only on technical parameters but also using business rules. For example, you may want to validate the format of an address or credit card number field, but also check values against reference data, run checksums, and apply row-level and aggregation controls. This is a time-consuming process to set up and maintain, so the ideal tooling uses a combination of metadata-driven automation, AI automation (on the orchestration level), and self-learning anomaly detection (i.e., rule-less DQ).
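As a rough illustration of combining format rules, a checksum, and an aggregation control, the sketch below assumes hypothetical card_number, postal_code, and order_total columns; real DQ platforms generate and orchestrate such rules from metadata rather than hard-coding them.

```python
import re
import pandas as pd

def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum, commonly used to validate card numbers."""
    digits = [int(d) for d in re.sub(r"\D", "", number)]
    if not digits:
        return False
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

# Hypothetical rules: column name -> row-level check.
RULES = {
    "card_number": luhn_valid,
    "postal_code": lambda v: bool(re.fullmatch(r"\d{5}(-\d{4})?", str(v))),
}

def evaluate(df: pd.DataFrame) -> pd.DataFrame:
    """Return the pass rate per rule, plus a simple aggregation control."""
    results = {
        col: df[col].astype(str).map(check).mean()
        for col, check in RULES.items()
    }
    # Aggregation control: order totals should never be negative.
    results["order_total_non_negative"] = (df["order_total"] >= 0).mean()
    return pd.DataFrame(list(results.items()), columns=["rule", "pass_rate"])
```

Tracking these pass rates over time is also where anomaly detection can take over, flagging drops without anyone having written a rule for them.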
Finally, you need to automate data cleansing and transformation, which includes standardizing formats, breaking data down into separate attributes (such as transforming a full name into a first name and surname), enriching data with external sources, and removing duplicates. This process should occur any time data is consumed, whether for analysis, before the data preparation phase, or when loading to the target system. The tooling you are looking for here should also support automation and provide a wide variety of integration options.
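The following sketch shows, under assumed column names (full_name, email, updated_at), what automated standardization, attribute splitting, and deduplication might look like before data is handed to a consumer; enrichment from external sources is omitted.

```python
import pandas as pd

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize, split, and deduplicate customer records before consumption."""
    out = df.copy()
    # Standardize formats: trim whitespace and lowercase email addresses.
    out["email"] = out["email"].str.strip().str.lower()
    # Break a combined attribute apart: full name -> first name + surname.
    names = out["full_name"].str.strip().str.split(" ", n=1, expand=True)
    out["first_name"] = names[0]
    out["surname"] = names[1] if names.shape[1] > 1 else ""
    # Remove duplicates, keeping the most recently updated record per email.
    out = (
        out.sort_values("updated_at")
           .drop_duplicates(subset=["email"], keep="last")
    )
    return out
```

Wiring a routine like this into every consumption path, rather than running it as a one-off project, is what keeps the cleansed and source views of the data from drifting apart.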
Source: https://tdwi.org/articles/2022/03/17/diq-all-why-is-data-quality-so-elusive.aspx