This article explains how to quickly detect data quality issues and identify their causes using Python for ETL pipelines. It discusses strategies to minimize the time required to fix data quality problems.
As a quick refresher, the Data Dirtiness Score estimates the expected proportion of cells in a data set that contain errors. Here are the key hypotheses behind this metric:
Data errors are related to violated constraints.
If there are no expectations, there is no effect on the score.
Data problems can be pinpointed to specific cells.
Each data error is assigned a confidence score.
Every cell has an equal impact on the overall score.
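To make these hypotheses concrete, here is a minimal sketch of how such a score could be computed. It rests on assumptions that are not stated above: errors are given as (row, column, confidence) tuples, and multiple errors reported on the same cell are treated as independent, so the probability that a cell is dirty is one minus the product of (1 - confidence). The function name `data_dirtiness_score` and the sample data are hypothetical.

```python
from collections import defaultdict


def data_dirtiness_score(n_rows, n_cols, errors):
    """Estimate the expected proportion of cells that contain an error.

    errors: iterable of (row, column, confidence) tuples, one per violated
    constraint pinned to a specific cell. This input format is an assumption
    made for illustration.
    """
    total_cells = n_rows * n_cols
    if total_cells == 0:
        raise ValueError("The data set must contain at least one cell.")

    # Probability that each flagged cell is still clean, assuming that
    # errors reported on the same cell are independent of one another.
    clean_prob = defaultdict(lambda: 1.0)
    for row, col, confidence in errors:
        clean_prob[(row, col)] *= 1.0 - confidence

    # Cells without any reported error contribute nothing to the score.
    expected_dirty_cells = sum(1.0 - p for p in clean_prob.values())

    # Every cell carries the same weight in the overall score.
    return expected_dirty_cells / total_cells


# Hypothetical example: a 4 x 3 table with three detected errors,
# two of which point at the same cell.
errors = [
    (0, "age", 1.0),
    (2, "email", 0.5),
    (2, "email", 0.5),
]
score = data_dirtiness_score(n_rows=4, n_cols=3, errors=errors)
print(f"Data Dirtiness Score: {score:.3f}")
```

With no detected errors the function returns 0.0, which matches the hypothesis that a data set without expectations has no effect on the score.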
Flyte, which is commercially supported by Union.ai, is a Kubernetes-friendly, DAG-based data pipeline framework that can type-check data ingested as pandas DataFrames. Pandera complements it by providing additional statistical and validation checks on the data, allowing an organization to build out a data schema that embeds domain knowledge about acceptable data types and ranges.
When used together, these tools can validate data at runtime and raise alerts when validation fails. In machine learning, type safety is vitally important, if for no other reason than that it can save considerable time and resources.