SemanticScuttle - klotz.me » klotz: data quality

klotz: data quality

With its latest Phi-4 reasoning model, Microsoft reckons bigger isn’t always better

Microsoft's Phi-4-Reasoning-Vision-15B model challenges the trend of ever-larger AI models by demonstrating strong reasoning capabilities with a comparatively compact size. Trained on curated reasoning data, it aims to achieve performance without the massive compute costs associated with frontier models. The model supports multimodal tasks, combining text and image understanding, and offers flexible reasoning modes for different workloads. This research highlights the importance of data quality and training strategy, suggesting that smarter training techniques can be as impactful as simply increasing model size, particularly for AI agents and practical deployments.

2026-03-12 Tags: microsoft, phi-4, reasoning, multimodal, large language models, llm, open source, ai agents, data quality by klotz

10 Pandas One-Liners for Quick Data Quality Checks

These one-liners provide quick and effective ways to assess the quality and consistency of the data within a Pandas DataFrame.

| Code Snippet | Explanation |
| --- | --- |
| `df.isnull().sum()` | Counts the number of missing values per column. |
| `df.duplicated().sum()` | Counts the number of duplicate rows in the DataFrame. |
| `df.describe()` | Provides basic descriptive statistics of numerical columns. |
| `df.info()` | Displays a concise summary of the DataFrame, including each column's data type and non-null count. |
| `df.nunique()` | Counts the number of unique values per column. |
| `df.apply(lambda x: x.nunique() / x.count() * 100)` | Computes the percentage of unique values for each column. |
| `df.isin([value]).sum()` | Counts the occurrences of a specific value in each column. |
| `df.applymap(lambda x: isinstance(x, type_to_check)).sum()` | Counts the number of values of a specific type (e.g., int, str) per column. In pandas 2.1 and later, `applymap` is deprecated in favor of `df.map`. |
| `df.dtypes` | Lists the data type for each column in the DataFrame. |
| `df.sample(n)` | Returns a random sample of n rows from the DataFrame. |
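
A minimal sketch of a few of these one-liners on a toy DataFrame may help show what each returns. The column names and values below are invented for illustration and do not come from the bookmarked article.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame(
    {
        "id": [1, 2, 2, 4],
        "city": ["Oslo", "Lima", "Lima", None],
        "score": [0.5, 0.7, 0.7, None],
    }
)

# Missing values per column: one each in "city" and "score".
missing_per_column = df.isnull().sum()

# Fully duplicated rows: the third row repeats the second.
duplicate_rows = int(df.duplicated().sum())

# Distinct non-null values per column.
unique_per_column = df.nunique()

# Share of distinct values among the non-null values, in percent.
unique_pct = df.apply(lambda col: col.nunique() / col.count() * 100)

# Occurrences of one specific value, counted column by column.
lima_per_column = df.isin(["Lima"]).sum()

print(missing_per_column, duplicate_rows, unique_per_column,
      unique_pct, lima_per_column, sep="\n\n")
```

Note that `isin` expects a list-like argument, so a single value is wrapped in a list.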

2025-01-03 Tags: pandas, data quality, one-liners, data cleaning, python, data engineering by klotz

Efficient Testing of ETL Pipelines with Python

This article explains how to quickly detect data quality issues and identify their causes using Python for ETL pipelines. It discusses strategies to minimize the time required to fix data quality problems.

2024-10-07 Tags: etl, pipelines, data quality, python, tableau, data engineering, business intelligence by klotz

Automated detection of data quality issues

As a quick refresher, the Data Dirtiness Score estimates the expected proportion of cells in a data set that contain errors. Here are the key hypotheses behind this metric:

Data errors are related to violated constraints.
If there are no expectations, there is no effect on the score.
Data problems can be pinpointed to specific cells.
Each data error is assigned a confidence score.
Every cell has an equal impact on the overall score.
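
The hypotheses above can be turned into a rough calculation. The sketch below is one plausible reading, not the article's implementation: the function name `dirtiness_score`, the `(row, col, confidence)` issue format, and the assumption that several issues on the same cell are independent are all illustrative choices.

```python
from collections import defaultdict
from math import prod


def dirtiness_score(n_rows, n_cols, issues):
    """Expected proportion of cells that contain at least one error.

    `issues` is an iterable of (row, col, confidence) tuples with
    confidence in [0, 1]. Assumption: issues reported on the same
    cell are independent of each other.
    """
    if n_rows <= 0 or n_cols <= 0:
        raise ValueError("the data set must have at least one cell")

    # Group confidence scores by the cell they point at.
    by_cell = defaultdict(list)
    for row, col, confidence in issues:
        by_cell[(row, col)].append(confidence)

    # P(cell is wrong) = 1 - P(none of its reported issues is real).
    expected_bad_cells = sum(
        1 - prod(1 - c for c in confidences)
        for confidences in by_cell.values()
    )

    # Every cell weighs the same, so divide by the total cell count.
    return expected_bad_cells / (n_rows * n_cols)


# Hypothetical issues: two on cell (0, 1), one on cell (2, 0).
issues = [(0, 1, 0.9), (0, 1, 0.5), (2, 0, 0.4)]
score = dirtiness_score(n_rows=3, n_cols=2, issues=issues)
print(round(score, 3))  # approximately 0.225
```

With no issues reported the score is 0, which matches the hypothesis that a data set without expectations does not affect the score.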

2024-03-23 Tags: llm, data quality by klotz

Guide to Data Quality Management: Metrics, Process and Best Practices

2022-02-08 Tags: data quality by klotz

Machine Learning Data Gets Type Checking, Validation with Flyte, Pandera – The New Stack

Commercially supported by Union.ai, Flyte is a Kubernetes-friendly, DAG-based data pipelining framework that can type check data ingested as pandas DataFrames. Pandera builds on this by providing additional statistical and validation checks against the data, allowing an organization to build out a data schema that embeds some domain knowledge about acceptable data ranges and types.

When used together, these programs can validate data as correct, raising alerts at runtime when validation fails. In machine learning, type safety is vitally important, if for no other reason than that it can save considerable time and resources.
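
To make the idea of a schema that embeds acceptable types and ranges concrete, here is a hand-rolled sketch in plain pandas. It does not use the Pandera or Flyte APIs; the `SCHEMA` layout, the `validate` helper, and the column names are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Hypothetical schema: per column, a dtype check plus an accepted range.
SCHEMA = {
    "age": {"check": is_integer_dtype, "min": 0, "max": 120},
    "height_cm": {"check": is_float_dtype, "min": 30.0, "max": 250.0},
}


def validate(df, schema):
    """Return a list of violations; an empty list means the frame passed."""
    problems = []
    for column, rule in schema.items():
        if column not in df.columns:
            problems.append(f"{column}: missing column")
            continue
        if not rule["check"](df[column]):
            problems.append(f"{column}: unexpected dtype {df[column].dtype}")
            continue
        # Missing values fail `between`, so they count as out of range.
        out_of_range = ~df[column].between(rule["min"], rule["max"])
        if out_of_range.any():
            problems.append(
                f"{column}: {int(out_of_range.sum())} value(s) outside "
                f"[{rule['min']}, {rule['max']}]"
            )
    return problems


good = pd.DataFrame({"age": [31, 45], "height_cm": [172.0, 160.5]})
bad = pd.DataFrame({"age": [31, 450], "height_cm": ["tall", "short"]})

print(validate(good, SCHEMA))
print(validate(bad, SCHEMA))
```

A dedicated validation library would typically raise an error at the pipeline boundary instead of returning a list; returning the violations here just keeps the sketch easy to inspect.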

2021-09-29 Tags: machine learning, data quality, data provenance, flyte, pandera, pandas, python by klotz

koaning.io: Bad Labels

2021-09-06 Tags: machine learning, labels, data quality by klotz

Adding line numbers when parsing many CSV files with Spark - Stack Overflow

monotonically_increasing_id

2020-05-26 Tags: spark, python, scala, data quality, line numbers by klotz

Site Reliability Engineering Best Practices for Data Pipelines

2019-10-08 Tags: sre, data pipeline, data quality by klotz

“Automated Data Quality Testing at Scale using Apache Spark”

2019-06-30 Tags: spark, data quality, validation, amazon, deequ, foss, scala by klotz
