These one-liners provide quick and effective ways to assess the quality and consistency of the data within a Pandas DataFrame.
| Code Snippet | Explanation |
| --- | --- |
| `df.isnull().sum()` | Counts the number of missing values per column. |
| `df.duplicated().sum()` | Counts the number of duplicate rows in the DataFrame. |
| `df.describe()` | Provides basic descriptive statistics of numerical columns. |
| `df.info()` | Displays a concise summary of the DataFrame, including each column's data type and non-null count. |
| `df.nunique()` | Counts the number of unique values per column. |
| `df.apply(lambda x: x.nunique() / x.count() * 100)` | Computes, for each column, the number of unique values as a percentage of its non-null values. |
| `df.isin([value]).sum()` | Counts the occurrences of a specific value in each column. |
| `df.applymap(lambda x: isinstance(x, type_to_check)).sum()` | Counts the number of values of a specific type (e.g., int, str) per column. In pandas 2.1 and later, `applymap` is deprecated in favor of `df.map`. |
| `df.dtypes` | Lists the data type for each column in the DataFrame. |
| `df.sample(n)` | Returns a random sample of n rows from the DataFrame. |
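
To make the table concrete, here is a minimal sketch that runs several of these one-liners on a small DataFrame. The data and the variable names (`df`, `missing_per_column`, and so on) are hypothetical and chosen only for illustration. The type check uses `apply` with `Series.map` as one way to avoid the deprecated `applymap`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age and one fully duplicated row.
df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Bob", "Cara"],
        "age": [34.0, 41.0, 41.0, np.nan],
        "city": ["Oslo", "Rome", "Rome", "Oslo"],
    }
)

# Missing values per column.
missing_per_column = df.isnull().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(df.duplicated().sum())

# Unique values as a percentage of the non-null values in each column.
unique_pct = df.apply(lambda col: col.nunique() / col.count() * 100)

# Occurrences of one specific value in each column.
oslo_per_column = df.isin(["Oslo"]).sum()

# Element-wise type check, column by column (avoids the deprecated applymap).
str_per_column = df.apply(lambda col: col.map(lambda v: isinstance(v, str))).sum()

print(missing_per_column, duplicate_rows, unique_pct, sep="\n")
```

Because `nunique()` and `count()` both ignore missing values by default, the percentage for a column is computed over its non-null entries only.
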

Clean data is crucial for machine learning model accuracy and benchmarking. The article describes nine techniques for cleaning ML datasets so that model comparisons stay accurate and reproducible: managing data with DagsHub's Data Engine, handling missing data with KNN imputation and MissForest, detecting outliers with DBSCAN, fixing structural errors with OpenRefine, removing duplicates with Pandas, normalizing and standardizing data with scikit-learn, automating cleaning pipelines with Apache Airflow and Kubeflow, validating data integrity with Great Expectations, and addressing data drift with Deepchecks.
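
Three of the techniques named above (duplicate removal with Pandas, KNN imputation, and standardization with scikit-learn) can be combined in a few lines. The following is a sketch under stated assumptions, not the article's own code: the column names, the values, and `n_neighbors=2` are hypothetical choices for a toy dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical measurements: one missing height and one repeated row.
raw = pd.DataFrame(
    {
        "height_cm": [170.0, 180.0, np.nan, 160.0, 180.0],
        "weight_kg": [65.0, 80.0, 72.0, 55.0, 80.0],
    }
)

# 1. Remove exact duplicate rows.
deduped = raw.drop_duplicates().reset_index(drop=True)

# 2. Fill each missing value with the mean of its 2 nearest neighbors,
#    where distance is computed on the features that are present.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(deduped), columns=deduped.columns)

# 3. Standardize every column to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(imputed), columns=imputed.columns)

print(imputed)
print(scaled.round(3))
```

In a real benchmark, the imputer and scaler would normally be fitted on the training split only and then applied to the validation and test splits, so that no information leaks between them.
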
**Tools and Their Main Use**
| **Tool** | **Main Use** |
| --- | --- |
| 1. **DagsHub's Data Engine** | Data management and versioning for ML teams |
| 2. **KNN Imputation (scikit-learn)** | Handling missing data by imputing values based on nearest neighbors |
| 3. **MissForest (missingpy)** | Advanced imputation for missing values using Random Forests |
| 4. **DBSCAN (scikit-learn)** | Outlier detection and removal in high-dimensional datasets |
| 5. **OpenRefine** | Fixing structural errors and inconsistencies in datasets |
| 6. **Pandas** | Duplicate removal, plus data normalization and standardization (in combination with scikit-learn) |
| 7. **Apache Airflow** | Automating data cleaning pipelines and workflows |
| 8. **Kubeflow Pipelines** | Scalable and portable automation of end-to-end ML workflows |
| 9. **Great Expectations** | Data integrity validation and setting expectations for dataset quality |
| 10. **Deepchecks** | Monitoring and addressing data drift in machine learning models |
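
As an illustration of the DBSCAN entry in the table, the sketch below flags points that DBSCAN labels as noise (`-1`) and drops them. The data is a hypothetical, deterministic toy set, and `eps=0.75` and `min_samples=4` are assumed values tuned to it; real datasets need their own tuning and usually feature scaling first.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: a dense 5x5 grid of points (spacing 0.5) plus two far-away points.
xs, ys = np.meshgrid(np.linspace(0.0, 2.0, 5), np.linspace(0.0, 2.0, 5))
grid = np.column_stack([xs.ravel(), ys.ravel()])
far_points = np.array([[10.0, 10.0], [-8.0, 9.0]])
X = np.vstack([grid, far_points])

# Points with at least min_samples neighbors (including themselves) within eps
# form clusters; points that belong to no cluster get the label -1.
labels = DBSCAN(eps=0.75, min_samples=4).fit_predict(X)

# Treat DBSCAN's noise label as "outlier" and keep everything else.
is_outlier = labels == -1
X_clean = X[~is_outlier]

print("outliers found:", int(is_outlier.sum()))
print("rows kept:", X_clean.shape[0])
```

Treating noise points as outliers is a common heuristic rather than a guarantee: a too-small `eps` marks legitimate sparse regions as noise, and a too-large one absorbs real outliers into a cluster.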