Tags: pandas* + data engineering*


  1. * Method chaining improves readability and reduces noise by replacing intermediate variables with a single sequence of transformations.
    * The pipe() pattern allows you to integrate complex, custom functions into a chain while keeping code testable and self-documenting.
    * Use the validate parameter in merge() to raise an error when a join would inflate rows unexpectedly (for example, an accidental many-to-many join), and set indicator=True to see which side each row came from when debugging.
    * Optimize groupby operations by using transform() to add group statistics without extra merges, and pass observed=True when grouping by categorical columns to avoid unnecessary computations on empty categories.
    * Replace slow apply() calls with vectorized NumPy functions like np.where() or np.select() for much faster conditional logic.
    * Avoid performance pitfalls such as iterrows(), unoptimized object dtypes, and chained assignment by using built-in vectorized methods and .loc.
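
Several of the points above can be sketched in one short chain. This is only an illustration under stated assumptions: the `orders` and `customers` frames, their columns, and the `add_size_label` helper are hypothetical and do not come from the bookmarked article.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; none of these names come from the bookmarked article.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "customer_id": [10, 10, 20, 30],
        "amount": [50.0, 150.0, 20.0, 500.0],
    }
)
customers = pd.DataFrame({"customer_id": [10, 20], "region": ["north", "south"]})


def add_size_label(df: pd.DataFrame) -> pd.DataFrame:
    """Label each order with np.select instead of a row-wise apply()."""
    conditions = [df["amount"] < 100, df["amount"] < 400]
    labels = ["small", "medium"]
    return df.assign(size=np.select(conditions, labels, default="large"))


result = (
    orders
    # validate raises MergeError if customer_id is duplicated on the right side;
    # indicator=True adds a _merge column showing where each row matched.
    .merge(customers, on="customer_id", how="left",
           validate="many_to_one", indicator=True)
    # transform returns values aligned to the original rows, so no extra merge is needed.
    .assign(customer_total=lambda df: df.groupby("customer_id")["amount"].transform("sum"))
    # pipe slots a named, separately testable function into the chain.
    .pipe(add_size_label)
)

print(result)
```

Keeping `add_size_label` as a plain function means it can be unit-tested on its own, which is the main argument for `pipe()` over an inline lambda.
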
  2. These one-liners provide quick and effective ways to assess the quality and consistency of the data within a Pandas DataFrame.

    | Code Snippet | Explanation |
    | --- | --- |
    | `df.isnull().sum()` | Counts the number of missing values per column. |
    | `df.duplicated().sum()` | Counts the number of duplicate rows in the DataFrame. |
    | `df.describe()` | Provides basic descriptive statistics of numerical columns. |
    | `df.info()` | Displays a concise summary of the DataFrame, including each column's data type and non-null count. |
    | `df.nunique()` | Counts the number of unique values per column. |
    | `df.apply(lambda x: x.nunique() / x.count() * 100)` | Computes the percentage of unique values among the non-null entries of each column. |
    | `df.isin([value]).sum()` | Counts the occurrences of a specific value in each column. |
    | `df.applymap(lambda x: isinstance(x, type_to_check)).sum()` | Counts the number of values of a specific type (e.g., int, str) per column. Note that `applymap` is deprecated since pandas 2.1 in favor of `DataFrame.map`. |
    | `df.dtypes` | Lists the data type for each column in the DataFrame. |
    | `df.sample(n)` | Returns a random sample of n rows from the DataFrame. |
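
To make the table concrete, here is a small sketch that runs a few of these checks on a toy frame. The data and variable names are made up for illustration, and the type check uses `Series.map` per column as one way to avoid the deprecated `applymap`.

```python
import pandas as pd

# Hypothetical sample frame with one missing value and one duplicated row.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Oslo", None],
        "temp": [3, 21, 3, 15],
    }
)

missing_per_column = df.isnull().sum()   # missing values per column
duplicate_rows = df.duplicated().sum()   # rows identical to an earlier row
# Share of distinct values among the non-null entries, as a percentage.
unique_pct = df.apply(lambda col: col.nunique() / col.count() * 100)
# isin expects a list-like, so a single value is wrapped in a list.
oslo_counts = df.isin(["Oslo"]).sum()
# Per-column type check without applymap.
str_counts = df.apply(lambda col: col.map(lambda v: isinstance(v, str)).sum())

print(missing_per_column, duplicate_rows, unique_pct, oslo_counts, str_counts, sep="\n\n")
```
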
  3. An exploration of the benefits of switching from the popular Python library Pandas to the newer Polars for data manipulation tasks, highlighting improvements in performance, concurrency, and ease of use.
