SemanticScuttle - klotz.me

klotz: pandas*

Pandas is an open-source data analysis and manipulation library for Python, widely used in data science, data engineering, and machine learning. Built on top of NumPy, it provides two core data structures, the DataFrame and the Series, along with tools for data cleaning, transformation, and basic visualization. It is designed to work with in-memory data and is optimized for performance on datasets that fit in memory. Pandas also works with other libraries such as Matplotlib for data visualization.

How to Work With Polars LazyFrames

Learn how to create and use Polars LazyFrames for efficient data processing. Discover lazy evaluation, predicate and projection pushdown, and how to handle large datasets.

2025-02-28 Tags: polars, lazyframe, data science, pandas, spark by klotz

Speed up Pandas Code with NumPy

This article discusses how to improve the performance of Pandas operations by using vectorization with NumPy. It highlights alternatives to the apply() method on larger dataframes and provides examples of using NumPy's lesser-known methods like where and select to handle complex if/then/else conditions efficiently.
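
As a rough illustration of the idea described above (not the article's own code), the sketch below replaces a row-wise apply() with NumPy's where and select. The column name, thresholds, and labels are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({"score": [35, 58, 72, 91]})

# np.where: a vectorized two-way if/else over the whole column.
df["passed"] = np.where(df["score"] >= 60, "yes", "no")

# np.select: a vectorized if/elif/else. Conditions are checked in order,
# and the first one that matches decides the value for that row.
conditions = [df["score"] >= 90, df["score"] >= 60]
choices = ["high", "medium"]
df["band"] = np.select(conditions, choices, default="low")


def band_for(score):
    """Row-wise equivalent of the np.select call, shown for comparison."""
    if score >= 90:
        return "high"
    if score >= 60:
        return "medium"
    return "low"


# Same result via apply(), which calls the Python function once per row.
df["band_apply"] = df["score"].apply(band_for)
```

Both approaches give the same labels here; the vectorized versions avoid a Python-level function call per row, which tends to matter more as the DataFrame grows.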

2025-01-14 Tags: pandas, numpy, vectorization, dataframes, performance by klotz

Advanced Pandas Techniques for Data Processing and Performance

The article explores 11 essential tips for leveraging the full potential of the Pandas library to boost productivity and streamline workflows in handling and analyzing complex datasets. It uses a real-world dataset from Kaggle's Airbnb listings to illustrate techniques such as chunked processing and parallel execution.
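
A minimal sketch of chunked processing, one of the techniques mentioned above. It is not taken from the article: the CSV content and column names are made up, and an in-memory string stands in for a large file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; a real use case would pass a file path instead.
csv_text = "listing_id,price\n1,100\n2,250\n3,80\n4,120\n5,300\n"

total_price = 0
row_count = 0

# With chunksize set, read_csv returns an iterator of DataFrames,
# so only one chunk needs to be held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_price += chunk["price"].sum()
    row_count += len(chunk)

# Combine the per-chunk partial results into the final statistic.
mean_price = total_price / row_count
```

This pattern works for any statistic that can be built from per-chunk partial results, such as sums, counts, minima, and maxima.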

2025-01-10 Tags: pandas, performance, data science, pratheesh shivaprasad by klotz

10 Pandas One-Liners for Quick Data Quality Checks

These one-liners provide quick and effective ways to assess the quality and consistency of the data within a Pandas DataFrame.

Code Snippet	Explanation
`df.isnull().sum()`	Counts the number of missing values per column.
`df.duplicated().sum()`	Counts the number of duplicate rows in the DataFrame.
`df.describe()`	Provides basic descriptive statistics of numerical columns.
`df.info()`	Displays a concise summary of the DataFrame including data types and presence of null values.
`df.nunique()`	Counts the number of unique values per column.
`df.apply(lambda x: x.nunique() / x.count() * 100)`	Computes the percentage of unique values for each column.
`df.isin([value]).sum()`	Counts the occurrences of a specific value in each column.
`df.applymap(lambda x: isinstance(x, type_to_check)).sum()`	Counts the number of values of a specific type (e.g., int, str) per column. In pandas 2.1 and later, `applymap` is deprecated in favor of `df.map`.
`df.dtypes`	Lists the data type for each column in the DataFrame.
`df.sample(n)`	Returns a random sample of n rows from the DataFrame.
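
To show what a few of these one-liners return, here is a small sketch on a made-up DataFrame. The column names and values are assumptions chosen so that each check has something to find.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column and one duplicate row.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Oslo", "Rome", None],
        "price": [100.0, 100.0, np.nan, 80.0],
    }
)

# Missing values per column.
missing_per_column = df.isnull().sum()

# Rows that are exact copies of an earlier row.
duplicate_rows = df.duplicated().sum()

# Distinct non-null values per column.
unique_per_column = df.nunique()

# Distinct values as a percentage of the non-null values in each column.
unique_pct = df.apply(lambda col: col.nunique() / col.count() * 100)

# Occurrences of one specific value, counted per column.
oslo_count = df.isin(["Oslo"]).sum()
```

Each result except `duplicate_rows` is a Series indexed by column name, so the checks can be combined into a single summary table with `pd.concat` if needed.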

2025-01-03 Tags: pandas, data quality, one-liners, data cleaning, python, data engineering by klotz

Three Important Pandas Functions You Need to Know

Mastering specific Pandas functions can enhance data manipulation skills for data scientists using Python, focusing on less explored methods for data transformation and analysis.

2025-01-02 Tags: pandas, python, data science, apply, data pipeline by klotz

PyStore - Fast data store for Pandas timeseries data

PyStore is a simple (yet powerful) datastore for Pandas dataframes, designed with storing timeseries data in mind. It leverages Pandas, Numpy, Dask, and Parquet (via pyarrow) for efficient data handling.

2024-12-21 Tags: pystore, pandas, timeseries, datastore, dask, parquet, pyarrow, shrunk by klotz

Building a Knowledge Graph From Scratch Using LLMs

Turn your Pandas data frame into a knowledge graph using LLMs. Learn how to build your own LLM graph-builder, implement LLMGraphTransformer by LangChain, and perform QA on your knowledge graph.

2024-11-26 Tags: knowledge graph, llm, langchain, llmgraphtransformer, pandas, rag, data science by klotz

How to Reset a pandas DataFrame Index

Reset a pandas DataFrame index
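
A short sketch of the operation, using an invented DataFrame: after filtering, the index keeps its original labels, and `reset_index` renumbers it from zero.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "value": [1, 2, 3, 4]})

# Filtering keeps the original index labels (here 1 and 3).
filtered = df[df["value"] % 2 == 0]

# drop=True discards the old index and renumbers the rows 0, 1, ...
renumbered = filtered.reset_index(drop=True)

# Without drop=True, the old index is kept as a new column named "index".
kept = filtered.reset_index()
```

`reset_index` returns a new DataFrame by default, so `filtered` itself is left unchanged.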

2024-11-07 Tags: pandas, dataframe, index, python, data science by klotz

You Don’t Need Matplotlib When Pandas Is Enough for Data Visualisation

This article demonstrates how to use Pandas plotting capabilities for common data visualization tasks, suggesting that Pandas can be sufficient for routine EDA without relying on libraries like Matplotlib.

2024-07-22 Tags: pandas, data visualization, matplotlib, eda, python by klotz

How moving from Pandas to Polars made me write better code (without writing better code)

An exploration of the benefits of switching from the popular Python library Pandas to the newer Polars for data manipulation tasks, highlighting improvements in performance, concurrency, and ease of use.

2024-07-13 Tags: pandas, polars, data engineering, python, dataframe by klotz

