SemanticScuttle - klotz.me » Tags: data science

Tags: data science*

Advanced Pandas Patterns Most Data Scientists Don’t Use

Learn method chaining, pipe(), efficient joins, optimized groupby operations, and vectorized logic to write faster and cleaner pandas code.

* Method chaining improves readability and reduces noise by replacing intermediate variables with a single sequence of transformations.
* The pipe() pattern allows you to integrate complex, custom functions into a chain while keeping code testable and self-documenting.
* Use the validate parameter in merge() to prevent unexpected row inflation from many-to-many joins and use indicator=True for easier debugging.
* Optimize groupby operations by using transform() to add group statistics without extra merges and observed=True to avoid unnecessary computations on empty categories.
* Replace slow apply() calls with vectorized NumPy functions like np.where() or np.select() for much faster conditional logic.
* Avoid performance pitfalls such as iterrows(), unoptimized object dtypes, and chained assignment by using built-in vectorized methods and .loc.
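
A minimal sketch of three of the patterns listed above (a validated merge, `transform()` on a categorical group key, and `np.select()` instead of `apply()`). The data frames, column names, and thresholds are hypothetical and are not taken from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 10, 20, 30],
    "channel": pd.Categorical(
        ["web", "store", "web", "store"],
        categories=["web", "store", "phone"],  # "phone" never occurs
    ),
    "amount": [50.0, 150.0, 20.0, 500.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 20, 40],
    "segment": ["retail", "retail", "wholesale"],
})

# validate="many_to_one" raises MergeError if customer_id repeats in
# `customers`, so the join cannot silently multiply rows.
# indicator=True adds a `_merge` column showing where each row matched.
merged = orders.merge(
    customers, on="customer_id", how="left",
    validate="many_to_one", indicator=True,
)

# transform() returns one value per input row, so the group mean can be
# attached without a second merge. observed=True skips the unused category.
merged["channel_mean"] = (
    merged.groupby("channel", observed=True)["amount"].transform("mean")
)

# Vectorized conditional logic: the first matching condition wins.
merged["size_band"] = np.select(
    [merged["amount"] >= 200, merged["amount"] >= 100],
    ["large", "medium"],
    default="small",
)

print(merged[["order_id", "_merge", "channel_mean", "size_band"]])
```

Rows whose `_merge` value is `left_only` are orders with no matching customer, which is usually the first thing to check when a join result looks wrong.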

2026-04-22 Tags: python, pandas, performance, style, nate rosidi, data science, data engineering by klotz

Beyond Prompting: Using Agent Skills in Data Science

How to use AI skills—reusable packages of instructions and files—to automate repetitive data science workflows. By moving beyond simple prompting into structured skills, users can maintain shorter context windows while ensuring consistent, high-quality outputs for complex tasks like data visualization or metric investigation.

* A skill consists of a SKILL.md file with metadata and detailed instructions to guide an AI through specific recurring processes.
* Using skills helps keep the main LLM context lightweight by only loading detailed resources when they are relevant to the task.
* The author demonstrates this by automating a weekly visualization habit, reducing a one-hour manual process to less than ten minutes.
* Building effective skills requires iterative testing, incorporating personal domain knowledge, and researching external best practices.
* Combining skills with Model Context Protocol (MCP) allows AI to both follow specific procedural playbooks and access external data tools seamlessly.

2026-04-19 Tags: data science, claude, llm, skills, yu dong by klotz

Write Pandas Like a Pro With Method Chaining Pipelines

Master method chaining, assign(), and pipe() to write cleaner, testable, production-ready Pandas code
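
As a rough illustration of the style this entry describes, the sketch below chains `assign()` and `pipe()` in one expression. The helper functions and the sample data are hypothetical and are not taken from the linked article.

```python
import pandas as pd


def drop_negative_amounts(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step: keep rows with a non-negative amount."""
    return df[df["amount"] >= 0]


def add_share_of_total(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical step: add each row's share of the column total."""
    return df.assign(share=df[column] / df[column].sum())


raw = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "amount": [100.0, -5.0, 300.0, 600.0],
})

result = (
    raw
    .pipe(drop_negative_amounts)                     # custom function in the chain
    .assign(amount_k=lambda d: d["amount"] / 1000)   # derived column
    .pipe(add_share_of_total, column="amount")       # extra arguments pass through
    .sort_values("amount", ascending=False)
    .reset_index(drop=True)
)

print(result)
```

Because each step is a plain function that takes and returns a DataFrame, it can be unit-tested on its own, and `raw` is left unmodified by the chain.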

2026-04-13 Tags: pandas, pipeline pipe, splunk, data frames, python, data science by klotz

5 Useful Python Scripts for Effective Feature Selection

This article explores five Python scripts designed to streamline and automate the process of feature selection in machine learning projects. Feature selection is crucial for improving model performance, reducing complexity, and identifying the most impactful variables.
The scripts cover techniques like filtering constant features, eliminating redundant features through correlation analysis, identifying significant features using statistical tests, ranking features with model-based importance scores, and optimizing feature subsets with recursive elimination. Each script is practical, minimal, and provides detailed reports to aid in understanding the selection process.
These tools are valuable for data scientists looking to systematically evaluate feature importance and build more efficient and accurate models.
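
The first two techniques mentioned above (filtering constant features and removing highly correlated ones) can be sketched with pandas and NumPy alone. This is not the article's code: the function names, the 0.95 threshold, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def drop_constant_features(X: pd.DataFrame) -> pd.DataFrame:
    """Keep only columns that take more than one distinct value."""
    keep = [c for c in X.columns if X[c].nunique(dropna=False) > 1]
    return X[keep]


def drop_correlated_features(X: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop the later column of any pair whose |Pearson r| exceeds threshold."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return X.drop(columns=to_drop)


# Synthetic data (an assumption): one constant and one redundant column.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
features = pd.DataFrame({
    "a": a,
    "a_scaled": 2 * a + 1,        # perfectly correlated with "a"
    "b": rng.normal(size=200),    # independent of "a"
    "const": 1.0,                 # carries no information
})

selected = drop_correlated_features(drop_constant_features(features))
print(list(selected.columns))
```

Which column of a correlated pair gets dropped depends on column order here; a real script would more likely keep the one with the stronger relationship to the target.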

2026-03-31 Tags: python, feature selection, machine learning, data science, feature engineering, statistical analysis, model importance, recursive elimination by klotz

The Causal Inference Playbook: Advanced Methods Every Data Scientist Should Master

This article provides a comprehensive overview of advanced causal inference methods, moving beyond traditional statistical approaches. It emphasizes the importance of understanding causal relationships rather than just correlations for effective decision-making. The playbook covers techniques like instrumental variables, regression discontinuity, difference-in-differences, and causal discovery algorithms.
It discusses the assumptions required for each method and how to validate them. The author stresses the need for careful consideration of confounding variables and potential biases when attempting to establish causality. Ultimately, the article aims to equip data scientists with the tools and knowledge to draw more meaningful and actionable insights from data.
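
To make one of the listed methods concrete, here is a minimal difference-in-differences calculation on made-up numbers. It assumes the standard parallel-trends condition (the treated group would have changed as the control group did without treatment); the data and column names are hypothetical and not from the article.

```python
import pandas as pd

# Hypothetical outcomes for two groups, before and after an intervention.
df = pd.DataFrame({
    "group": ["treated"] * 4 + ["control"] * 4,
    "period": ["pre", "pre", "post", "post"] * 2,
    "outcome": [10.0, 12.0, 18.0, 20.0, 9.0, 11.0, 12.0, 14.0],
})

# Mean outcome per group and period, with periods as columns.
means = df.groupby(["group", "period"])["outcome"].mean().unstack("period")

# Change over time within each group.
change = means["post"] - means["pre"]

# The estimate is the treated group's change minus the control group's change.
did_estimate = float(change["treated"] - change["control"])
print(did_estimate)  # 5.0
```

The treated group rises by 8 and the control group by 3, so 5 is attributed to the intervention, but only if the parallel-trends assumption holds.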

2026-03-15 Tags: causal inference, instrumental variables, regression discontinuity, difference-in-differences, causal discovery, data science, statistics, machine learning, causality by klotz

CUDA 13.2 Introduces Enhanced CUDA Tile Support and New Python Features

CUDA 13.2 brings full support for CUDA Tile on Ampere, Ada, and Blackwell architectures, alongside enhancements to cuTile Python including recursive functions, closures, and custom reductions. Core updates include improved memory transfer APIs, reduced LMEM footprint in Windows, and a shift to MCDM for better compatibility. Math libraries gain experimental Grouped GEMM with MXFP8 and FP64-emulated cuSOLVERD. Developer tools see updates to Nsight Python, Nsight Compute, and Nsight Systems, alongside a modern C++ runtime in CCCL 3.2. CuPy also gains support for CUDA 13 and stream sharing.

2026-03-14 Tags: data science, nvidia, cuda, tile, cublas, cusolver, cuda tile by klotz

Publish your data, AI techniques, and agentic engineering work on Towards Data Science

The New Stack encourages its readers to contribute to Towards Data Science, a leading platform for data science and AI. Recognizing the increasing convergence of cloud infrastructure, DevOps, and AI engineering, the article invites practitioners to share their experiences with building and deploying AI systems. Successful TDS submissions are technically detailed, timely, and specific. Authors can also benefit from editorial support, promotion, and potential payment opportunities, while building their reputation within the AI community.

2026-03-12 Tags: ai, data science, machine learning, publishing, towards data science, agentic engineering, cloud infrastructure, devops, llm by klotz

Can LLM Embeddings Improve Time Series Forecasting? A Practical Feature Engineering Approach

This tutorial explores how to use LLM embeddings as features in time series forecasting models. It covers generating embeddings from time series descriptions, preparing data, and evaluating the performance of models with and without LLM embeddings.
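
The data-preparation step can be sketched as follows: lag features are built per series and concatenated with an embedding of that series' text description. The tutorial's actual model and embedding API are not shown in this summary, so `stand_in_embedding` below is a hypothetical placeholder that returns a deterministic pseudo-random vector, not a real LLM embedding.

```python
import hashlib

import numpy as np


def stand_in_embedding(text: str, dim: int = 4) -> np.ndarray:
    """Placeholder for a real embedding call (an assumption): maps text to a
    deterministic pseudo-random vector so the example runs offline."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).normal(size=dim)


def lag_matrix(values: np.ndarray, n_lags: int):
    """Return (X, y) where each row of X holds the n_lags previous values."""
    rows = [values[i - n_lags:i] for i in range(n_lags, len(values))]
    return np.array(rows), values[n_lags:]


# Hypothetical series, keyed by a free-text description.
series = {
    "daily store sales with weekly seasonality":
        5 + np.sin(np.arange(60) * 2 * np.pi / 7),
    "sensor temperature with slow upward drift":
        np.linspace(20.0, 25.0, 60),
}

X_parts, y_parts = [], []
for description, values in series.items():
    lags, target = lag_matrix(values, n_lags=3)
    # The same embedding is repeated for every row of the series.
    emb = np.tile(stand_in_embedding(description), (len(lags), 1))
    X_parts.append(np.hstack([lags, emb]))
    y_parts.append(target)

X_with_emb = np.vstack(X_parts)   # lag features + embedding features
X_baseline = X_with_emb[:, :3]    # lag features only, for comparison
y = np.concatenate(y_parts)

print(X_with_emb.shape, X_baseline.shape, y.shape)
```

Fitting the same model once on `X_baseline` and once on `X_with_emb` gives the with-and-without comparison the tutorial describes; with placeholder vectors the comparison itself is meaningless, so only the plumbing is shown.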

2026-02-28 Tags: time series, forecasting, llm, embeddings, feature engineering, machine learning, natural language processing, transformers, data science, production engineering by klotz

Learn Python and Build Autonomous Agents

This course takes you from Python fundamentals to AI Agent development, covering core Python, NumPy, Pandas, SQL, Flask, FastAPI, LLMs, and open-source models via HuggingFace.

2026-02-28 Tags: python, agents, llm, huggingface, fastapi, flask, numpy, pandas, sql, data science by klotz

Choosing Between PCA and t-SNE for Visualization

PCA and t-SNE are popular dimensionality reduction techniques used for data visualization. This tutorial compares PCA and t-SNE, highlighting their strengths and weaknesses, and provides guidance on when to use each method.

The article, from Machine Learning Mastery, discusses when to use Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction and data visualization. Key points:

* **PCA is a linear dimensionality reduction technique.** It aims to find the directions of greatest variance in the data and project the data onto those directions. It's good for preserving global structure but can distort local relationships. It's computationally efficient.
* **t-SNE is a non-linear dimensionality reduction technique.** It focuses on preserving the local structure of the data, meaning points that are close together in the high-dimensional space will likely be close together in the low-dimensional space. It excels at revealing clusters but can distort global distances and is computationally expensive.
* **Key Differences:**
  * **Linearity vs. Non-linearity:** PCA is linear, t-SNE is non-linear.
  * **Global vs. Local Structure:** PCA preserves global structure, t-SNE preserves local structure.
  * **Computational Cost:** PCA is faster, t-SNE is slower.
* **When to use which:**
  * **PCA:** Use when you need to reduce dimensionality for speed or memory efficiency, and preserving global structure is important. Good for data preprocessing before machine learning algorithms.
  * **t-SNE:** Use when you want to visualize high-dimensional data and reveal clusters, and you're less concerned about preserving global distances. Excellent for exploratory data analysis.
* **Important Considerations for t-SNE:**
  * **Perplexity:** A key parameter that controls the balance between local and global aspects of the embedding. Experiment with different values.
  * **Randomness:** t-SNE is a stochastic algorithm, so results can vary. Run it multiple times to ensure consistency.
  * **Interpretation:** Distances in the t-SNE plot should not be interpreted as true distances in the original high-dimensional space.

In essence, the article advises choosing PCA for preserving overall data structure and speed, and t-SNE for revealing clusters and local relationships, understanding its limitations regarding global distance interpretation.
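
A small side-by-side sketch using scikit-learn, assuming it is installed. The dataset (a 300-sample slice of the bundled digits data), the perplexity of 30, and the fixed seed are choices made for this illustration and are not taken from the article.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# 64-dimensional digit images; a small slice keeps t-SNE fast.
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

# PCA: linear projection onto the two directions of greatest variance.
pca = PCA(n_components=2, random_state=0)
X_pca = pca.fit_transform(X)
print("variance explained by 2 components:", pca.explained_variance_ratio_.sum())

# t-SNE: non-linear embedding that favours local neighbourhoods.
# random_state is fixed because the algorithm is stochastic.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_tsne = tsne.fit_transform(X)

print(X_pca.shape, X_tsne.shape)
```

Scatter-plotting `X_pca` and `X_tsne` coloured by `y` shows the trade-off described above. Re-running t-SNE with other `perplexity` and `random_state` values is a quick way to check whether the clusters are stable.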

2026-02-13 Tags: pca, t-sne, dimensionality reduction, visualization, machine learning, data science by klotz
