SemanticScuttle - klotz.me » klotz: clustering

klotz: clustering*

An interactive 3D map visualizing over 900 agent skills sourced from the awesome-agent-skills repository. The project embeds these skills in a latent space and lets users explore them as glowing points connected by a nearest-neighbor web, with options to color by topic cluster or authoring team.
Key features and technical details:
- Uses sentence-transformers/all-MiniLM-L6-v2 for embeddings.
- Employs UMAP for 3D dimensionality reduction.
- Utilizes KMeans clustering and Gemma 4 E2B for automated topic labeling.
- Interactive interface built with Three.js featuring search, tooltips, and info panels.
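
The nearest-neighbor web mentioned above could be computed roughly as follows. This is a minimal sketch, not the project's actual code: the function name `knn_edges`, the choice of cosine similarity, and `k=3` are assumptions, and random 384-dimensional vectors stand in for real all-MiniLM-L6-v2 embeddings.

```python
import numpy as np

def knn_edges(embeddings, k=3):
    """Link each point to its k most similar points by cosine similarity.

    Hypothetical helper for illustration; returns a list of (i, j) index pairs.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    # Exclude self-matches by making the diagonal the worst possible score.
    np.fill_diagonal(sim, -np.inf)
    # For each row, take the indices of the k highest similarities.
    neighbors = np.argsort(-sim, axis=1)[:, :k]
    return [(i, int(j)) for i in range(len(embeddings)) for j in neighbors[i]]

# Placeholder data (assumption): 20 random vectors instead of real skill embeddings.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(20, 384))
edges = knn_edges(vectors, k=3)
print(len(edges), "edges")
```

Each edge pair could then be drawn as a line segment between the corresponding 3D points after dimensionality reduction.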

2026-04-21 Tags: agent skills, latent space, 3d visualization, machine learning, umap, clustering, three.js, awesome-agent-skills, vibe coding by klotz

Pair Plot Scatter Matrix

This article explains Pair Plots (Scatter Matrices) in Python for exploratory data analysis, showing pairwise relationships between numerical variables using scatter plots and distribution plots.

The article provides the following Python code using `seaborn` and `matplotlib` to create a pair plot:

```python
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Create some random data
data = np.random.rand(100, 4)
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])

# Create the pair plot
sns.pairplot(df)

# Show the plot
plt.show()
```

2026-02-10 Tags: pair plot, scatter matrix, eda, data visualization, python, seaborn, matplotlib, data analysis, correlation, clustering, outliers by klotz

7 Advanced Feature Engineering Tricks Using LLM Embeddings

This article details seven feature engineering techniques that use LLM embeddings to improve machine learning model performance, going beyond simple similarity search. Topics include dimensionality reduction, semantic similarity, and clustering.

The seven techniques are:

1. **Embedding Arithmetic:** Performing mathematical operations (addition, subtraction) on embeddings to represent concepts like "positive sentiment - negative sentiment = overall sentiment".
2. **Embedding Clustering:** Using clustering algorithms (like k-means) on embeddings to create categorical features representing groups of similar text.
3. **Embedding Dimensionality Reduction:** Reducing the dimensionality of embeddings using techniques like PCA or UMAP to create more compact features while preserving important information.
4. **Embedding as Input to Tree-Based Models:** Directly using embedding vectors as features in tree-based models like Random Forests or Gradient Boosting. The article highlights the importance of careful handling of high-dimensional data.
5. **Embedding-Weighted Averaging:** Calculating weighted averages of embeddings based on relevance scores (e.g., TF-IDF) to create a single, representative embedding for a document.
6. **Embedding Difference:** Calculating the difference between embeddings to capture changes or relationships between texts (e.g., before/after edits, question/answer pairs).
7. **Embedding Concatenation:** Combining multiple embeddings (e.g., title and body of a document) to create a richer feature representation.
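
Three of the techniques listed above (difference, concatenation, and weighted averaging) reduce to a few array operations. The sketch below is illustrative only: the function names are hypothetical, the tiny 2-dimensional vectors are placeholders for real embeddings, and the weights stand in for relevance scores such as TF-IDF.

```python
import numpy as np

def difference_feature(before, after):
    """Embedding difference: captures the change between two texts."""
    return after - before

def concat_feature(title_vec, body_vec):
    """Embedding concatenation: joins two embeddings into one longer feature vector."""
    return np.concatenate([title_vec, body_vec])

def weighted_average(vectors, weights):
    """Embedding-weighted averaging: one vector per document, weighted by relevance."""
    weights = np.asarray(weights, dtype=float)
    return (vectors * weights[:, None]).sum(axis=0) / weights.sum()

# Placeholder token embeddings and made-up relevance weights (assumptions).
token_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
relevance = [0.5, 0.3, 0.2]
doc_vec = weighted_average(token_vecs, relevance)
print(doc_vec)
```

The resulting vectors can be passed to a downstream model like any other numeric features.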

2026-02-09 Tags: llm, embeddings, feature engineering, machine learning, semantic similarity, dimensionality reduction, clustering, pca, umap, t-sne by klotz

I Was Wrong: Start Simple, Then Move to More Complex

The author discusses a shift in approach to clustering mixed data, advocating for starting with the simpler Gower distance metric before resorting to more complex embedding techniques like UMAP. They introduce 'Gower Express', an optimized and accelerated implementation of Gower.
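
For orientation, plain unweighted Gower distance can be sketched in a few lines. This is not the Gower Express implementation, whose API is not described here. The function name, the sample records, and the convention of passing numeric column ranges as a dictionary are all assumptions made for illustration.

```python
def gower_distance(a, b, numeric_ranges):
    """Unweighted Gower distance between two mixed-type records.

    numeric_ranges maps a column index to (max - min) for that numeric column;
    every other column is treated as categorical.
    """
    total = 0.0
    for col, (x, y) in enumerate(zip(a, b)):
        if col in numeric_ranges:
            span = numeric_ranges[col]
            # Numeric columns: absolute difference scaled by the column range.
            total += abs(x - y) / span if span else 0.0
        else:
            # Categorical columns: 0 if equal, 1 otherwise.
            total += 0.0 if x == y else 1.0
    return total / len(a)

# Made-up records: (age, income, favorite color).
records = [
    (25, 40000.0, "red"),
    (35, 60000.0, "red"),
    (45, 80000.0, "blue"),
]
# Ranges for the two numeric columns, computed from the data itself.
ranges = {
    col: max(r[col] for r in records) - min(r[col] for r in records)
    for col in (0, 1)
}
print(gower_distance(records[0], records[1], ranges))
```

Because every per-column term lies between 0 and 1, the result is also between 0 and 1, which makes mixed numeric and categorical columns directly comparable.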

2025-09-05 Tags: clustering, data science, machine learning, gower distance, umap, gower express, mixed data, python, scikit-learn, data analysis, shrunk by klotz

Demo of DBSCAN clustering algorithm

This example demonstrates Density-Based Spatial Clustering of Applications with Noise (DBSCAN) using scikit-learn, showing how to generate synthetic clusters, compute DBSCAN clustering, and visualize the results, including core and non-core samples.
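
The steps described above can be sketched with scikit-learn as follows. The blob centers, `eps=0.3`, and `min_samples=10` are illustrative parameter choices rather than values stated in this summary, and plotting is omitted to keep the sketch short.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Generate synthetic clusters (illustrative centers and spread).
centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

# Compute DBSCAN; points labeled -1 are treated as noise.
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
labels = db.labels_

# Boolean mask separating core samples from non-core samples.
core_mask = np.zeros_like(labels, dtype=bool)
core_mask[db.core_sample_indices_] = True

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("clusters:", n_clusters, "noise points:", n_noise)
```

For visualization, core samples (`core_mask`) and non-core samples (`~core_mask`) can be drawn with different marker sizes, colored by `labels`.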

2025-04-18 Tags: dbscan, clustering, scikit-learn, machine learning, data mining, python, visualization by klotz

OpenAI Embeddings and Clustering for Survey Analysis — A How-To Guide

A guide on how to use OpenAI embeddings and clustering techniques to analyze survey data and extract meaningful topics and actionable insights from the responses.

The process involves transforming textual survey responses into embeddings, grouping similar responses through clustering, and then identifying key themes or topics to aid in business improvement.
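
The clustering and theme-identification steps could look like the sketch below. It makes several assumptions: the survey responses are invented, the 2-dimensional vectors are placeholders for embeddings that would normally come from an embedding model, and picking the response nearest each centroid is just one simple way to surface a theme.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented survey responses with placeholder "embeddings" (assumptions).
responses = [
    "Shipping was slow",
    "Delivery took too long",
    "Great support team",
    "Support was very helpful",
]
embeddings = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

# Group similar responses.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)

# For each cluster, pick the response closest to the centroid as a representative.
representatives = {}
for cluster_id, center in enumerate(km.cluster_centers_):
    members = np.where(km.labels_ == cluster_id)[0]
    dists = np.linalg.norm(embeddings[members] - center, axis=1)
    representatives[cluster_id] = responses[members[np.argmin(dists)]]

for cluster_id, text in representatives.items():
    print(cluster_id, text)
```

With real data, the representative responses (or a sample from each cluster) can be reviewed by hand or summarized to name each theme.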

2024-10-26 Tags: embedding, clustering, survey analysis, data science, visualization, k-means, tsne by klotz

ArXiv Data Map

A visual representation of papers on ArXiv using UMAP and nomic-embed.

2024-10-12 Tags: arxiv, umap, data, map, visualization, clustering by klotz

Working with Embeddings: Closed versus Open Source

An article discussing the use of embeddings in natural language processing, focusing on comparing open source and closed source embedding models for semantic search, including techniques like clustering and re-ranking.

2024-09-27 Tags: embeddings, natural language processing, semantic search, open source, closed source, retrieval applications, clustering, re-ranking, llm by klotz

ASCVIT V1: Automatic Statistical Calculation, Visualization, and Interpretation Tool

ASCVIT V1 aims to make data analysis easier by automating statistical calculations, visualizations, and interpretations.

It includes descriptive statistics, hypothesis tests, regression, time series analysis, clustering, and LLM-powered data interpretation.

- Data input: accepts CSV or Excel files and provides a data overview including summary statistics, variable types, and data points.
- Visualizations: histograms, boxplots, pairplots, correlation matrices.
- Hypothesis tests: t-tests, ANOVA, chi-square test.
- Regression: linear, logistic, and multivariate.
- Time series analysis.
- Clustering: k-means, hierarchical clustering, DBSCAN.

Integrates with an LLM (large language model) via Ollama for automated interpretation of statistical results.

2024-09-17 Tags: foss, ascvit, statistical analysis, data visualization, llm, python, streamlit, machine learning, statistics, regression, time series, clustering, eda by klotz

HDBSCAN: The Supercharged Version of DBSCAN — An Algorithmic Deep Dive

This article provides a beginner-friendly introduction to HDBSCAN, a powerful hierarchical clustering algorithm that extends the capabilities of DBSCAN by handling varying densities more effectively. It compares HDBSCAN to DBSCAN and KMeans, highlighting the advantages of HDBSCAN in handling clusters of different shapes and sizes.

2024-09-14 Tags: hdbscan, dbscan, clustering, machine learning, data science, hierarchical clustering, density-based clustering by klotz
