An interactive 3D map visualizing over 900 agent skills sourced from the awesome-agent-skills repository. The skills are embedded into a latent space and projected to 3D, where users can explore them as glowing points connected by a nearest-neighbor web, with options to color by topic cluster or authoring team.
Key features and technical details:
- Uses sentence-transformers/all-MiniLM-L6-v2 for embeddings.
- Employs UMAP for 3D dimensionality reduction.
- Utilizes KMeans clustering and Gemma 3n E2B for automated topic labeling.
- Interactive interface built with Three.js featuring search, tooltips, and info panels.
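The pipeline behind the map can be sketched in a few lines. This is a minimal illustration, not the project's actual code: random vectors stand in for the real all-MiniLM-L6-v2 embeddings, and scikit-learn's PCA stands in for UMAP (both expose the same `fit_transform` interface, so UMAP drops in directly if `umap-learn` is installed).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for real skill-description embeddings;
# all-MiniLM-L6-v2 outputs 384-dimensional vectors.
embeddings = rng.normal(size=(900, 384))

# Project to 3D for the map (swap in umap.UMAP(n_components=3) for the real thing)
coords_3d = PCA(n_components=3).fit_transform(embeddings)

# Cluster in the original embedding space to assign topic colors
clusters = KMeans(n_clusters=12, n_init=10, random_state=0).fit_predict(embeddings)

print(coords_3d.shape)  # (900, 3)
print(clusters.shape)   # (900,)
```

Each point's 3D coordinates drive its position in the Three.js scene, and its cluster id drives the topic color.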
The article explores how to leverage LLM embeddings for advanced feature engineering in machine learning, going beyond simple similarity searches. It details seven techniques:
1. **Embedding Arithmetic:** Performing mathematical operations (addition, subtraction) on embeddings to represent concepts like "positive sentiment - negative sentiment = overall sentiment".
2. **Embedding Clustering:** Using clustering algorithms (like k-means) on embeddings to create categorical features representing groups of similar text.
3. **Embedding Dimensionality Reduction:** Reducing the dimensionality of embeddings using techniques like PCA or UMAP to create more compact features while preserving important information.
4. **Embedding as Input to Tree-Based Models:** Directly using embedding vectors as features in tree-based models like Random Forests or Gradient Boosting. The article highlights the importance of careful handling of high-dimensional data.
5. **Embedding-Weighted Averaging:** Calculating weighted averages of embeddings based on relevance scores (e.g., TF-IDF) to create a single, representative embedding for a document.
6. **Embedding Difference:** Calculating the difference between embeddings to capture changes or relationships between texts (e.g., before/after edits, question/answer pairs).
7. **Embedding Concatenation:** Combining multiple embeddings (e.g., title and body of a document) to create a richer feature representation.
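Several of these techniques reduce to simple vector operations. The sketch below illustrates arithmetic (1), weighted averaging (5), difference (6), and concatenation (7) with toy 4-dimensional vectors; the vectors and weights are made up for illustration, and real embeddings would come from an embedding model.

```python
import numpy as np

# Toy 4-dim "embeddings" standing in for model output
title_emb = np.array([0.1, 0.8, 0.3, 0.5])
body_emb  = np.array([0.4, 0.2, 0.9, 0.1])
before    = np.array([0.2, 0.2, 0.2, 0.2])
after     = np.array([0.5, 0.1, 0.3, 0.2])

# 1. Embedding arithmetic: compose concepts by adding/subtracting vectors
combined = title_emb + body_emb

# 5. Weighted averaging: blend embeddings by relevance scores (e.g. TF-IDF)
weights = np.array([0.7, 0.3])
doc_emb = np.average(np.stack([title_emb, body_emb]), axis=0, weights=weights)

# 6. Embedding difference: captures the change between two texts
delta = after - before

# 7. Embedding concatenation: a richer joint feature vector
features = np.concatenate([title_emb, body_emb])  # shape (8,)

print(features.shape)  # (8,)
```

Clustering (2) and dimensionality reduction (3) then apply standard tools like `sklearn.cluster.KMeans` and `sklearn.decomposition.PCA` to these vectors, and the resulting features (4) feed directly into tree-based models.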
The author discusses a shift in approach to clustering mixed data, advocating for starting with the simpler Gower distance metric before resorting to more complex embedding techniques like UMAP. They introduce 'Gower Express', an optimized, accelerated implementation of Gower distance.
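Gower distance is simple enough to compute by hand: scale each numeric feature's difference by that feature's range, score each categorical feature as 0 (match) or 1 (mismatch), and average across features. A minimal sketch with made-up records (not the Gower Express API):

```python
import numpy as np

# Mixed-type records: (age, income, city)
data = [
    (34, 55_000, "NYC"),
    (29, 48_000, "SF"),
    (52, 90_000, "NYC"),
]

ages = np.array([r[0] for r in data], dtype=float)
incomes = np.array([r[1] for r in data], dtype=float)
cities = [r[2] for r in data]

age_range = ages.max() - ages.min()
income_range = incomes.max() - incomes.min()

def gower(i, j):
    # Numeric features: absolute difference scaled by the feature's range
    d_age = abs(ages[i] - ages[j]) / age_range
    d_income = abs(incomes[i] - incomes[j]) / income_range
    # Categorical features: 0 if equal, 1 otherwise
    d_city = 0.0 if cities[i] == cities[j] else 1.0
    # Unweighted mean over all features, so the result lies in [0, 1]
    return (d_age + d_income + d_city) / 3

print(round(gower(0, 1), 3))  # 0.461
```

Because every feature contributes a value in [0, 1], no embedding step is needed before feeding the distance matrix to a clustering algorithm that accepts precomputed distances.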
This page details the command-line utility for the Embedding Atlas, a tool for exploring large text datasets with metadata. It covers installation, data loading (local and Hugging Face), visualization of embeddings using SentenceTransformers and UMAP, and usage instructions with available options.
A visual representation of papers on ArXiv using UMAP and nomic-embed.
The article explains semantic text chunking, a technique for automatically grouping similar pieces of text to be used in pre-processing stages for Retrieval Augmented Generation (RAG) or similar applications. It uses visualizations to understand the chunking process and explores extensions involving clustering and LLM-powered labeling.
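The core idea of semantic chunking can be sketched as: embed each sentence, then start a new chunk whenever the similarity between consecutive sentences drops below a threshold. The toy 2-dimensional embeddings and the 0.5 threshold below are illustrative assumptions, not values from the article:

```python
import numpy as np

# Toy sentence embeddings; in practice these come from a sentence-transformer.
sent_embs = np.array([
    [1.0, 0.0], [0.9, 0.1],  # similar pair -> same chunk
    [0.0, 1.0], [0.1, 0.9],  # topic shift -> new chunk
])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunks(embs, threshold=0.5):
    # Break into a new chunk when similarity to the previous sentence drops
    chunks, current = [], [0]
    for i in range(1, len(embs)):
        if cosine(embs[i - 1], embs[i]) < threshold:
            chunks.append(current)
            current = []
        current.append(i)
    chunks.append(current)
    return chunks

print(semantic_chunks(sent_embs))  # [[0, 1], [2, 3]]
```

Each chunk (a list of sentence indices) then becomes one retrieval unit for the RAG index; the clustering and LLM-labeling extensions the article explores operate on these chunks rather than on raw sentences.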