An interactive 3D map visualizing over 900 agent skills sourced from the awesome-agent-skills repository. The skills are embedded into a latent space and projected to 3D, where users can explore them as glowing points connected by a nearest-neighbor web, with options to color by topic cluster or authoring team.
Key features and technical details:
- Uses sentence-transformers/all-MiniLM-L6-v2 for embeddings.
- Employs UMAP for 3D dimensionality reduction.
- Utilizes KMeans clustering and Gemma 3n E2B for automated topic labeling.
- Interactive interface built with Three.js featuring search, tooltips, and info panels.
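The pipeline behind the map can be sketched in a few lines. This is a minimal illustration, not the project's actual code: random vectors stand in for the real all-MiniLM-L6-v2 embeddings, and scikit-learn's PCA stands in for UMAP (both expose the same `fit_transform` interface, so UMAP drops in directly if `umap-learn` is installed).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for real skill-description embeddings;
# all-MiniLM-L6-v2 outputs 384-dimensional vectors.
embeddings = rng.normal(size=(900, 384))

# Project to 3D for the map (swap in umap.UMAP(n_components=3) for the real thing)
coords_3d = PCA(n_components=3).fit_transform(embeddings)

# Cluster in the original embedding space to assign topic colors
clusters = KMeans(n_clusters=12, n_init=10, random_state=0).fit_predict(embeddings)

print(coords_3d.shape)  # (900, 3)
print(clusters.shape)   # (900,)
```

Each point's 3D coordinates drive its position in the Three.js scene, and its cluster id drives the topic color.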
The article explores how to leverage LLM embeddings for advanced feature engineering in machine learning, going beyond simple similarity searches. It details seven techniques:
1. **Embedding Arithmetic:** Performing mathematical operations (addition, subtraction) on embeddings to represent concepts like "positive sentiment - negative sentiment = overall sentiment".
2. **Embedding Clustering:** Using clustering algorithms (like k-means) on embeddings to create categorical features representing groups of similar text.
3. **Embedding Dimensionality Reduction:** Reducing the dimensionality of embeddings using techniques like PCA or UMAP to create more compact features while preserving important information.
4. **Embedding as Input to Tree-Based Models:** Directly using embedding vectors as features in tree-based models like Random Forests or Gradient Boosting. The article highlights the importance of careful handling of high-dimensional data.
5. **Embedding-Weighted Averaging:** Calculating weighted averages of embeddings based on relevance scores (e.g., TF-IDF) to create a single, representative embedding for a document.
6. **Embedding Difference:** Calculating the difference between embeddings to capture changes or relationships between texts (e.g., before/after edits, question/answer pairs).
7. **Embedding Concatenation:** Combining multiple embeddings (e.g., title and body of a document) to create a richer feature representation.
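Several of these techniques reduce to simple vector operations. The sketch below illustrates arithmetic (1), weighted averaging (5), difference (6), and concatenation (7) with toy 4-dimensional vectors; the vectors and weights are made up for illustration, and real embeddings would come from an embedding model.

```python
import numpy as np

# Toy 4-dim "embeddings" standing in for model output
title_emb = np.array([0.1, 0.8, 0.3, 0.5])
body_emb  = np.array([0.4, 0.2, 0.9, 0.1])
before    = np.array([0.2, 0.2, 0.2, 0.2])
after     = np.array([0.5, 0.1, 0.3, 0.2])

# 1. Embedding arithmetic: compose concepts by adding/subtracting vectors
combined = title_emb + body_emb

# 5. Weighted averaging: blend embeddings by relevance scores (e.g. TF-IDF)
weights = np.array([0.7, 0.3])
doc_emb = np.average(np.stack([title_emb, body_emb]), axis=0, weights=weights)

# 6. Embedding difference: captures the change between two texts
delta = after - before

# 7. Embedding concatenation: a richer joint feature vector
features = np.concatenate([title_emb, body_emb])  # shape (8,)

print(features.shape)  # (8,)
```

Clustering (2) and dimensionality reduction (3) then apply standard tools like `sklearn.cluster.KMeans` and `sklearn.decomposition.PCA` to these vectors, and the resulting features (4) feed directly into tree-based models.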
The author discusses a shift in approach to clustering mixed data, advocating for starting with the simpler Gower distance metric before resorting to more complex embedding techniques like UMAP. They introduce 'Gower Express', an optimized, accelerated implementation of Gower distance.
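Gower distance is simple enough to compute by hand: scale each numeric feature's difference by that feature's range, score each categorical feature as 0 (match) or 1 (mismatch), and average across features. A minimal sketch with made-up records (not the Gower Express API):

```python
import numpy as np

# Mixed-type records: (age, income, city)
data = [
    (34, 55_000, "NYC"),
    (29, 48_000, "SF"),
    (52, 90_000, "NYC"),
]

ages = np.array([r[0] for r in data], dtype=float)
incomes = np.array([r[1] for r in data], dtype=float)
cities = [r[2] for r in data]

age_range = ages.max() - ages.min()
income_range = incomes.max() - incomes.min()

def gower(i, j):
    # Numeric features: absolute difference scaled by the feature's range
    d_age = abs(ages[i] - ages[j]) / age_range
    d_income = abs(incomes[i] - incomes[j]) / income_range
    # Categorical features: 0 if equal, 1 otherwise
    d_city = 0.0 if cities[i] == cities[j] else 1.0
    # Unweighted mean over all features, so the result lies in [0, 1]
    return (d_age + d_income + d_city) / 3

print(round(gower(0, 1), 3))  # 0.461
```

Because every feature contributes a value in [0, 1], no embedding step is needed before feeding the distance matrix to a clustering algorithm that accepts precomputed distances.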
This page details the command-line utility for the Embedding Atlas, a tool for exploring large text datasets with metadata. It covers installation, data loading (local and Hugging Face), visualization of embeddings using SentenceTransformers and UMAP, and usage instructions with available options.
A visual representation of papers on ArXiv using UMAP and nomic-embed.
The article explains semantic text chunking, a technique for automatically grouping similar pieces of text to be used in pre-processing stages for Retrieval Augmented Generation (RAG) or similar applications. It uses visualizations to understand the chunking process and explores extensions involving clustering and LLM-powered labeling.
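The core idea of semantic chunking can be sketched as: embed each sentence, then start a new chunk whenever the similarity between consecutive sentences drops below a threshold. The toy 2-dimensional embeddings and the 0.5 threshold below are illustrative assumptions, not values from the article:

```python
import numpy as np

# Toy sentence embeddings; in practice these come from a sentence-transformer.
sent_embs = np.array([
    [1.0, 0.0], [0.9, 0.1],  # similar pair -> same chunk
    [0.0, 1.0], [0.1, 0.9],  # topic shift -> new chunk
])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunks(embs, threshold=0.5):
    # Break into a new chunk when similarity to the previous sentence drops
    chunks, current = [], [0]
    for i in range(1, len(embs)):
        if cosine(embs[i - 1], embs[i]) < threshold:
            chunks.append(current)
            current = []
        current.append(i)
    chunks.append(current)
    return chunks

print(semantic_chunks(sent_embs))  # [[0, 1], [2, 3]]
```

Each chunk (a list of sentence indices) then becomes one retrieval unit for the RAG index; the clustering and LLM-labeling extensions the article explores operate on these chunks rather than on raw sentences.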