klotz: clustering*

  1. An interactive 3D map visualizing over 900 agent skills sourced from the awesome-agent-skills repository. The project projects these skills into a latent space, allowing users to explore them through glowing points and a nearest-neighbor web, with options to color by topic cluster or authoring team.
    Key features and technical details:
    - Uses sentence-transformers/all-MiniLM-L6-v2 for embeddings.
    - Employs UMAP for 3D dimensionality reduction.
    - Utilizes KMeans clustering and Gemma 4 E2B for automated topic labeling.
    - Interactive interface built with Three.js featuring search, tooltips, and info panels.
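
    A minimal sketch of how a nearest-neighbor web could be derived from embeddings, assuming cosine similarity as the closeness measure. Random vectors stand in for the all-MiniLM-L6-v2 embeddings (384 dimensions), and `knn_edges` and `k` are hypothetical names, not taken from the project.

    ```python
    import numpy as np

    def knn_edges(embeddings, k=3):
        """Link each point to its k most similar points by cosine similarity."""
        # Normalize rows so the dot product equals cosine similarity.
        unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        sim = unit @ unit.T
        np.fill_diagonal(sim, -np.inf)  # a point is never its own neighbor
        neighbors = np.argsort(-sim, axis=1)[:, :k]
        return [(i, int(j)) for i in range(len(embeddings)) for j in neighbors[i]]

    rng = np.random.default_rng(0)
    vectors = rng.normal(size=(20, 384))  # placeholder for real skill embeddings
    edges = knn_edges(vectors, k=3)
    print(len(edges), "edges")
    ```

    Each edge could then be drawn as a line between the two points' 3D coordinates.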
  2. This article explains Pair Plots (Scatter Matrices) in Python for exploratory data analysis, showing pairwise relationships between numerical variables using scatter plots and distribution plots.

    The article provides the following Python code using `seaborn` and `matplotlib` to create a pair plot:

    ```python
    import seaborn as sns
    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np

    # Create some random data
    data = np.random.rand(100, 4)
    df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])

    # Create the pair plot
    sns.pairplot(df)

    # Show the plot
    plt.show()
    ```
  3. This article details seven advanced feature engineering techniques using LLM embeddings to improve machine learning model performance. It covers techniques like dimensionality reduction, semantic similarity, clustering, and more.

    The seven techniques, which go beyond simple similarity searches, are:

    1. **Embedding Arithmetic:** Performing mathematical operations (addition, subtraction) on embeddings to represent concepts like "positive sentiment - negative sentiment = overall sentiment".
    2. **Embedding Clustering:** Using clustering algorithms (like k-means) on embeddings to create categorical features representing groups of similar text.
    3. **Embedding Dimensionality Reduction:** Reducing the dimensionality of embeddings using techniques like PCA or UMAP to create more compact features while preserving important information.
    4. **Embedding as Input to Tree-Based Models:** Directly using embedding vectors as features in tree-based models like Random Forests or Gradient Boosting. The article highlights the importance of careful handling of high-dimensional data.
    5. **Embedding-Weighted Averaging:** Calculating weighted averages of embeddings based on relevance scores (e.g., TF-IDF) to create a single, representative embedding for a document.
    6. **Embedding Difference:** Calculating the difference between embeddings to capture changes or relationships between texts (e.g., before/after edits, question/answer pairs).
    7. **Embedding Concatenation:** Combining multiple embeddings (e.g., title and body of a document) to create a richer feature representation.
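
    Several of the techniques above can be sketched with NumPy and scikit-learn. This is an illustration under stated assumptions, not the article's code: random vectors stand in for real LLM embeddings, random weights stand in for TF-IDF scores, and all variable names and sizes are hypothetical.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    # Placeholder embeddings: 50 documents, 32-dimensional title and body vectors.
    title_emb = rng.normal(size=(50, 32))
    body_emb = rng.normal(size=(50, 32))

    # Technique 2: cluster ids become a one-hot categorical feature.
    cluster_id = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(body_emb)
    cluster_onehot = np.eye(4)[cluster_id]

    # Technique 3: compress each embedding to 5 components with PCA.
    body_reduced = PCA(n_components=5, random_state=0).fit_transform(body_emb)

    # Technique 5: weighted average of token embeddings for one document.
    token_emb = rng.normal(size=(6, 32))
    weights = rng.random(6)  # stand-in for TF-IDF relevance scores
    doc_emb = np.average(token_emb, axis=0, weights=weights)

    # Technique 6: element-wise difference between paired texts.
    diff = body_emb - title_emb

    # Technique 7: concatenate title and body embeddings.
    concat = np.hstack([title_emb, body_emb])

    # Combine everything into one feature matrix for a downstream model.
    features = np.hstack([cluster_onehot, body_reduced, diff, concat])
    print(features.shape)
    ```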
  4. The author discusses a shift in approach to clustering mixed data, advocating for starting with the simpler Gower distance metric before resorting to more complex embedding techniques like UMAP. They introduce 'Gower Express', an optimized and accelerated implementation of Gower.
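
    For reference, a minimal sketch of the plain Gower distance, assuming equal feature weights and no missing values. It is not the Gower Express implementation; `gower_matrix` and the sample data are hypothetical.

    ```python
    import numpy as np

    def gower_matrix(numeric, categorical):
        """Pairwise Gower distances for rows with numeric and categorical columns."""
        numeric = np.asarray(numeric, dtype=float)
        categorical = np.asarray(categorical)
        n = numeric.shape[0]
        total = np.zeros((n, n))

        # Numeric columns: absolute difference scaled by the column's range.
        ranges = numeric.max(axis=0) - numeric.min(axis=0)
        ranges[ranges == 0] = 1.0  # constant columns contribute zero distance
        for col, col_range in zip(numeric.T, ranges):
            total += np.abs(col[:, None] - col[None, :]) / col_range

        # Categorical columns: 0 when the values match, 1 otherwise.
        for col in categorical.T:
            total += (col[:, None] != col[None, :]).astype(float)

        # Average the per-feature distances.
        return total / (numeric.shape[1] + categorical.shape[1])

    age_income = [[25, 30000], [35, 60000], [45, 90000]]
    city_plan = [["Paris", "basic"], ["Paris", "pro"], ["Lyon", "pro"]]
    distances = gower_matrix(age_income, city_plan)
    print(distances)
    ```

    The resulting matrix can be passed to any clustering algorithm that accepts precomputed distances.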
  5. This example demonstrates Density-Based Spatial Clustering of Applications with Noise (DBSCAN) using scikit-learn, showing how to generate synthetic clusters, compute DBSCAN clustering, and visualize the results, including core and non-core samples.
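
    The steps described can be sketched as follows. The parameter values (`eps`, `min_samples`, the blob centers) are illustrative assumptions, and plotting is omitted.

    ```python
    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_blobs
    from sklearn.preprocessing import StandardScaler

    # Generate three synthetic clusters and standardize the features.
    centers = [[1, 1], [-1, -1], [1, -1]]
    X, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)
    X = StandardScaler().fit_transform(X)

    # Compute DBSCAN; the label -1 marks noise points.
    db = DBSCAN(eps=0.3, min_samples=10).fit(X)
    labels = db.labels_

    # Boolean mask separating core samples from non-core samples.
    core_mask = np.zeros_like(labels, dtype=bool)
    core_mask[db.core_sample_indices_] = True

    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"clusters: {n_clusters}, noise: {n_noise}, core samples: {core_mask.sum()}")
    ```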
  6. A guide on how to use OpenAI embeddings and clustering techniques to analyze survey data and extract meaningful topics and actionable insights from the responses.

    The process involves transforming textual survey responses into embeddings, grouping similar responses through clustering, and then identifying key themes or topics to aid in business improvement.
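
    A rough offline sketch of the embed, cluster, and label steps. As a loudly stated assumption, TF-IDF vectors stand in for OpenAI embeddings so the example runs without an API key; the sample responses and the choice of two clusters are hypothetical.

    ```python
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    responses = [
        "Delivery was slow and the package arrived late",
        "Shipping took too long, late delivery again",
        "Late delivery, slow shipping",
        "Support staff were friendly and helpful",
        "Helpful and friendly customer support",
        "Friendly support team, very helpful",
    ]

    # Stand-in for embeddings: one TF-IDF vector per response.
    vectorizer = TfidfVectorizer(stop_words="english")
    vectors = vectorizer.fit_transform(responses)

    # Group similar responses.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

    # Describe each cluster by the terms with the highest centroid weight.
    terms = vectorizer.get_feature_names_out()
    topics = {}
    for cluster in range(2):
        top = kmeans.cluster_centers_[cluster].argsort()[::-1][:3]
        topics[cluster] = [str(terms[i]) for i in top]
    print(topics)
    ```

    With real embeddings, the cluster centroids have no direct term weights, so topic labels would need another step, such as summarizing sample responses from each cluster.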
  7. A visual representation of papers on ArXiv using UMAP and nomic-embed.
    2024-10-12, by klotz
  8. An article discussing the use of embeddings in natural language processing, focusing on comparing open source and closed source embedding models for semantic search, including techniques like clustering and re-ranking.
  9. ASCVIT V1 aims to make data analysis easier by automating statistical calculations, visualizations, and interpretations.

    Includes descriptive statistics, hypothesis tests, regression, time series analysis, clustering, and LLM-powered data interpretation.

    - Data input: accepts CSV or Excel files and provides a data overview including summary statistics, variable types, and data points.
    - Visualizations: histograms, boxplots, pairplots, correlation matrices.
    - Hypothesis tests: t-tests, ANOVA, chi-square test.
    - Regression: linear, logistic, and multivariate.
    - Time series analysis.
    - Clustering: k-means, hierarchical clustering, DBSCAN.

    Integrates with an LLM (large language model) via Ollama for automated interpretation of statistical results.
  10. This article provides a beginner-friendly introduction to HDBSCAN, a powerful hierarchical clustering algorithm that extends the capabilities of DBSCAN by handling varying densities more effectively. It compares HDBSCAN to DBSCAN and KMeans, highlighting the advantages of HDBSCAN in handling clusters of different shapes and sizes.
