klotz: data mining*

0 bookmark(s) - Sort by: Date ↓ / Title / - Bookmarks from other users for this tag

  1. This post introduces **GIST (Greedy Independent Set Thresholding)**, a new algorithm for selecting diverse and useful data subsets for machine learning. GIST tackles the NP-hard problem of balancing diversity (minimizing redundancy) and utility (relevance to the task) in large datasets.

    **Key points:**

    * **Approach:** GIST prioritizes minimum distance between selected data points (diversity) then uses a greedy algorithm to approximate the highest-utility subset within that constraint, testing various distance thresholds.
    * **Guarantee:** GIST is guaranteed to find a subset with at least half the value of the optimal solution.
    * **Performance:** Experiments demonstrate GIST outperforms existing methods (Random, Margin, k-center, Submod) in image classification and single-shot downsampling.
    * **Application:** Already used to improve video recommendation diversity at YouTube.

    **GIST provides a mathematically grounded and efficient solution for selecting high-quality data subsets for machine learning, crucial as datasets scale.**
    .
  2. This example demonstrates Density-Based Spatial Clustering of Applications with Noise (DBSCAN) using scikit-learn, showing how to generate synthetic clusters, compute DBSCAN clustering, and visualize the results, including core and non-core samples.

Top of the page

First / Previous / Next / Last / Page 1 of 0 SemanticScuttle - klotz.me: Tags: data mining

About - Propulsed by SemanticScuttle