- Measuring similarity between unstructured text data, such as movie descriptions, is challenging.
- Simple NLP methods may not yield meaningful results; thus, a controlled vocabulary is proposed.
- An LLM is used to generate a genre list for each movie title, which helps improve the similarity model.
- A function finds the movies most similar to a given title, ranked by cosine similarity score.
- A network visualization highlights clusters of genres linked through shared movies, suggesting potential improvements for recommender systems.
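
The similarity lookup summarized above can be sketched roughly as follows. This is a minimal illustration, not the original implementation: the `MOVIE_GENRES` data, the multi-hot encoding, and the names `genre_vector`, `cosine_similarity`, and `most_similar` are all assumptions made for the example. It assumes each title already has a genre list drawn from a controlled vocabulary (for instance, one produced by an LLM).

```python
import math

# Hypothetical toy data: each title maps to genres from a controlled vocabulary.
MOVIE_GENRES = {
    "Alien": ["horror", "sci-fi"],
    "Aliens": ["action", "horror", "sci-fi"],
    "Blade Runner": ["sci-fi", "thriller"],
    "Notting Hill": ["comedy", "romance"],
}


def genre_vector(genres, vocabulary):
    """Encode a genre list as a multi-hot vector over the vocabulary."""
    present = set(genres)
    return [1.0 if genre in present else 0.0 for genre in vocabulary]


def cosine_similarity(u, v):
    """Return the cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(title, movie_genres, top_n=3):
    """Rank all other titles by cosine similarity of their genre vectors."""
    # Build the vocabulary from every genre that appears in the data.
    vocabulary = sorted({g for genres in movie_genres.values() for g in genres})
    target = genre_vector(movie_genres[title], vocabulary)
    scores = [
        (other, cosine_similarity(target, genre_vector(genres, vocabulary)))
        for other, genres in movie_genres.items()
        if other != title
    ]
    # Highest similarity first; the sort is stable, so ties keep insertion order.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


if __name__ == "__main__":
    for other, score in most_similar("Alien", MOVIE_GENRES, top_n=2):
        print(f"{other}: {score:.3f}")
```

With this toy data, "Aliens" ranks above "Blade Runner" for the query "Alien" because it shares two genres rather than one. Restricting the vectors to a fixed vocabulary is what keeps the scores interpretable: two movies are similar only when they share named genres, not because their descriptions happen to use overlapping words.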

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a (prize-winning) technique for dimensionality reduction that is particularly well suited to visualizing high-dimensional datasets. The technique can be implemented via Barnes-Hut approximations, which allows it to be applied to large real-world datasets. We have applied it to datasets with up to 30 million examples. The technique and its variants are introduced in the following papers: