ow can you learn about the underlying structure of documents in a way that is informative and intuitive? This basic motivating question led me on a journey to visualize and cluster documents in a two-dimensional space. What you see above is an output of an analytical pipeline that begin by gathering synopses on the top 100 films of all time and ended by analyzing the latent topics within each document. In between I ran significant manipulations on these synopses (tokenization, stemming), transformed them into a vector space model (tf-idf), and clustered them into groups (k-means). You can learn all about how I did this with my detailed guide to Document Clustering with Python. But first, what did I learn?
tokenizing and stemming each synopsis
transforming the corpus into vector space using tf-idf
calculating cosine distance between each document as a measure of similarity
clustering the documents using the k-means algorithm
using multidimensional scaling to reduce dimensionality within the corpus
plotting the clustering output using matplotlib and mpld3
conducting a hierarchical clustering on the corpus using Ward clustering
plotting a Ward dendrogram
topic modeling using Latent Dirichlet Allocation (LDA)