tokenizing and stemming each synopsis
transforming the corpus into vector space using tf-idf
calculating cosine distance between each document as a measure of similarity
clustering the documents using the k-means algorithm
using multidimensional scaling to reduce dimensionality within the corpus
plotting the clustering output using matplotlib and mpld3
conducting a hierarchical clustering on the corpus using Ward clustering
plotting a Ward dendrogram
topic modeling using Latent Dirichlet Allocation (LDA)