tokenizing and stemming each synopsis
transforming the corpus into vector space using tf-idf
calculating cosine distance between each pair of documents as a measure of similarity
clustering the documents using the k-means algorithm
using multidimensional scaling to reduce dimensionality within the corpus
plotting the clustering output using matplotlib and mpld3
conducting a hierarchical clustering on the corpus using Ward clustering
plotting a Ward dendrogram
topic modeling using Latent Dirichlet Allocation (LDA)
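
The core of the steps above (tf-idf, cosine distance, k-means, Ward clustering) can be sketched with scikit-learn and SciPy. This is a minimal illustration under stated assumptions, not the original code: the four toy synopses and all variable names are made up, stemming is skipped, and the scaling, plotting and LDA steps are left out.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the real synopses.
synopses = [
    "the cat chased the mouse across the kitchen",
    "a cat and a mouse in the kitchen",
    "the bank raised interest rates on loans",
    "interest rates on bank loans rose again",
]

# Transform the corpus into tf-idf vector space (no stemming in this sketch).
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(synopses)

# Cosine distance = 1 - cosine similarity, for every pair of documents.
dist = 1.0 - cosine_similarity(tfidf)

# Cluster the tf-idf vectors with k-means; k=2 is an assumption for the toy data.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans_labels = km.fit_predict(tfidf)

# Ward hierarchical clustering on the dense tf-idf vectors,
# then cut the tree into two flat clusters.
Z = linkage(tfidf.toarray(), method="ward")
ward_labels = fcluster(Z, t=2, criterion="maxclust")

print(np.round(dist, 2))
print("k-means labels:", kmeans_labels)
print("Ward labels:", ward_labels)
```

The linkage matrix `Z` is also what `scipy.cluster.hierarchy.dendrogram` takes as input if you want to plot the Ward dendrogram.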
In your example, if you use PCA to initialize your t-SNE, you get widely spaced centroids; if you use random initialization, you'll get tiny centroids and an uninteresting picture.
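
If you want to check this on your own data, here is a rough sketch that runs scikit-learn's `TSNE` with both `init="pca"` and `init="random"` and compares how far apart the cluster centroids land in the embedding. The synthetic blobs, the perplexity value and the helper names `embed` and `centroid_spread` are assumptions for illustration; the numbers you get will depend on the data and the seed.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in data: 3 well-separated groups in 10 dimensions.
X, y = make_blobs(n_samples=90, centers=3, n_features=10, random_state=0)


def embed(X, init):
    """Run t-SNE to 2-D with the given initialization ("pca" or "random")."""
    tsne = TSNE(n_components=2, init=init, perplexity=15, random_state=0)
    return tsne.fit_transform(X)


def centroid_spread(embedding, labels):
    """Mean pairwise distance between per-group centroids in the embedding."""
    centroids = np.vstack(
        [embedding[labels == k].mean(axis=0) for k in np.unique(labels)]
    )
    return pdist(centroids).mean()


emb_pca = embed(X, "pca")
emb_random = embed(X, "random")

spread_pca = centroid_spread(emb_pca, y)
spread_random = centroid_spread(emb_random, y)

print("centroid spread, PCA init:   ", spread_pca)
print("centroid spread, random init:", spread_random)
```

Note that `init="pca"` needs dense input, so a sparse tf-idf matrix would have to be densified or reduced first.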