Isabel Segura-Bedmar, Víctor Suárez-Paniagua, Paloma Martínez
Computer Science Department
University Carlos III of Madrid, Spain
This paper describes a machine learning-based
approach that uses word embedding
features to recognize drug names from
biomedical texts. As a starting point,
we developed a baseline system based on
a Conditional Random Field (CRF) trained
with standard features used in current
Named Entity Recognition (NER) systems.
Then, the system was extended to
incorporate new features, such as word
vectors and word clusters generated by
the Word2Vec tool and a lexicon feature
from the DINTO ontology. We trained the
Word2Vec tool on two different corpora:
Wikipedia and MedLine. Our main goal
is to study the effectiveness of using word
embeddings as features to improve the
performance of our baseline system, as well as
to analyze whether the DINTO ontology
could be a valuable complementary data
source integrated in a machine learning
NER system. To evaluate our approach
and compare it with previous work, we
conducted a series of experiments on the
dataset of SemEval-2013 Task 9.1 Drug
Name Recognition.
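The feature design described above can be sketched as a per-token feature function. This is a minimal illustration, not the authors' actual feature set: the function name `token_features`, the feature names, and the toy vectors, clusters and lexicon are all hypothetical. The dictionary-of-features format is an assumption chosen because common CRF toolkits accept it.

```python
from typing import Dict, Mapping, Sequence, Set, Union

Feature = Union[str, bool, float]


def token_features(
    tokens: Sequence[str],
    i: int,
    vectors: Mapping[str, Sequence[float]],  # word -> embedding (e.g. from Word2Vec)
    clusters: Mapping[str, int],             # word -> cluster id (e.g. clustered embeddings)
    lexicon: Set[str],                       # lowercased drug names (e.g. from an ontology)
) -> Dict[str, Feature]:
    """Build a feature dict for tokens[i]; all feature names are illustrative."""
    word = tokens[i]
    lower = word.lower()
    feats: Dict[str, Feature] = {
        # Standard NER-style features: surface form, shape, and context.
        "word.lower": lower,
        "suffix3": lower[-3:],
        "is_title": word.istitle(),
        "has_digit": any(c.isdigit() for c in word),
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        # Lexicon feature: is the token a known drug name?
        "in_lexicon": lower in lexicon,
    }
    # Word-cluster feature, only when the word has a cluster assignment.
    if lower in clusters:
        feats["cluster"] = str(clusters[lower])
    # Word-vector features: one real-valued feature per embedding dimension.
    for d, value in enumerate(vectors.get(lower, ())):
        feats[f"emb_{d}"] = float(value)
    return feats


# Toy inputs (made up for illustration only).
tokens = ["Aspirin", "inhibits", "platelets"]
toy_vectors = {"aspirin": [0.1, -0.2]}
toy_clusters = {"aspirin": 7}
toy_lexicon = {"aspirin"}

first = token_features(tokens, 0, toy_vectors, toy_clusters, toy_lexicon)
second = token_features(tokens, 1, toy_vectors, toy_clusters, toy_lexicon)
```

Keeping the embedding, cluster and lexicon features as separate keys makes it easy to switch each group on or off when comparing against a baseline.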
I used a Python t-SNE library to reduce each word's 200-dimensional vector to 2 dimensions and plotted the result with matplotlib. I saved the x/y coordinates for each word in the book, so that I can show those words on the graph as you mouse over the replaced (blue) words.
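A rough sketch of that projection step follows. It assumes scikit-learn's `TSNE` as the t-SNE library, which the text does not name, and it uses random vectors in place of real word embeddings. The helper names, the perplexity value and the output path are hypothetical.

```python
import json

import numpy as np
from sklearn.manifold import TSNE


def project_to_2d(vectors: np.ndarray, perplexity: float = 10.0, seed: int = 0) -> np.ndarray:
    """Reduce (n_words, n_dims) vectors to (n_words, 2) with t-SNE."""
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(vectors)


def coords_by_word(words, coords) -> dict:
    """Map each word to its [x, y] position so a front end can look it up."""
    return {w: [float(x), float(y)] for w, (x, y) in zip(words, coords)}


def plot_words(coords: np.ndarray, path: str) -> None:
    """Scatter-plot the 2-D points and write the figure to a file."""
    # Imported here so the projection step works without a display or matplotlib.
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(coords[:, 0], coords[:, 1], s=8)
    fig.savefig(path)
    plt.close(fig)


# Stand-in data: 50 random 200-dimensional "word vectors" (not real embeddings).
rng = np.random.default_rng(0)
words = [f"word{i}" for i in range(50)]
vectors = rng.normal(size=(50, 200))

coords = project_to_2d(vectors)
lookup = coords_by_word(words, coords)
lookup_json = json.dumps(lookup)  # could be written to disk for the mouse-over lookup
```

Note that the perplexity must be smaller than the number of points, so small vocabularies need a value below scikit-learn's default of 30.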
If you saved your model with save(), you must load it with load().
load_word2vec_format() is for models stored in the original word2vec format (such as the model generated by Google), not for models saved with gensim's save().