Word embeddings are suitable for use with neural network language models (as will be discussed later); they can also be used to enhance conventional (MEMM, CRF) models. The best ways to incorporate embeddings into such feature-based models are still being explored. The simplest approach uses the vector components directly as features (Turian et al., Word Representations: A Simple and General Method for Semi-Supervised Learning, ACL 2010; Nguyen and Grishman, ACL 2014). Less direct approaches include building clusters from the embeddings and using the clusters as features, or selecting prototypical examples of each type and using embedding similarity to these prototypes as features. Early results on NE tagging indicate a small advantage for the indirect methods (Guo et al., Revisiting Embedding Features for Simple Semi-Supervised Learning, EMNLP 2014). Models based on word embeddings are producing the best performance on named entity recognition (Passos et al., Lexicon Infused Phrase Embeddings for Named Entity Resolution, CoNLL 2014) and are effective for chunking (Turian et al., ACL 2010).
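The direct and prototype-based approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-dimensional vectors, the prototype words, and the function names `direct_features` and `prototype_features` are all made up for the example and are not taken from the cited papers, which use trained embeddings and their own prototype selection procedures.

```python
import math

# Toy 3-dimensional embeddings (assumed values); real vectors would come
# from a trained embedding model.
EMBEDDINGS = {
    "london": [0.9, 0.1, 0.0],
    "paris": [0.8, 0.2, 0.1],
    "runs": [0.0, 0.1, 0.9],
}

# Hypothetical prototypes: one representative word per label.
PROTOTYPES = {"LOC": "london", "O": "runs"}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def direct_features(word):
    """Direct approach: one real-valued feature per embedding dimension."""
    vec = EMBEDDINGS.get(word.lower())
    if vec is None:
        # Out-of-vocabulary words get a single indicator feature instead.
        return {"emb_oov": 1.0}
    return {f"emb_{i}": x for i, x in enumerate(vec)}


def prototype_features(word):
    """Indirect approach: similarity of the word to each label prototype."""
    vec = EMBEDDINGS.get(word.lower())
    if vec is None:
        return {"proto_oov": 1.0}
    return {
        f"proto_sim_{label}": cosine(vec, EMBEDDINGS[proto])
        for label, proto in PROTOTYPES.items()
    }
```

In this toy setup, `prototype_features("Paris")` gives a higher similarity to the `LOC` prototype than to the `O` prototype, which is the signal a feature-based tagger would be expected to pick up. Either feature dictionary can be merged with the tagger's conventional features for the same token.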
Isabel Segura-Bedmar, Víctor Suárez-Paniagua, Paloma Martínez
Computer Science Department
University Carlos III of Madrid, Spain
This paper describes a machine learning-based approach that uses word embedding features to recognize drug names in biomedical texts. As a starting point, we developed a baseline system based on a Conditional Random Field (CRF) trained with standard features used in current Named Entity Recognition (NER) systems. Then, the system was extended to incorporate new features, such as word vectors and word clusters generated by the Word2Vec tool and a lexicon feature from the DINTO ontology. We trained the Word2Vec tool on two different corpora: Wikipedia and MedLine. Our main goal is to study the effectiveness of using word embeddings as features to improve the performance of our baseline system, as well as to analyze whether the DINTO ontology could be a valuable complementary data source integrated in a machine learning NER system. To evaluate our approach and compare it with previous work, we conducted a series of experiments on the dataset of SemEval-2013 Task 9.1 Drug Name Recognition.
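As a rough illustration of how such a feature set might be assembled for a CRF, the sketch below builds one feature dictionary per token, combining a few baseline features with word-vector components, a word-cluster identifier, and a lexicon-membership flag. Everything specific here is an assumption made for the example: the toy vectors, the cluster IDs, the two-entry lexicon standing in for ontology-derived terms, and the feature names are hypothetical and are not the feature set described in the paper.

```python
# Toy resources; all values are illustrative placeholders, not data from the paper.
WORD_VECTORS = {"aspirin": [0.7, 0.2], "ibuprofen": [0.6, 0.3], "with": [0.0, 0.9]}
WORD_CLUSTERS = {"aspirin": 12, "ibuprofen": 12, "with": 3}
DRUG_LEXICON = {"aspirin", "ibuprofen"}  # stand-in for ontology-derived terms


def token_features(tokens, i):
    """Build the feature dictionary for the token at position i."""
    word = tokens[i]
    lower = word.lower()

    # Baseline features of the kind commonly used in NER systems (assumed set).
    feats = {
        "bias": 1.0,
        "word.lower": lower,
        "suffix3": lower[-3:],
        "is_title": word.istitle(),
        "has_digit": any(c.isdigit() for c in word),
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

    # Word-vector features: one real-valued feature per dimension.
    vec = WORD_VECTORS.get(lower)
    if vec is not None:
        for d, x in enumerate(vec):
            feats[f"w2v_{d}"] = x

    # Word-cluster feature: the cluster ID as a categorical (string) value.
    cluster = WORD_CLUSTERS.get(lower)
    if cluster is not None:
        feats["cluster"] = str(cluster)

    # Lexicon feature: whether the token appears in the drug-name list.
    feats["in_lexicon"] = lower in DRUG_LEXICON
    return feats


def sentence_features(tokens):
    """Feature dictionaries for every token in a sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

Calling `sentence_features(["Aspirin", "with", "codeine"])` yields three dictionaries; in this toy setup only the first token receives vector, cluster, and positive lexicon features, while the unseen word "codeine" falls back to the baseline features alone. A list of dictionaries per sentence is a format that several CRF toolkits accept, though the exact input format depends on the toolkit used.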