In this article we will look at using pre-trained word vector embeddings for sequence classification with an LSTM network.
Using Pre-Trained Word Vector Embeddings
In the article NLP spaCy Word and document vectors we saw how to obtain the word vector representations, trained on the Common Crawl corpus, provided by the spaCy toolkit. We will use these pre-trained word vectors rather than training our own embedding layer.
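As a quick refresher, this is roughly how per-token vectors can be looked up with spaCy (the model name and vector size depend on the spaCy version and model installed; 'en' is assumed here):
import spacy

# older spaCy releases use 'en'; newer ones use e.g. 'en_core_web_md'
nlp = spacy.load('en')
doc = nlp(u"the movie was surprisingly good")
for token in doc:
    print(token.text, token.vector.shape)   # one pre-trained vector per token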
We cannot use the dataset provided by Keras directly, so we download the raw IMDB dataset:
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
We will use the following script to extract the data
#this function will convert text to lowercase and will disconnect punctuation and special symbols from words
function normalize_text {
awk '{print tolower($0);}' < $1 | sed -e 's/\./ \. /g' -e 's/<br \/>/ /g' -e 's/"/ " /g' \
-e 's/,/ , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' -e 's/\?/ \? /g' \
-e 's/\;/ \; /g' -e 's/\:/ \: /g' > $1-norm
}
cd ..
mkdir nbsvm_run; cd nbsvm_run
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar -xvf aclImdb_v1.tar.gz
rm aclImdb_v1.tar.gz
for j in train/pos train/neg test/pos test/neg; do
  for i in `ls aclImdb/$j`; do cat aclImdb/$j/$i >> temp; awk 'BEGIN{print;}' >> temp; done
  normalize_text temp
  mv temp-norm aclImdb/$j/norm.txt
  rm temp
done
mkdir data
mv aclImdb/train/pos/norm.txt data/train-pos.txt
mv aclImdb/train/neg/norm.txt data/train-neg.txt
mv aclImdb/test/pos/norm.txt data/test-pos.txt
mv aclImdb/test/neg/norm.txt data/test-neg.txt
The above script stores the positive training examples in data/train-pos.txt and the negative training examples in data/train-neg.txt (and similarly for the test set in data/test-pos.txt and data/test-neg.txt).
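As a quick sanity check (purely illustrative), we can print the first normalized positive review back out:
# read back the first line of the normalized positive training file
with open("data/train-pos.txt") as f:
    print(f.readline())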
We will use spaCy to obtain the word vector embeddings.
import os
import numpy as np
import spacy
import keras

def pad_vec_sequences(sequences, maxlen=40):
    """Pad (or truncate) each sequence of word vectors to exactly maxlen vectors."""
    new_sequences = []
    for sequence in sequences:
        orig_len, vec_len = np.shape(sequence)
        new = np.zeros((maxlen, vec_len))
        if orig_len < maxlen:
            # shorter reviews are zero-padded at the front
            new[maxlen - orig_len:, :] = sequence
        else:
            # longer reviews are truncated to the first maxlen word vectors
            new[:, :] = sequence[:maxlen]
        new_sequences.append(new)
    return np.array(new_sequences)
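For illustration, the helper can be exercised on a couple of toy sequences (the vector size of 300 is only an example):
short_doc = np.random.rand(5, 300)    # a review with 5 token vectors
long_doc = np.random.rand(60, 300)    # a review with 60 token vectors
padded = pad_vec_sequences([short_doc, long_doc], maxlen=40)
print(padded.shape)   # (2, 40, 300): short reviews are zero-padded at the front, long ones truncated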
def test():
    nlp1 = spacy.load('en')
    train_data = []
    Xtrain = []
    labels = []

    # positive training reviews, one per line
    f = "data/train-pos.txt"
    for sentence in open(f):
        row = {}
        t = []
        try:
            # strip non-ASCII characters before passing the text to spaCy
            sentence1 = ''.join([i if ord(i) < 128 else ' ' for i in sentence])
            doc = nlp1(unicode(sentence1))
            # collect the word vector of every token in the review
            for token in doc:
                t.append(token.vector)
            row['text'] = sentence
            row['feature'] = t
            row['label'] = 1
            Xtrain.append(t)
            labels.append(1)
            train_data.append(row)
        except:
            print "skip data row", sentence

    # negative training reviews, one per line
    f = "data/train-neg.txt"
    for sentence in open(f):
        row = {}
        t = []
        try:
            sentence1 = ''.join([i if ord(i) < 128 else ' ' for i in sentence])
            doc = nlp1(unicode(sentence1))
            for token in doc:
                t.append(token.vector)
            row['text'] = sentence
            row['feature'] = t
            row['label'] = 0
            Xtrain.append(t)
            labels.append(0)
            train_data.append(row)
        except:
            print "skip data row", sentence

    print("completed processing the data")
    max_words = 50
    num_classes = np.max(labels) + 1
    # pad/truncate every review to a fixed number of word vectors
    X_train = pad_vec_sequences(Xtrain, maxlen=max_words)
    y_train = keras.utils.to_categorical(labels, num_classes)
    print("Training data: ")
    print(X_train.shape, y_train.shape)
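A minimal sketch of an LSTM classifier that consumes these padded word-vector sequences directly (layer sizes are illustrative; it assumes the X_train, y_train and num_classes built above are in scope):
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
# input: max_words word vectors per review
model.add(LSTM(128, input_shape=X_train.shape[1:]))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=100, verbose=1)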
However, directly feeding the pre-trained word vectors to the network often does not lead to good results, since the embeddings were trained over the Common Crawl corpus, which may not be representative of the current dataset.
Another approach is to use pre-trained word vector embeddings as the initialization of the embedding layer. Here we use the GloVe word vector embeddings to initialize the embedding layer of the LSTM network.
You can download the pre-trained word vectors from the GloVe: Global Vectors for Word Representation page. We will use the vectors trained over the Wikipedia 2014 + Gigaword 5 corpus with a 400K-word vocabulary: http://nlp.stanford.edu/data/glove.6B.zip
We will use the 100-dimensional representation in this example.
First we read all the words from the GloVe file and build an index mapping each word to its word vector embedding, and then build the embedding matrix for our vocabulary (word_index is the tokenizer word index constructed further below).
MAX_SEQUENCE_LENGTH = 50
EMBEDDING_DIM = 100

# build an index mapping each word to its GloVe vector
embeddings_index = {}
f = open(os.path.join('data/', 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Found %s word vectors.' % len(embeddings_index))

# word_index is the tokenizer's word index built below
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in the embeddings index are left as all zeros
        embedding_matrix[i] = embedding_vector
For the LSTM network, the training data will consist of sequences of word indices representing the movie reviews from the IMDB dataset, and the output will be the sentiment. We will use the same dataset as in the article Sequence classification with LSTM.
f="data/tpos.txt"
train_data=[]
Xtrain=[]
labels=[]
print "reading postive samples"
text=[]
for sentence in open(f).xreadlines():
#print "AAAAAAAAAA",sentence
row={}
tt = [];
t=[]
try:
sentence1 = ''.join([i if ord(i) < 128 else ' ' for i in sentence])
str1 = " ".join(str(x) for x in sentence1);
labels.append(1)
text.append(str1)
except:
print "pskip data row",sentence
continue;
f="data/tneg.txt"
print "reading negative samples"
for sentence in open(f).xreadlines():
row={}
tt = [];
t=[]
try:
sentence1 = ''.join([i if ord(i) < 128 else ' ' for i in sentence])
str1 = " ".join(str(x) for x in sentence1);
#Xtrain.append(t)
labels.append(0)
text.append(str1)
except:
print "nskip data row",sentence
continue;
Each element of text contains a sequence of words. Now we use a Tokenizer to represent each word by its word index.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
MAX_NB_WORDS = 100000
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
#create the word indices
tokenizer.fit_on_texts(text)
word_index = tokenizer.word_index
#convert the training data to sequence of word indexes
sequences = tokenizer.texts_to_sequences(text)
Setting num_words tells the tokenizer to consider only the 100K most frequently occurring words in the training dataset.
We pad the sequences so that they all have the same length before passing them to the LSTM network, and convert the labels to one-hot vectors.
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
y_train = to_categorical(np.asarray(labels))
num_classes = y_train.shape[1]
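As a quick illustrative check (the sentence and the printed indices are made up; the actual indices depend on the corpus), we can inspect how the tokenizer maps words to indices and the shape of the padded data:
print(tokenizer.texts_to_sequences(["the movie was great"]))   # e.g. [[1, 17, 9, 84]]
print(data.shape)      # (number of reviews, MAX_SEQUENCE_LENGTH)
print(y_train.shape)   # (number of reviews, 2)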
First we create the embedding layer of the LSTM network and initialize its weights using the embedding matrix created in the earlier step. We set trainable to True so that the embeddings are updated based on the current training data; the embedding matrix passed in is used only as the initialization. If we set trainable to False, the embeddings will not be updated and the network will use the embedding matrix as is.
from keras.layers import Embedding

embedding_layer = Embedding(input_dim=len(word_index) + 1,
                            output_dim=EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)
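For comparison, a frozen variant (a sketch, not part of the pipeline above) simply passes trainable=False so the GloVe vectors stay fixed during training:
# same layer, but the pre-trained GloVe vectors are kept fixed
frozen_embedding_layer = Embedding(input_dim=len(word_index) + 1,
                                   output_dim=EMBEDDING_DIM,
                                   weights=[embedding_matrix],
                                   input_length=MAX_SEQUENCE_LENGTH,
                                   trainable=False)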
Now we create the network and train it.
from keras.models import Sequential
from keras.layers import Dense, LSTM

model = Sequential()
model.add(embedding_layer)
model.add(LSTM(128, return_sequences=False))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(data, y_train, epochs=20, batch_size=100, verbose=1)
model.save("/tmp/model4")
# X_test and y_test should be prepared from data/test-pos.txt and data/test-neg.txt in the same way as the training data
model.evaluate(X_test, y_test)
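Once trained, the model can be used to score a new review (an illustrative sketch; the review text is made up):
new_review = ["the movie was a delight from start to finish"]
seq = pad_sequences(tokenizer.texts_to_sequences(new_review), maxlen=MAX_SEQUENCE_LENGTH)
print(model.predict(seq))   # [[P(negative), P(positive)]]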
The approach of using existing word vector embeddings as the initialization of the embedding layer provides the best results.
This approach can also be considered a form of transfer learning, where embeddings learnt over a large corpus are used to learn embeddings over a much smaller corpus; the initialization boosts the learning and generalization ability of the classifier at hand. If we do not have enough training data to learn good embeddings over the current data, the generalization ability of the network may be limited. By using this approach we can take an existing embedding matrix and tune it specifically for the data at hand.
The complete code can be found at code1.