Using Pre-Trained Word Vector Embeddings for Sequence Classification with LSTM

Jan 30, 2018 · 6 mins read

In this article we will look at using pre-trained word vector embeddings for sequence classification with an LSTM network.

Using Pre-Trained Word Vector Embeddings

In the article NLP spaCy Word and document vectors we saw how to get the word vector representations trained on the Common Crawl corpus that are provided by the spaCy toolkit. We will use this pre-trained word vector representation rather than training our own Embedding layer.
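As a quick illustration of what spaCy gives us, every token of a parsed document carries a dense vector. Below is a minimal sketch, assuming a spaCy model with word vectors is installed and loadable as 'en' (as in the rest of this article):

import spacy

nlp = spacy.load('en')
doc = nlp(u"the movie was surprisingly good")
# one dense vector per token, e.g. 300-dimensional for the Common Crawl vectors
vectors = [token.vector for token in doc]
print(len(vectors))
print(vectors[0].shape)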

We cannot use the dataset bundled with Keras directly, so we download the raw IMDB dataset:

wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

We will use the following script to extract and normalize the data:

#this function converts text to lowercase and separates punctuation and special symbols from words
function normalize_text {
  awk '{print tolower($0);}' < $1 | sed -e 's/\./ \. /g' -e 's/<br \/>/ /g' -e 's/"/ " /g' \
  -e 's/,/ , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' -e 's/\?/ \? /g' \
  -e 's/\;/ \; /g' -e 's/\:/ \: /g' > $1-norm
}

cd ..
mkdir nbsvm_run; cd nbsvm_run

wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar -xvf aclImdb_v1.tar.gz
rm aclImdb_v1.tar.gz

for j in train/pos train/neg test/pos test/neg; do
  for i in `ls aclImdb/$j`; do cat aclImdb/$j/$i >> temp; awk 'BEGIN{print;}' >> temp; done
  normalize_text temp
  mv temp-norm aclImdb/$j/norm.txt
  rm temp
done

mkdir data
mv aclImdb/train/pos/norm.txt data/train-pos.txt
mv aclImdb/train/neg/norm.txt data/train-neg.txt
mv aclImdb/test/pos/norm.txt data/test-pos.txt
mv aclImdb/test/neg/norm.txt data/test-neg.txt

The above script stores the normalized positive training examples in data/train-pos.txt and the negative ones in data/train-neg.txt (and similarly for the test split).

We will use spaCy to get the word vector embeddings:


import os

import numpy as np
import keras
import spacy


def pad_vec_sequences(sequences, maxlen=40):
    # Pad (or truncate) each sequence of word vectors to a fixed length.
    # Shorter sequences are zero-padded at the front, longer ones are
    # truncated to their first maxlen vectors.
    new_sequences = []
    for sequence in sequences:
        orig_len, vec_len = np.shape(sequence)
        new = np.zeros((maxlen, vec_len))
        if orig_len < maxlen:
            new[maxlen - orig_len:, :] = sequence
        else:
            new[:, :] = sequence[:maxlen]
        new_sequences.append(new)
    return np.array(new_sequences)

def test():        
    nlp1 = spacy.load('en')

    f="data/train-pos.txt"
    train_data=[]
    Xtrain=[]
    labels=[]
    for sentence in open(f).xreadlines():
        
        row={}

        t=[]
        try:
            sentence1 = ''.join([i if ord(i) < 128 else ' ' for i in sentence])
            r=unicode(sentence1)
            doc = nlp1(r)
            #print doc
            for token in doc:
                t.append(token.vector)

            row['text'] = sentence
            row['feature'] = t
            row['label'] = 1
            Xtrain.append(t)
            labels.append(1)
            train_data.append(row)
        except:
            print "skip data row",sentence


    f = "data/train-neg.txt"
    for sentence in open(f).xreadlines():

        row={}

        t=[]
        try:
            sentence1=''.join([i if ord(i) < 128 else ' ' for i in sentence])
            r=unicode(sentence1)
            doc = nlp1(r)
            #print doc
            for token in doc:
                t.append(token.vector)

            row['text'] = sentence
            row['feature'] = t
            row['label'] = 0
            Xtrain.append(t)
            labels.append(0)
            train_data.append(row)
        except:
            print "skip data row",sentence


    print("completed processing the data")
    max_words=50

    num_classes = np.max(labels) + 1

    X_train = pad_vec_sequences(Xtrain, maxlen=max_words)
    y_train = keras.utils.to_categorical(labels, num_classes)
    print("Training data: ")
    print(X_train.shape)
    print(y_train.shape)

    # X_train is already a 3-D tensor of shape (samples, max_words, vec_len),
    # which is exactly the input an LSTM layer expects, so no reshape is needed

    print("input tensor")
    print(X_train.shape)
    print(y_train.shape)
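For comparison, these padded vector sequences can be fed to an LSTM directly, without any Embedding layer, since every timestep is already a dense word vector. A minimal sketch continuing from the tensors above:

    from keras.models import Sequential
    from keras.layers import Dense, LSTM

    # each sample is a (max_words, vec_len) matrix of pre-computed word vectors
    vec_len = X_train.shape[2]
    model = Sequential()
    model.add(LSTM(128, input_shape=(max_words, vec_len)))
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=5, batch_size=100, verbose=1)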

However, directly using the word vector embeddings often does not lead to good results, since the embeddings were trained over the Common Crawl corpus, which may not be representative of the current dataset.

Another approach is to use pre-trained word vector embeddings as the initialization for the Embedding layer. Here we use the GloVe word vector embeddings to initialize the embedding layer of the LSTM network.

You can download the GloVe word vector embeddings from the GloVe: Global Vectors for Word Representation project page. We will download the GloVe vectors trained over the Wikipedia 2014 corpus, covering the top 400K words: http://nlp.stanford.edu/data/glove.6B.zip

We will use the 100-dimensional representation in this example.

First we read all the words from the GloVe file and build an index from each word to its word vector embedding:

    MAX_SEQUENCE_LENGTH = 50
    EMBEDDING_DIM = 100 
    embeddings_index = {}
    f = open(os.path.join('data/', 'glove.6B.100d.txt'))
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()

    print('Found %s word vectors.' % len(embeddings_index))

    embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        # words not found in the GloVe index are left as all-zeros
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
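As a quick sanity check (not part of the original code), we can count how many words of the vocabulary actually receive a GloVe vector; word_index here is the Tokenizer index built further below:

    # how many vocabulary words are covered by the GloVe embeddings?
    covered = sum(1 for word in word_index if word in embeddings_index)
    print('GloVe covers %d of %d words' % (covered, len(word_index)))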

For the LSTM network the training data will consist of sequences of word indices representing the movie reviews from the IMDB dataset, and the output will be the sentiment. We will use the same dataset as in the article Sequence classification with LSTM.

    f="data/tpos.txt"
    train_data=[]
    Xtrain=[]
    labels=[]
    print "reading postive samples"
    text=[]
    for sentence in open(f).xreadlines():
        #print "AAAAAAAAAA",sentence
        row={}
        tt = [];
        t=[]
        try:
            sentence1 = ''.join([i if ord(i) < 128 else ' ' for i in sentence])

            str1 = " ".join(str(x) for x in sentence1);

            labels.append(1)

            text.append(str1)
        except:
            print "pskip data row",sentence
            continue;

    f="data/tneg.txt"
    print "reading negative samples"
    for sentence in open(f).xreadlines():
      
        row={}
        tt = [];
        t=[]
        try:
            sentence1 = ''.join([i if ord(i) < 128 else ' ' for i in sentence])

            str1 = " ".join(str(x) for x in sentence1);
            #Xtrain.append(t)
            labels.append(0)

            text.append(str1)
        except:
            print "nskip data row",sentence
            continue;

Each element of text contains a sequence of words. Now we use a Tokenizer to represent the words by their word indices.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical

MAX_NB_WORDS = 100000

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)

#create the word indices
tokenizer.fit_on_texts(text)

word_index = tokenizer.word_index

#convert the training data to sequence of word indexes
sequences = tokenizer.texts_to_sequences(text)

Setting num_words=MAX_NB_WORDS tells the tokenizer to consider only the 100K most frequently occurring words in the training dataset.

We pad the sequences so that they all have the same length before being passed to the LSTM network:

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
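The training code further below refers to num_classes and y_train, which are not defined in this part of the listing. A small sketch of how they can be prepared from the labels collected above:

num_classes = 2                                # positive / negative
y_train = to_categorical(labels, num_classes)
print(data.shape)                              # (num_samples, MAX_SEQUENCE_LENGTH)
print(y_train.shape)                           # (num_samples, 2)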

First we create the Embedding layer of the LSTM network and initialize its weights with the embedding matrix created in the earlier step. We set trainable to True so that the embeddings are updated from the current training data; the embedding matrix we pass in serves only as the initialization. If we set trainable to False, the embeddings are never updated and the network uses the GloVe vectors from the embedding matrix as they are.

    embedding_layer = Embedding(input_dim=len(word_index) + 1,
                                output_dim=EMBEDDING_DIM,
                                weights=[embedding_matrix],
                                input_length=MAX_SEQUENCE_LENGTH,
                                trainable=True)
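If instead we want to keep the GloVe vectors fixed, the only change is the trainable flag; a sketch of the frozen variant discussed above:

    # frozen variant: the GloVe vectors are used as-is and never updated
    frozen_embedding_layer = Embedding(input_dim=len(word_index) + 1,
                                       output_dim=EMBEDDING_DIM,
                                       weights=[embedding_matrix],
                                       input_length=MAX_SEQUENCE_LENGTH,
                                       trainable=False)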

Now we will create the network and train it:


    from keras.models import Sequential, save_model
    from keras.layers import Dense, Embedding, LSTM

    model = Sequential()
    model.add(embedding_layer)
    model.add(LSTM(128, return_sequences=False))
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    model.fit(data, y_train, epochs=20, batch_size=100, verbose=1)
    save_model(model, "/tmp/model4")

    # X_test and y_test are prepared from the test split in the same way as the
    # training data (see the sketch below)
    model.evaluate(X_test, y_test)
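Here X_test and y_test are assumed to come from the held-out test split (data/test-pos.txt and data/test-neg.txt), read with the same loop as the training data into hypothetical test_text and test_labels lists and then pushed through the same fitted tokenizer. A sketch:

    # hypothetical test-set preparation, mirroring the training pipeline
    test_sequences = tokenizer.texts_to_sequences(test_text)
    X_test = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)
    y_test = to_categorical(test_labels, num_classes)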

The approach of using existing word vector embeddings as the initialization of the embedding layer gives the best results.

This approach can also be seen as a form of transfer learning, where embeddings learnt over a large corpus are reused to learn embeddings over a much smaller corpus; the initialization boosts the learning and generalization ability of the classifier at hand. If we do not have enough training data to learn good embeddings over the current data, the generalization ability of the network may be limited; with this approach we can take an existing embedding matrix and tune it specifically for the data at hand.

The complete code can be found at code1.
