In this example we will look at the problem of natural language understanding (NLU). The aim is simply to understand the meaning of a sentence. This is an unsolved problem in the context of open-domain sentence understanding, i.e. there exists no system that can understand the meaning of an arbitrary open-domain sentence. However, many of the techniques that work well do so within a specific domain.
An intent is a group of utterances with similar meaning; for example, "show me flights from Boston" and "I need a flight to Denver" both express the same flight-query intent.
The Airline Travel Information System (ATIS) dataset is a standard dataset used to demonstrate and benchmark NLU algorithms. ATIS consists of spoken queries about flight-related information.
We will use a modified version of the ATIS dataset from the project https://github.com/yvchen/JointSLU . This dataset contains the spoken queries along with the intent and the entities associated with each statement.
The dataset consists of sentences labelled in the Inside-Outside-Beginning (IOB) representation.
An example utterance with its IOB tags is:

Show | flights | from | Boston | to | New   | York  | today  | EOS
O    | O       | O    | B-dept | O  | B-arr | I-arr | B-date | atis_flight
The first aim is to understand the intent of the statement, which in the above example is to show flight information (atis_flight). The second task is to recognize that the departure city is Boston, the arrival city is New York, and the date is specified by today. This second task is called slot filling.
The first task is to read the ATIS dataset. Each line contains the utterance between BOS and EOS markers, followed by the IOB tags and, as the last token, the intent label.
import numpy as np

def processCSVData(filename):
    # Each line of the file has the form:
    #   BOS w1 ... wn EOS t0 t1 ... tn intent
    # where t0 is the tag for BOS and the intent is the last token.
    output = []
    labels = []
    with open(filename, 'r') as f:
        for row in f:
            r = {}
            words = row.split()
            intent_name = words[-1]   # the intent label is the last token
            words = words[:-1]
            text = ""
            tags = ""
            state = -1                # -1: before BOS, 0: reading words, 1: reading tags
            for w in words:
                if w == "EOS":
                    state = 1
                    continue          # do not include the EOS marker in the tags
                if state == 0:
                    text = text + " " + w
                elif state == 1:
                    tags = tags + " " + w
                if w == "BOS":
                    state = 0
            r['text'] = text
            r['tags'] = tags
            r['intent'] = intent_name
            labels.append(intent_name)
            output.append(r)
    labels = np.unique(labels)
    return output, labels
output, labels = processCSVData("atis.txt")
# print(output[0], labels)
print("number of samples", len(output))
print("number of intents", len(labels))
The official ATIS split contains 4,978 training and 893 test sentences, with 22 intent classes.
Our first task is to perform intent detection alone. The input to the classifier is a sequence of words, and the output is the intent associated with the statement.
We will use an LSTM for intent classification. As discussed in the article, we follow the sequence-classification approach, using pre-trained word-vector embeddings as inputs.
Below is the code that reads the GloVe word embeddings and creates a neural-network embedding layer that uses the pre-trained vectors as its initialization. The function returns a tokenizer that converts words to integer indices into the dictionary, as well as an embedding layer that can be used as the first layer of a sequential model in an NLP pipeline. (The constants MAX_NB_WORDS and MAX_SEQUENCE_LENGTH are defined here with example values.)
import os
from keras.preprocessing.text import Tokenizer
from keras.layers import Embedding

MAX_NB_WORDS = 20000        # cap on the tokenizer vocabulary (example value)
MAX_SEQUENCE_LENGTH = 50    # maximum utterance length (example value)

def getEmbeddingLayer(EMBEDDING_DIM):
    # read the pre-trained GloVe vectors into a word -> vector dictionary
    embeddings_index = {}
    f = open(os.path.join('data/', 'glove.6B.100d.txt'))
    count = 0
    words = []
    for line in f:
        values = line.split()
        word = values[0]
        words.append(word)
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
        count = count + 1
    f.close()
    # fit a tokenizer on the GloVe vocabulary so that words map to integer indices
    tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
    tokenizer.fit_on_texts(words)
    word_index = tokenizer.word_index
    print("total word embeddings:", count, len(word_index))
    # build the embedding matrix; words not found in the embedding index stay all-zeros
    embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    embedding_layer = Embedding(input_dim=len(word_index) + 1,
                                output_dim=EMBEDDING_DIM,
                                weights=[embedding_matrix],
                                input_length=MAX_SEQUENCE_LENGTH,
                                trainable=True)
    return tokenizer, embedding_layer
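As a quick check, the returned tokenizer converts raw utterances into padded integer sequences. A minimal usage sketch (the embedding dimension of 100 matches the glove.6B.100d.txt file used above):

from keras.preprocessing.sequence import pad_sequences

tokenizer, embedding_layer = getEmbeddingLayer(100)
seqs = tokenizer.texts_to_sequences(["show flights from boston to new york"])
padded = pad_sequences(seqs, maxlen=MAX_SEQUENCE_LENGTH)  # shape: (1, MAX_SEQUENCE_LENGTH)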
Below is the code that builds the classification model on top of the embedding layer:
from keras.models import Sequential
from keras.layers import LSTM, Dense

def create_embedded_model(embedding_layer, num_classes, MAX_SEQUENCE_LENGTH, EMBEDDING_DIM):
    model = Sequential()
    model.add(embedding_layer)                            # maps word indices to dense vectors
    model.add(LSTM(128, return_sequences=False))          # final hidden state summarizes the utterance
    model.add(Dense(num_classes, activation='softmax'))   # one output per intent class
    model.summary()
    return model
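For completeness, a minimal compile-and-fit sketch. The epoch and batch settings follow the text below; the label encoding, optimizer choice, and validation split are assumptions, not part of the original code. It reuses output and labels from processCSVData and the tokenizer and embedding_layer obtained above:

from keras.utils import to_categorical

X = pad_sequences(tokenizer.texts_to_sequences([r['text'] for r in output]),
                  maxlen=MAX_SEQUENCE_LENGTH)
label_to_id = {name: i for i, name in enumerate(labels)}
y = to_categorical([label_to_id[r['intent']] for r in output], num_classes=len(labels))

model = create_embedded_model(embedding_layer, len(labels), MAX_SEQUENCE_LENGTH, 100)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=20, batch_size=100, validation_split=0.1)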
Training the model on the ATIS dataset for 20 epochs with a batch size of 100 gives an accuracy of 99%. The dataset is too small to establish any kind of benchmark, and a deep network will clearly overfit on data of this size.
This article merely demonstrates how LSTM networks can be used for the intent-detection task.
The complete code can be found at code
References
- Keras Tutorial - Spoken Language Understanding
- http://mrbot.ai/blog/natural-language-processing/understanding-intent-classification/
- https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html
- http://ccc.inaoep.mx/~villasen/bib/dmello-roman07.pdf