In natural language processing, one of the most common questions is how to convert a sentence into some kind of numeric representation for machine learning algorithms.
One of the most elementary ways of doing this is to represent a sentence by the relative frequency counts of the words occurring in it.
Given a dictionary, we can associate an index with every word occurring in a text document.
A dictionary can be constructed from the training corpus, or we can use a pre-defined dictionary containing a word list.
We will be using the Gensim topic-modelling library in this article.
We will use Webster's dictionary as the word list for building the dictionary for NLP purposes. Download the dictionary.txt file from the https://github.com/pi19404/dictionary GitHub repository.
The code to create the dictionary file is as follows:
from gensim import corpora

def saveDictionary(self, source, dest):
    # set up input and output file paths
    dict_file = self.cwd + '/' + source
    dest_file = self.cwd + '/' + dest
    # read the input text file
    f = open(dict_file, 'r')
    lines = f.readlines()
    f.close()
    # tokenize the data, one token list per line
    tokenize_data = [[word for word in line.lower().split()] for line in lines]
    # create and save the dictionary file
    dictionary = corpora.Dictionary(tokenize_data)
    dictionary.save(dest_file)
def loadDictionary(self, filename):
    dict_file = self.cwd + '/' + filename
    dictionary = corpora.Dictionary.load(dict_file)
    return dictionary
The function saveDictionary accepts a text file source as input and creates a dictionary file as output at dest. The loadDictionary function loads a dictionary from the file saved by saveDictionary.
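For completeness, here is a minimal usage sketch. It assumes the two functions above are methods of some helper class (here called DictionaryBuilder, a name introduced purely for illustration) whose cwd attribute points at the directory containing dictionary.txt:

import os

builder = DictionaryBuilder()    # hypothetical class holding saveDictionary/loadDictionary
builder.cwd = os.getcwd()        # base directory for resolving source/dest file names
builder.saveDictionary('dictionary.txt', 'dictionary.dict')
dictionary = builder.loadDictionary('dictionary.dict')
print(len(dictionary))           # number of unique words indexed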
Next, to represent a sentence, we use the bag-of-words model of representation, where every word in a sentence is mapped to its dictionary index together with the frequency count of its occurrences in the sentence.
def bowfeature(sentence, dictionary):
    # map a list of tokens to (dictionary index, frequency count) pairs
    corpus = dictionary.doc2bow(sentence)
    return corpus

tokens = "do you know to go to market"
tokenize_data = [word for word in tokens.lower().split()]
feature1 = bowfeature(tokenize_data, dictionary)
print(feature1)
The above function accepts a tokenized sentence and a dictionary object as input and returns the bag-of-words representation of the sentence.
For example, the text "do you know to go to market" has the bag-of-words feature representation [(29, 1), (116, 2), (928, 1), (1688, 1), (3685, 1), (23187, 1)]:
do -> (1688, 1)
you -> (29, 1)
know -> (3685, 1)
to -> (116, 2) (the word occurs twice in the sentence, so its count is 2)
go -> (928, 1)
market -> (23187, 1)
Each sentence may contain a variable number of words, so its mathematical representation is a variable-length sequence of word indices together with the frequency counts of the words occurring in the sentence.
Tf-idf
Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus.
Typically, the tf-idf weight is composed of two terms: the first computes the normalized term frequency (TF), i.e. the number of times a word appears in a document divided by the total number of words in that document; the second term is the inverse document frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.
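Written out as formulas, with $f_{t,d}$ the count of term $t$ in document $d$, $N$ the number of documents in the corpus, and $n_t$ the number of documents containing $t$:

$$\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad \mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad \text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)$$

In the worked example below the logarithm is taken base 2, so a term appearing in one of two documents gets idf = log2(2/1) = 1, while a term appearing in both gets idf = log2(2/2) = 0.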
The TF term computes the relative frequency count of a word occurring in the document/sentence being analyzed, and measures how important the word is within that document. The IDF term computes the importance of the word with respect to the entire corpus: words which occur very frequently across the corpus are assigned lower values than words which occur rarely.
from gensim import models

def computetfidf(self, features):
    # fit a tf-idf model on the bag-of-words corpus
    tfidf = models.TfidfModel(features)
    return tfidf
Let us consider the following two statements as a toy example:
([u'do', u'you', u'know', u'to', u'cook'], [(29, 1), (116, 1), (1688, 1), (3685, 1), (83291, 1)])
([u'do', u'know', u'to', u'go', u'to', u'market'], [(116, 2), (928, 1), (1688, 1), (3685, 1), (23187, 1)])
Applying the tf-idf model to each bag-of-words vector gives:
[(116, 2), (928, 1), (1688, 1), (3685, 1), (23187, 1)] -> [(928, 0.7071067811865475), (23187, 0.7071067811865475)]
[(29, 1), (116, 1), (1688, 1), (3685, 1), (83291, 1)] -> [(29, 0.7071067811865475), (83291, 0.7071067811865475)]
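A minimal sketch of how these outputs can be reproduced (the dictionary object is the one loaded earlier; the variable names are illustrative):

from gensim import models

# the two toy sentences, tokenized
texts = [['do', 'you', 'know', 'to', 'cook'],
         ['do', 'know', 'to', 'go', 'to', 'market']]

# bag-of-words vectors against the Webster-based dictionary
bow_corpus = [dictionary.doc2bow(text) for text in texts]

# fit the tf-idf model on the two-document corpus
tfidf = models.TfidfModel(bow_corpus)

# apply the model to each bag-of-words vector
for bow in bow_corpus:
    print(bow, '->', tfidf[bow])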
The TF terms for the two sentences are:
[(116, 2/6), (928, 1/6), (1688, 1/6), (3685, 1/6), (23187, 1/6)]
[(29, 1/5), (116, 1/5), (1688, 1/5), (3685, 1/5), (83291, 1/5)]
The IDF terms are:
[(116, 0), (928, 1), (1688, 0), (3685, 0), (23187, 1)]
[(29, 1), (116, 0), (1688, 0), (3685, 0), (83291, 1)]
Multiplying the two, we get the unnormalized tf-idf weights:
[(116, 0), (928, 1/6), (1688, 0), (3685, 0), (23187, 1/6)]
[(29, 1/5), (116, 0), (1688, 0), (3685, 0), (83291, 1/5)]
Normalizing the tf-idf values for each document (dividing by the Euclidean norm of its weight vector) gives:
[(116, 0), (928, 1/sqrt(2)), (1688, 0), (3685, 0), (23187, 1/sqrt(2))]
[(29, 1/sqrt(2)), (116, 0), (1688, 0), (3685, 0), (83291, 1/sqrt(2))]
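A quick numeric check of this arithmetic in plain Python:

import math

# unnormalized tf-idf weights for the non-zero terms of the second sentence
weights = [1.0/6, 1.0/6]   # 'go' and 'market'
norm = math.sqrt(sum(w * w for w in weights))
print([w / norm for w in weights])   # both ~0.7071, matching the Gensim output above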
Thus we can see that tf-idf assigns zero weight to terms which occur in every document of the corpus, while terms which are less frequent in the corpus receive non-zero weight.
Assigning low weight to stop words is done automatically by the tf-idf process, thereby eliminating the need for a stop-word removal preprocessing stage in the NLP pipeline.
Thus, given a set of sentences, we have obtained a mathematical representation of the sentences.
References
- https://radimrehurek.com/gensim/index.html