Let’s take the simple case study of book reviews. We all read books, and we google the various options available. As an end user, I am interested in a few particular genres, so to find books in those genres I would search for a few key phrases or keywords. The intention here is to find the best books to read. But this approach doesn’t help me find information that I am unaware of: a regular search only throws up results containing the keywords that I myself mentioned. Search results often lack relevance, and this approach finds documents, not knowledge. Hence the need for text mining, especially as roughly 90% of the world’s data is unstructured.
We should also be aware of the differences between the search approach and the discovery approach. Goal-oriented search on structured data is termed “Data Retrieval” and on unstructured data is termed “Information Retrieval”. Similarly, opportunity-based discovery on structured data is termed “Data Mining” and on unstructured text is termed “Text Mining”.
In the text mining process, the unstructured data is converted into numerical data with meaningful indices and then a predictive model is built on these numerical indices.
But there are various challenges to this approach:
- Very high number of possible dimensions: it is impossible to analyse every possible word and phrase in a language.
- Unlike data mining, the records here are not structurally identical and not statistically independent.
- There are subtle relationships between concepts. For example, consider two pieces of information: Company A is acquired by Company B, and Company B merges into Company A. On the surface, all we know is that something is being said about two companies; we need to delve deeper to find out what is actually being said.
- There can be ambiguity and context sensitivity. For example, the word “Apple” may refer to a fruit or it may refer to the company.
Various approaches to text mining can be broadly classified under two methodologies:
- Supervised Techniques: these start from a training dataset that is already labelled, and a classifier algorithm uses the labelled data to classify new data. Classification techniques include SVM, Naive Bayes, Maximum Entropy, etc.
- Unsupervised Techniques: this is undirected data mining. A common example is the lexicon-based approach, also known as the dictionary approach, which doesn’t require separate training and testing datasets; instead, a list or dictionary of words is used to classify the text data (a minimal sketch follows this list). Other examples include clustering and topic modeling.
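As a minimal sketch of the lexicon-based idea (the word lists and sentences below are tiny illustrative examples, not a real sentiment lexicon):
positive.words <- c("good", "great", "excellent", "love")
negative.words <- c("bad", "poor", "terrible", "hate")
score.text <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+")) # crude tokenization
  sum(words %in% positive.words) - sum(words %in% negative.words)
}
score.text("This book is a great and excellent read") # positive score (2)
score.text("A terrible plot and poor characters") # negative score (-2)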
To help achieve everything we have discussed so far, we use the tm (text mining) package in R. This is the basic package for preprocessing text data. Preprocessing steps may include:
- Remove unwanted lines of data
- Convert text to a corpus: Corpus is a large and structured set of texts or documents.
- Cleansing the corpus: this includes converting the text to a single case (upper or lower), discarding extra spaces and punctuation, removing stopwords, stemming (heuristically chopping off word endings) and lemmatization (reducing a word to its morphological root), e.g. replacing trying, tried, try, etc. with the single word try.
- Inspecting the corpora
- Finally, creating the TDM (Term Document Matrix), which describes the frequency of terms occurring in a collection of documents. It is a mathematical matrix in which the rows are the terms and the columns are the documents. On the TDM we can do the following analyses: finding frequent terms, finding associations (terms that are correlated or similar to each other), clustering (separating records into groups that are similar with respect to the terms contained in each record) and finally sentiment analysis (the process of computationally identifying and categorizing opinions expressed in a piece of text). A small worked example follows.
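For instance, take two toy documents, D1 = “farmers welcome the budget” and D2 = “the budget helps farmers and the middle class” (made-up sentences for illustration). In the TDM, the term “farmers” gets a row with the entries 1 and 1 (it occurs once in each document), “welcome” gets 1 and 0, and “middle” gets 0 and 1; each column of the matrix thus summarizes one document as a vector of term counts.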
Implementation in R:
We impart some structure to the free-flowing text using the “Corpus” function of the tm package. For example:
library(tm) # load the text mining package
tweets.corpus <- Corpus(VectorSource(tweets$Tweet_Text))
In this code line, “tweets” is the data frame of downloaded tweets and “Tweet_Text” is the column that contains the text of each tweet. The function “Corpus” can take a variety of input sources:
- DataframeSource
- DirSource
- URISource
- VectorSource
- XMLSource, etc.
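As a sketch of how two of these sources might be used (the file name tweets.csv is an assumption for illustration; in recent versions of tm, DataframeSource expects columns named doc_id and text):
tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE) # hypothetical file holding the downloaded tweets
tweets.corpus <- Corpus(VectorSource(tweets$Tweet_Text)) # VectorSource: one document per element of the vector
tweets.df <- data.frame(doc_id = as.character(seq_along(tweets$Tweet_Text)), text = tweets$Tweet_Text, stringsAsFactors = FALSE)
tweets.corpus2 <- Corpus(DataframeSource(tweets.df)) # DataframeSource: documents come from the text column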
Once we create the corpus, the next step is to cleanse it. Multiple transformation functions can be applied through tm_map:
tweets.corpus <- tm_map(tweets.corpus, content_transformer(tolower)) # converting to lower case
tweets.corpus <- tm_map(tweets.corpus, stripWhitespace) # removing extra white spaces
tweets.corpus <- tm_map(tweets.corpus, removePunctuation) # removing punctuation
tweets.corpus <- tm_map(tweets.corpus, removeNumbers) # removing numbers
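The cleansing steps listed earlier also mention stopword removal and stemming; they are not in the snippet above, but a typical way to add them with tm (stemming relies on the SnowballC package) would be:
library(SnowballC) # required by stemDocument
tweets.corpus <- tm_map(tweets.corpus, removeWords, stopwords("english")) # removing stopwords
tweets.corpus <- tm_map(tweets.corpus, stemDocument) # stemming, e.g. "trying" and "tried" reduce to a common stem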
Finally, we can create the TDM, and remove sparse terms, using a simple code line:
tweets.tdm <- TermDocumentMatrix(tweets.corpus) # build the term-document matrix
tweets.imp <- removeSparseTerms(tweets.tdm, 0.97) # drop terms absent from more than 97% of the documents
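A quick sanity check on the effect of removeSparseTerms is to compare the dimensions of the two matrices and look at a small corner of the result (the 1:5 indices assume at least five terms and five documents remain):
dim(tweets.tdm) # terms x documents before removing sparse terms
dim(tweets.imp) # usually far fewer terms afterwards
inspect(tweets.imp[1:5, 1:5]) # a small corner of the term-document matrix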
Once TDM is created, we can realize our goal of finding associations between the terms as follows:
findFreqTerms(tweets.tdm, 10) # terms occurring at least 10 times
findFreqTerms(tweets.tdm, 30) # terms occurring at least 30 times
findFreqTerms(tweets.tdm, 50) # terms occurring at least 50 times
findFreqTerms(tweets.tdm, 70) # terms occurring at least 70 times
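The second argument of findFreqTerms is the minimum frequency (lowfreq); an optional third argument (highfreq) caps it. The function returns a character vector of terms, which can be stored for later use:
freq.terms <- findFreqTerms(tweets.tdm, lowfreq = 30) # terms appearing at least 30 times
head(freq.terms) # peek at the frequent terms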
findAssocs(tweets.tdm, "farmers", 0.4)
findAssocs(tweets.tdm, "budget", 0.6)
findAssocs(tweets.tdm, "middleclass", 0.6)
findAssocs(tweets.tdm, "income", 0.7)
The first of these calls identifies all the words that are associated with “farmers” with a correlation of at least 0.4; the other calls do the same for their respective terms and thresholds. This way, we can come to know the associations. Taking the example of union-budget-related tweets, we can also identify how many times the word “farmers” was mentioned, or how many times the word “middleclass” was mentioned.
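Note that findAssocs returns a list named by the query terms, where each element is a named numeric vector giving the correlation of every associated term at or above the threshold:
farmer.assocs <- findAssocs(tweets.tdm, "farmers", 0.4)
farmer.assocs$farmers # associated terms and their correlation with "farmers"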
Once we are done with all this analysis, we can employ various classification methods/algorithms to achieve our end goal:
- Naive Bayes: the classification method here is based on Bayes’ rule and relies on a “Bag of Words” representation of a document.
- Decision Tree: a greedy, top-down, binary recursive partitioning that divides the feature space into a set of disjoint rectangular regions.
- Random Forest: an ensemble learning technique, essentially a “forest” of decision tree classifiers. Each tree gives its own view of which features matter for classification, and the results of the individual trees are finally aggregated.
- SVM: the different categories are separated from each other by finding the hyperplane that best segregates them in the feature space.
Each of these approaches can be evaluated with metrics such as accuracy, precision and/or recall. Depending on the metric, we can choose the algorithm best suited to our case study.
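As a minimal sketch of how one of these classifiers might be trained and evaluated on the term-document matrix built above, the snippet below uses naiveBayes from the e1071 package; the Sentiment label column, the 70/30 split and the seed value are assumptions for illustration only, not part of the original data.
library(e1071) # provides naiveBayes
dtm <- as.data.frame(t(as.matrix(tweets.imp))) # documents as rows, terms as columns
dtm$Sentiment <- as.factor(tweets$Sentiment) # hypothetical labels, e.g. "positive"/"negative"
set.seed(42) # reproducible split
train.idx <- sample(nrow(dtm), floor(0.7 * nrow(dtm))) # simple 70/30 train/test split
nb.model <- naiveBayes(Sentiment ~ ., data = dtm[train.idx, ])
nb.pred <- predict(nb.model, dtm[-train.idx, ])
table(nb.pred, dtm$Sentiment[-train.idx]) # confusion matrix
mean(nb.pred == dtm$Sentiment[-train.idx]) # accuracy
The same split and metrics can be reused with the other classifiers to compare them on the same data.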