Data Mining – Extraction, Collection and Preparation of raw data
Data is available everywhere: in websites, documents, social media, news, sensors, machines, robots, and even in this blog post. With so much data in place, the number of data science applications that exploit it to produce useful information is also very high. Analytics scenarios such as exploring an author's personality by analyzing the vocabulary of a blog post, building the social networks of Twitter and Facebook friends, or digitizing old documents are classic examples of data being exploited to find insights.
Each of these scenarios starts with the same important step: data mining. Data mining is the technique of creating a raw data set by capturing data from a data source. The term has a broader meaning in analytics, but in this blog we discuss data mining as the first step of any data science application, the step that deals primarily with data collection and data extraction. We will also look at some of the techniques of data mining.
Data Mining
Data mining is the process of collecting data, extracting data and preparing a raw data set. It results in data sets that are in ready-to-analyse formats. It draws on the computational process of discovering patterns in large data sets or big data, and on methods from machine learning, statistics and natural language processing. The overall aim is to fetch information from the data sources and transform it into defined structures for further use. The following are some of the important areas of data mining in action.
Web Data Mining
Data mining from the web is sometimes also termed data crawling or web scraping. In this process, data that is available on the web, embedded in HTML tags, tables and graphs, is extracted. For large-scale web data mining, a seeding, crawling and parsing architecture is a good practice to adopt. In the seeding phase, all the links and URLs (seeds) are collected. In the crawling phase, scrapers and crawlers are developed to pull out the data corresponding to each seed; this data is generally HTML, XML or JSON formatted. In the parsing phase, the useful data points are extracted from the crawled data.
Some useful libraries for web data mining are beautifulsoup (python), requests (python), mechanize (python), selenium (Java, with bindings for Python and other languages) and phantomjs (the last two are automation tools, but can also be used for extraction). The following Python code extracts data from www.google.com:
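    # Python 3; requires the beautifulsoup4 package
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    url = "http://www.google.com"
    htmltext = urlopen(url).read()  # raw HTML of the page, as bytes
    soup = BeautifulSoup(htmltext, "html.parser")
    # "classname" is a placeholder for the CSS class of the element you want
    element = soup.find("div", attrs={"class": "classname"})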
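The seeding, crawling and parsing phases described above can be sketched with the Python standard library alone. This is a minimal illustration, not a production crawler: the `FAKE_SITE` dictionary, the `example.com` URLs and the function names `seed`, `crawl` and `parse` are assumptions made for this example, and the in-memory `fetch` stands in for a real HTTP client such as requests.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Seeding helper: collects the absolute URL of every <a href=...>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


class HeadingParser(HTMLParser):
    """Parsing helper: extracts the text inside <h1> tags."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.heading = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.heading += data


def seed(index_html, base_url):
    """Seeding phase: collect the links (seeds) found on an index page."""
    collector = LinkCollector(base_url)
    collector.feed(index_html)
    return collector.links


def crawl(seeds, fetch):
    """Crawling phase: pull the raw HTML for each seed, skipping failures."""
    pages = {}
    for url in seeds:
        html = fetch(url)
        if html is not None:
            pages[url] = html
    return pages


def parse(pages):
    """Parsing phase: extract the useful data points from each crawled page."""
    records = []
    for url, html in pages.items():
        parser = HeadingParser()
        parser.feed(html)
        records.append({"url": url, "heading": parser.heading.strip()})
    return records


# Hypothetical in-memory "website" so the sketch runs without network access
FAKE_SITE = {
    "http://example.com/a": "<html><body><h1>Page A</h1></body></html>",
    "http://example.com/b": "<html><body><h1>Page B</h1></body></html>",
}
INDEX_HTML = '<a href="/a">A</a> <a href="/b">B</a>'

seeds = seed(INDEX_HTML, "http://example.com/")
pages = crawl(seeds, FAKE_SITE.get)
records = parse(pages)
print(records)
```

Keeping the three phases as separate functions means the crawled HTML can be stored and parsed again later without fetching the pages a second time.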
Social Media Data Mining
There is an enormous amount of data available on social media sources such as Twitter, Facebook and LinkedIn. These platforms act as big data houses with open gates for push and pull interaction. The data extraction APIs are the open gates for pulling social data from these sources. Twitter provides Streaming and REST APIs for data extraction, while Facebook provides the Graph API for the same purpose. Python packages such as twython, tweepy and twitter-api are used for this purpose.
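    # Python 3; uses the "twitter" package (Python Twitter Tools)
    from twitter import OAuth, Twitter

    # access_key, access_secret, consumer_key and consumer_secret are the
    # credentials of your own Twitter app and must be defined beforehand
    auth = OAuth(access_key, access_secret, consumer_key, consumer_secret)
    twitter = Twitter(auth=auth)

    query = twitter.search.tweets(q="data science")
    for result in query["statuses"]:
        print(result["text"])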
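Once a search response has been pulled, it usually has to be flattened before analysis. The sketch below assumes the response is a dictionary with a `statuses` list, as in the loop above; the field names `id_str`, `user` and `screen_name` are recalled from Twitter's tweet object and should be checked against the current API documentation. The helper name `flatten_statuses` and the sample payload are hypothetical and hand-written for illustration; they are not real Twitter data.

```python
def flatten_statuses(response):
    """Turn a search response into a list of flat rows (one per tweet)."""
    rows = []
    for status in response.get("statuses", []):
        user = status.get("user", {})  # nested user object, may be absent
        rows.append({
            "id": status.get("id_str"),
            "user": user.get("screen_name"),
            "text": status.get("text", ""),
        })
    return rows


# Hand-written sample in the assumed response shape (not real data)
sample_response = {
    "statuses": [
        {"id_str": "1", "user": {"screen_name": "alice"},
         "text": "Learning data science"},
        {"id_str": "2", "text": "A status without a user field"},
    ]
}

rows = flatten_statuses(sample_response)
for row in rows:
    print(row)
```

Using `.get()` with defaults keeps the loop from failing when a field is missing, which is common with semi-structured API responses.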
Data Mining from Documents
A lot of text data is available in PDF files, Word documents, text files and other formats such as XML or JSON. Sometimes data is also present in scanned images. Though with comparatively less accuracy, this data can also be mined and used for data-driven applications and tools. Techniques such as PDF parsing and optical character recognition are useful for extracting this type of data. For example, much of the old data on citizens and policies held by the government sector is stored in the form of PDF reports. This data can be digitized into machine-readable text in an analyzable format and used to make important decisions, such as providing beneficial services to poor people who are deprived of major facilities.
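    # Python 3; requires the pdfminer.six package
    from io import StringIO

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage


    def convert_pdf_to_txt(path):
        """Return the text content of the PDF at `path` as a single string."""
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()  # in-memory buffer that receives the extracted text
        device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        with open(path, "rb") as fp:
            # maxpages=0 means no page limit; an empty pagenos set means all pages
            for page in PDFPage.get_pages(fp, pagenos=set(), maxpages=0,
                                          password="", caching=True):
                interpreter.process_page(page)
        text = retstr.getvalue()
        device.close()
        retstr.close()
        return text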
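Documents that are already in XML or JSON are easier to mine, because the Python standard library can parse them directly. The following is a minimal sketch that reads the same kind of record from both formats; the `citizen`, `name` and `city` fields and the two sample documents are invented for illustration.

```python
import json
import xml.etree.ElementTree as ET


def records_from_json(text):
    """Extract name/city records from a JSON document."""
    data = json.loads(text)
    return [{"name": c["name"], "city": c["city"]} for c in data["citizens"]]


def records_from_xml(text):
    """Extract name/city records from an XML document."""
    root = ET.fromstring(text)
    return [
        {"name": c.findtext("name"), "city": c.findtext("city")}
        for c in root.findall("citizen")  # direct <citizen> children of the root
    ]


# Invented sample documents (illustrative only)
JSON_DOC = '{"citizens": [{"name": "Asha", "city": "Pune"}]}'
XML_DOC = (
    "<citizens>"
    "<citizen><name>Ravi</name><city>Delhi</city></citizen>"
    "</citizens>"
)

records = records_from_json(JSON_DOC) + records_from_xml(XML_DOC)
print(records)
```

Both functions return the same list-of-dictionaries structure, so records from different document types can be combined into one raw data set.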
Mobile Data Collection
Recently, mobile devices have emerged as an important source of data. The following tools can be used to collect mobile data:
- Open Data Kit (ODK) is a free and open-source set of tools that help manage user metadata and mobile data collection solutions.
- CommCare is an open-source mobile platform designed for data collection, client management, decision support, and behavior change communication.
- GeoODK provides a way to capture and store geo-referenced information, along with a suite of tools to visualize, analyze and manipulate ground data for specific needs.
The main role of data mining is to convert data that is either semi-structured or completely unstructured into structured data and make it easily usable for further processing. There can be many sources of data and many different techniques for getting it. In upcoming posts we will discuss the further processes of data science and how to use this mined data.