“Data is the new oil” — anonymous
Data is everywhere around us. More than 200 million Twitter users share content, videos and photos every day (social media data). 500 million smartphones are used for calls and text messages every second (smartphone user data, call data and message text data). 10 thousand airplanes are in the sky at any given time (aviation data). More than 10 million bank transactions are carried out daily (bank data).
According to recent research by IBM, 90 percent of the data in the world has been generated in the past three years alone. This data originates from everywhere: social media, website visits, bank transactions, documents and weather sensors, to name a few from a huge list. Data is a ubiquitous resource and one of the cheapest available on earth.
The volume of this data is hard to measure, and it grows every second. This immense amount of data is neither garbage nor waste; it is a diamond mine for organizations, industries and communities. Data is the fuel that drives effective decision making!
But how should this data be used? Herein lies the pivotal question. How can technology and algorithms be applied to make sense of a massive pile of data? There is a single answer to these questions: data science.
Data science is the practice of connecting the dots between the world of business and the world of data. It is the unconventional art of converting unstructured data into linked, cleaned and structured sets of productive information and useful knowledge. This information in turn enables effective decision making, smart planning and intelligent applications. The process of data science includes finding patterns, identifying insights and prototyping them. The major components that make it a complete end-to-end process are:
- Getting the data – Data Mining: The first step in any data science application is to obtain the right set of data. Data mining includes practices for obtaining data from particular sources and creating the required datasets. Sources of data vary from business to business. The most common approaches are extracting data from the open web, parsing documents and files, survey-based data collection, and collecting data from freely generated data streams.
- Fixing the data – Data Cleaning: A huge amount of data is available for use, but not every chunk of it is useful. This data standardization step includes cleaning the data, handling noise and converting it into an analysis-ready dataset.
- Analyzing the data – Data Analysis: The most important step of data science is to analyze the data and generate outputs. Popular approaches from statistics, machine learning and text mining are applied in this step. This includes:
  - Statistical measures such as model fitting, number crunching and regression.
  - Machine learning techniques such as linear classifiers, logistic functions or neural networks.
  - Natural language processing practices such as text mining, sentiment analysis or entity recognition.
- Actionable Insights – Data Prototyping: Data analysis results in actionable insights. These insights are used to make better, data-driven decisions and to develop data applications. Dashboards, visualizations, reports, sheets and applications are used to get a descriptive view of these insights.
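The four steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a production pipeline: the dataset (ad spend versus sales) is made up for the example, the function names `get_data`, `clean_data`, `analyze` and `report` are hypothetical, and a simple least-squares line stands in for the analysis step.

```python
import csv
import io

# Made-up raw data for illustration only: ad spend vs. sales,
# including a missing value and a malformed value as "noise".
RAW_CSV = """ad_spend,sales
10,25
20,45
,60
30,65
abc,70
40,85
"""


def get_data(text):
    """Getting the data: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_data(rows):
    """Fixing the data: keep only rows where both fields are numeric."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append((float(row["ad_spend"]), float(row["sales"])))
        except (TypeError, ValueError):
            continue  # drop rows with missing or malformed values
    return cleaned


def analyze(points):
    """Analyzing the data: fit sales = slope * ad_spend + intercept
    by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def report(slope, intercept, n_raw, n_clean):
    """Actionable insights: summarize the result in plain text."""
    return (
        f"Used {n_clean} of {n_raw} rows after cleaning. "
        f"Each extra unit of ad spend is associated with "
        f"{slope:.2f} extra units of sales (baseline {intercept:.2f})."
    )


rows = get_data(RAW_CSV)
points = clean_data(rows)
slope, intercept = analyze(points)
print(report(slope, intercept, len(rows), len(points)))
```

In a real project each function would be far larger: the data would come from the web, documents or streams rather than a string, and the report would typically be a dashboard or visualization. The shape of the process stays the same.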
The following infographic shows some examples of real-life business problems involving data science and their impact.
In this article, I discussed the definition of data science, its major components and its usefulness. Naturally, real-world scenarios involve a lot more. Feel free to share your views in the comment section.