Introduction to Big Data
Big data is simply the name people use for data that has any or all of the characteristics known as the 3 V’s: volume, velocity and variety. Though data collection itself isn’t new, recent advances in chip and sensor technology, the Internet, cloud computing, and our ability to store and analyze data have changed the quantity of data we can collect. Things that have been a part of everyday life for decades (shopping, listening to music, taking pictures, talking on the phone) now happen more and more, wholly or in part, in the digital realm, and therefore leave a trail of data.
The other big change is in the kind of data we can analyze. It used to be that data fit neatly into tables and spreadsheets, things like sales figures and wholesale prices and the number of customers that came through the door. Now data analysts can also look at “unstructured” data like photos, tweets, emails, voice recordings and sensor data to find patterns.
Know thy customer: Yahoo big data
Yahoo analyzed its visitors’ click data and found which customers were visiting which pages on its sites. Based on this, it customized the links shown on the home page. This is a non-trivial task, given that Yahoo gets 35 million clicks a day and the system has to generate 45,000 totally unique versions of the personalized items every five minutes. The results are fairly impressive.
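The idea of turning click data into personalized links can be sketched in a few lines of Python. This is only an illustration, not Yahoo’s actual system: the click log, the visitor names and the function top_links_per_visitor are all hypothetical, and a real system would work on streaming data at a vastly larger scale.

```python
from collections import Counter, defaultdict

def top_links_per_visitor(clicks, k=2):
    """Return each visitor's k most-clicked pages, most frequent first."""
    counts = defaultdict(Counter)
    for visitor, page in clicks:
        counts[visitor][page] += 1
    return {v: [page for page, _ in c.most_common(k)] for v, c in counts.items()}

# Hypothetical click log: (visitor_id, page) pairs.
clicks = [
    ("alice", "sports"), ("alice", "finance"), ("alice", "sports"),
    ("bob", "news"), ("bob", "news"), ("bob", "weather"), ("alice", "sports"),
]

personalized = top_links_per_visitor(clicks)
print(personalized)
```

Each visitor ends up with a short, ranked list of pages that could drive the links shown on a home page.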
WHAT are you getting into and WHY?
A breakthrough in machine learning would be worth 10 Microsofts – William H Gates
I am sure you have already heard that we have been generating an awful lot of data of late. The following are my personal favorites. An Apple iPad has roughly 64 GB of space. If we were to use iPads to store the data we are creating, storing just the data created in 2012 would build a mountain approximately 20 times taller than Mount Everest!
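To get a feel for the scale, here is a rough back-of-the-envelope calculation. It assumes one zettabyte (10^21 bytes, the yearly figure mentioned later in this section) and decimal gigabytes; it only counts devices and does not attempt to reproduce the Mount Everest comparison, which depends on figures not given here.

```python
ZETTABYTE = 10**21          # bytes, decimal definition (assumption)
IPAD_CAPACITY = 64 * 10**9  # 64 GB in bytes, decimal definition (assumption)

# Number of 64 GB iPads needed to hold one zettabyte.
ipads_needed = ZETTABYTE // IPAD_CAPACITY
print(f"{ipads_needed:,} iPads per zettabyte")
```

Under these assumptions, a single zettabyte fills more than fifteen billion iPads.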
Here is another real-world comparison.
A zettabyte is about the amount of data we are creating every year. So, the point is that data is exploding and so is the demand for data scientists. While we are at it, data science is my favorite term, but there are a lot of synonyms. Product companies like Google, Facebook and Amazon refer to this as Big Data. IBM referred to this as “Business Analytics and Optimization”. Data Analytics is also widely used. Universities call the field “Data Mining” or “Machine Learning”, or sometimes use more esoteric terms like “Pattern Recognition” and “Statistical Learning”. While one can spend a lot of time analyzing the nuances and differences between these terms, for our purposes they all mean roughly the same thing. However, data science is different from Business Intelligence (BI). Taking the crude example of a car, BI is like the rearview mirror: it shows what happened in the past. Data science is like the headlights: it gives an indication of what lies ahead. This ability to predict, according to many, is capable of delivering unprecedented ROIs (returns on investment).
What do they do?
“The sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s?” – Google’s Hal Varian
Forecasting/Prediction
Classification
Simulations
As enterprises become increasingly dependent on the amount of data they are able to collect and analyze, they are seeking better ways to process information on a large scale. Analyzing massive data sets quickly helps businesses gain better insights and build a single view of their place within an industry.
Most people have some idea that companies are using big data to better understand and target customers. Using big data, retailers can predict what products will sell, telecom companies can predict if and when a customer might switch carriers, and car insurance companies can understand how well their customers actually drive. Big data is also used to optimize business processes. Retailers are able to optimize their stock levels based on what’s trending on social media, what people are searching for on the web, or even weather forecasts. Supply chains can be optimized so that delivery drivers use less gas and reach customers faster.
But big data goes way beyond shopping and consumerism. Big data analytics enable us to find new cures and to better understand and predict the spread of diseases. Police forces use big data tools to catch criminals and even predict criminal activity, and credit card companies use big data analytics to detect fraudulent transactions. A number of cities are even using big data analytics with the aim of turning themselves into Smart Cities, where a bus would know to wait for a delayed train and where traffic signals predict traffic volumes and operate to minimize jams.
Adaptive Industries
Healthcare
Far more medical information can now be collected and analyzed in near real time, which helps doctors improve patient care. Coordinating data from medical records and comparing the results of case studies is essential for hospitals, doctors and health laboratories to gain insights. With so many healthcare devices now connected to the Internet and to each other, researchers are making connections between previously disparate sets of information, leading to breakthroughs in treatment. Big data is playing such a pivotal role in the healthcare industry that a recent study by EMC found that 59 percent of providers said that fulfilling their mission objectives over the next five years will depend on the successful use of big data.
Telecommunications
Big data can power a single, unified view of telco customers, which helps providers deliver personalized, satisfying customer experiences. This single view helps you acquire new subscribers, grow existing relationships and retain valuable customers. According to research conducted by IBM, telecoms that have employed big data projects experienced a 92 percent decrease in processing time when analyzing network and call data. With better analytical efficiency comes improved service, customer satisfaction and loyalty.
Utilities
Utilities provide energy, measure its use and collect payment, but the system they manage is far from simple. When data is scarce, a utility’s ability to calculate outages, forecast service needs or coordinate maintenance is reduced to educated guesswork. With big data analysis of smart meter streams combined with other data, utility companies optimize the use of their generation and maintenance resources.
Banking
Financial institutions manage our money, and we expect them to protect our financial information. Big data improves the IT security posture of banks, analyzing network behavior to find suspicious or anomalous behavior. Apart from helping to improve cyber security, big data analysis can also improve a financial institution’s ability to calculate credit scores, set competitive interest rates and predict which customers are at risk of default on their loans.
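One simple way to think about finding anomalous behavior is to flag values that sit far from the average. The sketch below uses a z-score test on made-up transaction amounts; the data, the threshold and the function flag_anomalies are assumptions for illustration only, and real fraud or intrusion detection systems use far richer models.

```python
from statistics import mean, stdev

def flag_anomalies(amounts, threshold=2.0):
    """Return the amounts whose z-score exceeds the threshold in absolute value."""
    mu = mean(amounts)
    sigma = stdev(amounts)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [a for a in amounts if abs(a - mu) / sigma > threshold]

# Hypothetical transaction amounts; one is far larger than the rest.
transactions = [42.0, 38.5, 45.0, 40.0, 41.5, 39.0, 43.0, 950.0]

suspicious = flag_anomalies(transactions)
print(suspicious)
```

A z-score is easy to explain but crude: a single extreme value inflates the standard deviation, so with small samples the threshold has to be chosen with care.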
Insurance
Insurance companies have to calculate hundreds of interrelated variables in order to coordinate policies for their clients. They need to accurately assess risk to properly price their products, and big data analysis makes those calculations much easier and faster to complete. Industry experts also suggest that big data can lead insurance companies to discover links between variables they never had the ability to consider before.
Travel
Travel companies decrease cost and improve traveler satisfaction by enriching existing data on travel routes and weather patterns with more detailed big data points on fuel costs, ticket prices and space availability. This detail allows them to improve logistics, safety and satisfaction.
Retail
The business problem of supply chain management is an ideal candidate for data-driven improvement. Either products are on the shelf when the customer wants to buy them, or they are not, and delivering supplies on time requires constant vigilance of thousands of ever-changing variables. Organizations with the most efficient and effective supply chains beat the competition, and big data analysis can help maintain ideal stock levels.
Big Data Engineering
Let us take a look at two important, evolving ecosystems: big data engineering, for storing and querying data in non-relational databases, and data analysis, for extracting insights from the data.
Big Data Engineering: To store such large amounts of data in the first place, we do not need large relational databases, which are expensive. A new paradigm called NoSQL is being developed for non-relational databases of very large size. The goal is not strict consistency but failover and speed. The current options include Cassandra, MongoDB, CouchDB, Redis, Riak, HBase, Couchbase, Neo4j, Hypertable, ElasticSearch, Accumulo, VoltDB and Scalaris.
HBase and Cassandra are columnar/table-oriented databases. As HBase is an Apache project built on top of HDFS, it is probably the best integrated with Hadoop. For a highly distributed data store, Cassandra is a good choice (e.g. Netflix). Shopping carts and other table-oriented applications are better suited to columnar databases. MongoDB provides document-oriented storage. A hierarchical, text-oriented use case (blogs, comments, even a product catalogue) is better suited to document-oriented databases. MongoDB also enables complex queries. Couchbase combines key-value and document-oriented architectures.
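The difference between the two data models can be pictured with plain Python dictionaries. This is a conceptual sketch only: it does not use any real database client, and the row keys, column names and document fields are invented for illustration.

```python
# Wide-column style: each row key maps to a flat set of column -> value pairs.
cart_rows = {
    "cart:1001": {"user": "alice", "item:sku-1": 2, "item:sku-7": 1},
    "cart:1002": {"user": "bob", "item:sku-3": 5},
}

# Document style: one nested record holds the post and its comments together.
blog_post = {
    "_id": "post-42",
    "title": "Intro to Big Data",
    "comments": [
        {"author": "alice", "text": "Nice overview"},
        {"author": "bob", "text": "What about Hive?"},
    ],
}

# Column lookup: read an individual cell by row key and column name.
quantity = cart_rows["cart:1001"]["item:sku-7"]

# Document query: filter on a field nested inside the hierarchy.
bob_comments = [c["text"] for c in blog_post["comments"] if c["author"] == "bob"]

print(quantity, bob_comments)
```

The shopping cart reads naturally as rows and columns, while the blog post keeps its hierarchy intact, which is why each use case tends to favor a different kind of store.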
For analytics on the data, we need the Hadoop framework, comprising the HDFS storage layer and the MapReduce engine. Pig has been developed to create a scripting-like environment for Hadoop applications. It makes programming Hadoop much easier. Hive is an SQL-like query engine that again makes querying in Hadoop a lot easier.
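The MapReduce model itself can be illustrated with the classic word count. The sketch below runs in a single Python process and does not use the Hadoop API; the function names map_phase, shuffle and reduce_phase are chosen here for illustration. In Hadoop the same three steps are spread across many machines, with HDFS holding the input and output.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

lines = ["big data is big", "data is everywhere"]

mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

Tools like Pig and Hive let you express this kind of job as a short script or an SQL-like query instead of writing the map and reduce functions by hand.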