Data Science
Skills for a Data Scientist:
The data scientist's way of thinking:
In the early stages, when we have a problem, a physicist or chemist understands the problem and comes up with a theory. Engineers then take these equations and change a few parameters to come up with thumb rules or guidelines. This is termed deductive learning, under which a hypothesis counts for little until it is proved; every new development in science is proved either with mathematics, with experiments, or with both.
Inductive learning: Experiment is the way of learning — make some observations and systematically generalize them. This is very different from what we do in deductive learning. We need an enormous amount of intuition to generalize from observations, but with the help of computers we can do it much better.
How is a data scientist different from statisticians or traditional scientists? Mathematics and physics believe in making a hypothesis and then using models or experiments to prove it, whereas a data scientist starts with experimentation and then arrives at observations. The end result may be the same.
But we (data scientists) don’t know the reason. We know correlation, but may or may not know causation. (Supermarket example.)
The data scientist's job is to extract insights; deciding what to do with them is the domain expert's job.
Where does data science not work?
We need a lot more data to come to correct conclusions. (In inductive learning, data is needed to build the theory; in deductive learning, data is needed to validate the theory.)
When randomness prevails far more than expected (the Dilsukhnagar incident, stock markets).
Anything that doesn’t happen because of a systematic pattern is likely to fail or not be predictable. So data science won’t work when there is no systematic pattern, when there isn’t a lot of data, or when you are particular about causation.
Where does data science work?
When you are dealing with absolutely non-linear behaviors that still have a trend and a pattern, e.g. people change their stand based on the situation. This can’t be achieved with deductive models.
In networks (things are connected: changes in government impact stock markets). Collecting data used to be a problem, so deductive models were favored earlier. But now, with technology and storage, we are able to collect data, and inductive learning has become essential.
What skills does a data scientist require? A data scientist should listen to the data, systematically generalize and extract patterns using some mathematics, and tell a story. Data-munging skills are also required for a data scientist.
Is inductive learning better than intuition? Humans are very good at picking up what is important, but not very good at synthesizing. Simple models do better than experts.
Problems that can be solved in data science:
Unsupervised Learning – one of the most famous applications is clustering.
Supervised Learning – Classification. In classification, among the given data points there are attributes which are easy to measure (age, sex, salary, etc.) and attributes which are difficult to measure, like CLV. The entire purpose of classification is: run an experiment, measure all the easy-to-measure attributes along with the difficult-to-measure ones, and then learn to predict the difficult ones from the easy ones. Regression and classification are both supervised learning: I have a Y and a bunch of X’es, and from trial experiments where I know both x and y, the goal is to find the function f. In regression, y can take any value between –infinity and +infinity, so a line is drawn which passes closely to most of the points. In classification, y can take only a few values, so we split the space into categories.
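The regression case above — find f from (x, y) pairs so we can predict y for a new x — can be sketched with ordinary least squares on one feature. A minimal sketch; the data points are made up for illustration:

```python
# Supervised learning as "find f given (x, y) pairs":
# least-squares fit of a line y = a*x + b to toy data.

def fit_line(xs, ys):
    """Return slope a and intercept b minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# Trial experimentation: we know both x and y for a few points...
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]
a, b = fit_line(xs, ys)

# ...and the learned f lets us predict y for a new x.
prediction = a * 5 + b
```

The fitted line "passes closely to most of the points" in exactly the sense the notes describe: it minimizes the total squared vertical distance to them.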
Optimization: The mathematical/graphical intuition will help us when we think about:
o Classification – splitting the space into various buckets and putting each new value in the appropriate bucket.
o Regression – drawing a line closest to all the points, so that for any new value of x1, x2, x3, etc., I can predict y.
o Optimization – there is a very complex curve and I am searching it for a high or a low.
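The optimization intuition above — searching a curve for a low — can be sketched with plain gradient descent on a simple curve chosen for illustration:

```python
# Optimization as "searching a curve for a low point":
# gradient descent on f(x) = (x - 3)**2, whose minimum is at x = 3.

def grad(x):
    return 2 * (x - 3)       # derivative of (x - 3)**2

x = 0.0                      # start anywhere on the curve
for _ in range(1000):
    x -= 0.1 * grad(x)       # take a small step downhill
```

Each step moves against the slope, so x slides down the curve toward the minimum at 3; on a genuinely complex curve the same procedure only guarantees a local low, which is why the search framing matters.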
Simulation – Individuals are easy, but groups and interactions are complex; simulations help us solve these kinds of problems. Whenever we face a challenge in proving something mathematically, simulation works better. Examples: Monte Carlo, discrete-event simulation.
Example of a discovered (rather than deduced) rule: predicting that a kid will have asthma if the father's blood group is B +ve, the mother smoked during pregnancy, and the kid eats a lot of carbohydrates.
Difference between BI and predictive analytics? BI is like a rear-view mirror: it looks at the past and generates patterns intuitively. In predictive analytics, the patterns are generated systematically.
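A minimal Monte Carlo sketch of the "simulate when a closed-form proof is hard" idea: estimate pi by sampling random points in the unit square and counting how many fall inside the quarter circle (the classic textbook example, not from these notes):

```python
# Monte Carlo estimation of pi: the fraction of random points in the
# unit square that land inside the quarter circle approaches pi/4.
import random

random.seed(42)              # fixed seed so the run is repeatable
n = 100_000
inside = sum(1 for _ in range(n)
             if random.random() ** 2 + random.random() ** 2 <= 1.0)
pi_estimate = 4 * inside / n
```

No geometry proof is needed at run time; more samples simply buy a tighter estimate, which is the general trade simulations make.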
Which method should we use to solve the problem?
K-Nearest Neighbor – builds a non-linear model (which is hard to build directly) by simply consulting the nearest neighbors, without ever writing the model down.
Logistic Regression – draws a line to separate the clusters.
Decision Trees – non-linear, unlike logistic regression, which is linear.
Discriminative models (the models above) & generative models – generative models need more information to teach the system (e.g. Naïve Bayes).
Lazy versus eager learning – eager learning builds the model as soon as it sees the training data; lazy learning says "give me all the data, and I will build the model only when a prediction is needed". In real-time settings we mostly go for eager learning.
For class-imbalance problems, KNN is the best.
No-free-lunch theorem – it says that there is no single model that works well all the time.
All the models above are established models from the past two decades.
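The KNN idea above — and its "lazy" character — can be sketched in a few lines. The points and labels below are made-up toy data:

```python
# k-nearest-neighbor classification: a lazy learner. No model is built
# up front; all the work happens at prediction time by scanning the
# stored training points.
from collections import Counter

train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
         ((5.0, 5.0), "B"), ((6.0, 5.5), "B")]

def knn_predict(x, k=3):
    # sort training points by squared distance to the query point
    nearest = sorted(train,
                     key=lambda p: (p[0][0] - x[0]) ** 2 +
                                   (p[0][1] - x[1]) ** 2)
    # majority vote among the k closest labels
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

label = knn_predict((1.2, 1.5))
```

Note there is no training step at all: the non-linear decision boundary emerges implicitly from the stored points, which is exactly the "without ever writing the model down" observation above.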
What happens when we go to higher dimensions? The curse of dimensionality – at higher dimensions, most of the volume lies near the border and the core shrinks. For example, the core of a steel rod would make up a smaller and smaller share of its volume in higher dimensionalities. Spectral methods and ensembles do well in dealing with the curse of dimensionality.
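The "most of the volume is border" claim can be checked with a one-line calculation on a unit hypercube; the 0.05 margin below is an arbitrary choice for illustration:

```python
# Curse of dimensionality: in a unit hypercube, the fraction of volume
# within `margin` of some face is 1 - (1 - 2*margin)**d, so the interior
# "core" shrinks rapidly as the dimension d grows.

def border_fraction(d, margin=0.05):
    """Fraction of a unit hypercube lying within `margin` of a face."""
    return 1 - (1 - 2 * margin) ** d

low = border_fraction(2)      # in 2-D, only ~19% of the square is border
high = border_fraction(50)    # in 50-D, almost everything is border
```

So a margin that is negligible in 2-D swallows essentially the whole volume by 50 dimensions — the core really does "come down".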
How to deal with non-linear data? A neural net is a non-linear logistic regression. The kernel trick & Support Vector Machines: an SVM will work better than logistic regression in higher dimensions, with better accuracy.
What will be the output of a model? Rules or an equation, depending on the business.
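The kernel-trick idea can be sketched without any SVM machinery: XOR-like data is not linearly separable in (x1, x2), but after lifting to a space that includes the product feature x1*x2, a single linear threshold separates the classes. A toy illustration of the lifting idea only, not a full SVM:

```python
# Kernel-trick intuition: lift the data into a higher-dimensional
# feature space where a linear rule suffices. XOR-like points cannot be
# split by any line in (x1, x2), but the sign of x1*x2 splits them.

points = [(( 1,  1), "pos"), ((-1, -1), "pos"),
          (( 1, -1), "neg"), ((-1,  1), "neg")]

def classify(x1, x2):
    # In the lifted space (x1, x2, x1*x2), a linear threshold on the
    # new coordinate is enough: positive product -> "pos".
    return "pos" if x1 * x2 > 0 else "neg"

all_correct = all(classify(*p) == label for p, label in points)
```

A real kernel method never computes the lifted coordinates explicitly — it works through inner products — but the geometric picture is the same.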
What do we do when we get a problem (Solution Architecture)?
ROI understanding
a. Business problem
b. Current approach
Feature engineering
a. Can I add to or transform existing attributes to generate new attributes?
Get the data into structured form – Sharpen the data
Explore & visualize the data using charts
Build the model
Storytelling
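The feature-engineering step above — transforming existing attributes into new ones — can be sketched with the common BMI example (the records and the derived attribute are made up for illustration):

```python
# Feature engineering: derive a new, more informative attribute (BMI)
# from two existing ones (weight and height). Toy records only.

records = [{"weight_kg": 70, "height_m": 1.75},
           {"weight_kg": 90, "height_m": 1.80}]

for r in records:
    # transform existing attributes into a new attribute
    r["bmi"] = round(r["weight_kg"] / r["height_m"] ** 2, 1)
```

The raw attributes are easy to measure, while the derived one carries the signal a model actually needs — the same easy-versus-difficult split discussed in the classification section.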
Python or R – which one is best? For a consultant, R will be the best tool, as the role requires more exploration. For a data analyst doing product development, Python will help more for building products quickly.