
Machine Learning Data Exploration & Preparation

Introduction

Hypothesis generation requires structured thinking, whereas data exploration requires the patience to slice and dice data in multiple ways. In this article, I will focus on the steps required to clean and understand data in a comprehensive way.

To improve your structured thinking, I would suggest you check out the excellent post written by Kunal – “Tools to Improve Structured Thinking”.

7 Steps of Data Exploration and Preparation (Part 1)
Remember: the quality of your inputs decides the quality of your output. So once you have your business hypothesis ready, it makes sense to spend a lot of time and effort here. By my personal estimate, data exploration, cleaning and preparation can take up to 70% of your total project time.

Below are the steps involved to understand, clean and prepare your data for building your predictive model:
1. Variable Identification
2. Univariate Analysis
3. Bi-variate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation
Finally, we will need to iterate over steps 4 – 7 multiple times before we come up with our refined model.

Let’s now study each stage in detail:

Step-1: Variable Identification

First, identify Predictor (Input) and Target (output) variables. Next, identify the data type and category of the variables.

Let’s understand this step more clearly by taking an example.

Example:- Suppose we want to predict whether students will play cricket or not (refer to the data set below). Here, you need to identify the predictor variables, the target variable, the data type of the variables and the category of the variables.

Below, the variables have been defined in their different categories:

Step-2: Univariate Analysis

At this stage, we explore variables one by one. The method of performing univariate analysis depends on whether the variable type is categorical or continuous. Let’s look at these methods and statistical measures for categorical and continuous variables individually:

Continuous Variables:- In the case of continuous variables, we need to understand the central tendency and spread of the variable. These are measured using various statistical metrics and visualization methods, as shown below:

Note: Univariate analysis is also used to highlight missing and outlier values. In the upcoming parts of this series, we will look at methods to handle missing and outlier values. To know more about these methods, you can refer to the Descriptive Statistics course from Udacity.
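
As a quick illustration (a minimal sketch with a made-up numeric variable, not data from the article), these measures are one-liners in R:

# hypothetical continuous variable
income <- c(42, 55, 48, 61, 75, 38, 52, 49, 66, 58)
summary(income)   # min, quartiles, median, mean, max
var(income)       # variance
sd(income)        # standard deviation
IQR(income)       # inter-quartile range
hist(income)      # histogram of the distribution
boxplot(income)   # box plot of spread and potential outliers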

Categorical Variables:- For categorical variables, we’ll use a frequency table to understand the distribution of each category. We can also read it as the percentage of values under each category. It can be measured using two metrics, Count and Count%, against each category. A bar chart can be used as visualization.
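
Similarly, here is a minimal R sketch (with a made-up gender variable) of the Count, Count% and bar chart described above:

# hypothetical categorical variable
gender <- factor(c("Male", "Female", "Female", "Male", "Female"))
freq <- table(gender)     # Count against each category
freq
prop.table(freq) * 100    # Count% against each category
barplot(freq)             # bar chart visualization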

Step-3: Bi-variate Analysis

Bi-variate analysis finds the relationship between two variables. Here, we look for association and disassociation between variables at a pre-defined significance level. We can perform bi-variate analysis for any combination of categorical and continuous variables: Categorical & Categorical, Categorical & Continuous, and Continuous & Continuous. Different methods are used to tackle these combinations during the analysis process.

Let’s understand the possible combinations in detail:

Continuous & Continuous: While doing bi-variate analysis between two continuous variables, we should look at a scatter plot. It is a nifty way to find out the relationship between two variables. The pattern of the scatter plot indicates the relationship between the variables, which can be linear or non-linear.

A scatter plot shows the relationship between two variables but does not indicate the strength of that relationship. To find the strength of the relationship, we use correlation, which varies between -1 and +1.

  • -1: perfect negative linear correlation

  • +1: perfect positive linear correlation

  • 0: no correlation

Correlation can be derived using following formula:

Correlation = Covariance(X, Y) / sqrt(Var(X) * Var(Y))

Various tools have functions to identify the correlation between variables. In Excel, the function CORREL() returns the correlation between two variables, and in SAS the procedure PROC CORR does the same. These functions return the Pearson correlation value:

In the above example, we have a good positive relationship (0.65) between the two variables X and Y.
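
In R (to complement the Excel and SAS functions above), cor() returns the same Pearson value; a minimal sketch with made-up X and Y vectors (so the result here will differ from the 0.65 above):

# hypothetical paired observations
X <- c(1, 2, 3, 4, 5, 6)
Y <- c(2, 3, 5, 4, 6, 8)
cor(X, Y)                            # built-in Pearson correlation
cov(X, Y) / sqrt(var(X) * var(Y))    # identical value from the formula above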

Categorical & Categorical: To find the relationship between two categorical variables, we can use following methods:

  • Two-way table: We can start analyzing the relationship by creating a two-way table of count and count%. The rows represent the categories of one variable and the columns represent the categories of the other. We show the count or count% of observations available in each combination of row and column categories.

  • Stacked Column Chart: This method is more of a visual form of a two-way table.

  • Chi-Square Test: This test is used to derive the statistical significance of the relationship between the variables. It also tests whether the evidence in the sample is strong enough to generalize the relationship to a larger population. Chi-square is based on the difference between the expected and observed frequencies in one or more categories of the two-way table. It returns a probability for the computed chi-square statistic with its degrees of freedom.

Probability of 0: It indicates that both categorical variables are dependent.

Probability of 1: It shows that both variables are independent.

Probability less than 0.05: It indicates that the relationship between the variables is significant at 95% confidence. The chi-square test statistic for a test of independence of two categorical variables is found by:

Chi-square = Σ [ (O - E)² / E ]

where O represents the observed frequency and E is the expected frequency under the null hypothesis, computed for each cell as:

E = (row total x column total) / sample size

From the previous two-way table, the expected count for product category 1 to be of small size is 0.22. It is derived by taking the row total for Size (9) times the column total for Product category (2), then dividing by the sample size (81). This procedure is conducted for each cell. The statistical measures used to analyze the power of the relationship are:

  • Cramer’s V for nominal categorical variables

  • Mantel-Haenszel Chi-Square for ordinal categorical variables

Different data science languages and tools have specific methods to perform the chi-square test. In SAS, we can use Chisq as an option with Proc freq to perform this test; an equivalent R sketch follows below.
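
In R, the equivalent is chisq.test(); here is a minimal sketch on a made-up two-way table of counts (the sizes and categories are hypothetical, not the article’s data):

# hypothetical two-way table: Size x Product category
tbl <- matrix(c(20, 15, 12, 18, 10, 6), nrow = 2,
              dimnames = list(Size = c("Small", "Large"),
                              Category = c("C1", "C2", "C3")))
test <- chisq.test(tbl)
test$statistic   # the chi-square statistic
test$p.value     # < 0.05 suggests dependence at 95% confidence
test$expected    # expected frequencies E under independence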

Categorical & Continuous: While exploring the relation between categorical and continuous variables, we can draw box plots for each level of the categorical variable. If the number of levels is small, the plots alone will not show statistical significance. To look at the statistical significance, we can perform a Z-test, T-test or ANOVA.

  • Z-Test / T-Test:- Either test assesses whether the means of two groups are statistically different from each other or not.

  • If the probability of the Z statistic is small, the difference between the two averages is more significant. The T-test is very similar to the Z-test, but it is used when the number of observations in both categories is less than 30.

  • ANOVA:- It assesses whether the averages of more than two groups are statistically different.

Example: Suppose we want to test the effect of five different exercises. For this, we recruit 20 men and assign one type of exercise to each group of 4 men (5 groups). Their weights are recorded after a few weeks. We need to find out whether the effect of these exercises on them is significantly different or not. This can be done by comparing the weights of the 5 groups of 4 men each, as in the sketch below.
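
As a minimal R sketch (with simulated weights, purely for illustration), the one-way ANOVA for this design could look like:

# hypothetical data: 20 men, 5 exercise groups of 4
set.seed(1)
exercise <- factor(rep(paste0("E", 1:5), each = 4))
weight   <- rnorm(20, mean = 75, sd = 5)   # recorded weights (simulated here)
fit <- aov(weight ~ exercise)
summary(fit)   # a small Pr(>F) would indicate the group means differ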

7 Steps of Data Exploration and Preparation (Part 2)

Introduction

In Part-1 of this series, we looked at the first three steps of Data Exploration & Preparation, namely Variable identification, Univariate and Bivariate analysis. In this article, we will proceed from the fourth step.

We will look at the methods of missing value treatment. More importantly, we will also look at why missing values occur in our data and why treating them is necessary.

Why is missing value treatment required?

Missing data in the training data set can reduce the power / fit of a model, or can lead to a biased model because we have not analysed the behavior and relationships with other variables correctly. It can lead to wrong predictions or classifications.

Notice the missing values in the image shown above: in the left scenario, we have not treated missing values. The inference from this data set is that the chances of playing cricket are higher for males than for females.

On the other hand, if you look at the second table, which shows data after treatment of missing values (based on gender), we can see that females have higher chances of playing cricket compared to males.

Why does my data have missing values?

We looked at the importance of treating missing values in a dataset. Now, let’s identify the reasons these missing values occur. They may occur at two stages:

  1. Data Extraction: It is possible that there are problems with the extraction process. In such cases, we should double-check for correct data with the data guardians. Some hashing procedures can also be used to make sure the extraction is correct. Errors at the data extraction stage are typically easy to find and can be corrected easily as well.

  2. Data collection: These errors occur at the time of data collection and are harder to correct. They can be categorized into four types:

  • Missing completely at random: This is the case when the probability of a value being missing is the same for all observations. For example: respondents in a data collection process decide to declare their earnings after tossing a fair coin. If a head occurs, the respondent declares his / her earnings and vice versa. Here, each observation has an equal chance of a missing value.

  • Missing at random: This is the case when a variable is missing at random and the missing ratio varies for different values / levels of the other input variables. For example: we are collecting data on age, and females have a higher missing-value rate compared to males.

  • Missing that depends on unobserved predictors: This is the case when the missing values are not random and are related to an unobserved input variable. For example: in a medical study, if a particular diagnostic causes discomfort, then there is a higher chance of dropout from the study. This missingness is not random unless we have included “discomfort” as an input variable for all patients.

  • Missing that depends on the missing value itself: This is the case when the probability of a missing value is directly correlated with the missing value itself. For example: people with higher or lower incomes are likely to provide a non-response about their earnings.

Methods to treat Missing Values
  1. Deletion: It is of two types: list wise deletion and pair wise deletion.

  • In list wise deletion, we delete observations where any of the variables is missing. Simplicity is one of the major advantages of this method, but it reduces the power of the model because it reduces the sample size.

  • In pair wise deletion, we perform analysis with all cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for analysis. One disadvantage is that it uses different sample sizes for different variables.

  • Deletion methods are used when the nature of the missing data is “missing completely at random”; otherwise, non-random missing values can bias the model output.

  2. Mean / Mode / Median Imputation: Imputation is a method to fill in the missing values with estimated ones. The objective is to employ the known relationships that can be identified in the valid values of the data set to assist in estimating the missing values. Mean / mode / median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute by the mean or median (quantitative attribute) or mode (qualitative attribute) of all known values of that variable. It can be of two types:

  • Generalized Imputation: In this case, we calculate the mean or median of all non-missing values of that variable, then replace the missing values with it. In the table above, the variable “Manpower” has missing values, so we take the average of all non-missing values of “Manpower” (28.33) and replace the missing value with it.

  • Similar case Imputation: In this case, we calculate the average of the non-missing values for gender “Male” (29.75) and “Female” (25) individually, then replace the missing values based on gender. For “Male”, we replace missing values of manpower with 29.75, and for “Female” with 25. Both approaches are sketched below.
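
Here is a minimal R sketch of both approaches (the data frame below is made up to mirror the example, so the numbers differ from the article’s table):

# hypothetical data mirroring the Manpower / Gender example
df <- data.frame(Gender   = c("Male", "Male", "Female", "Female", "Male"),
                 Manpower = c(30, NA, 25, NA, 29.5))

# Generalized imputation: overall mean of non-missing values
df$Manpower_gen <- ifelse(is.na(df$Manpower),
                          mean(df$Manpower, na.rm = TRUE),
                          df$Manpower)

# Similar case imputation: mean within each gender
df$Manpower_sim <- ave(df$Manpower, df$Gender,
                       FUN = function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))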

  3. Prediction Model: The prediction model is one of the more sophisticated methods for handling missing data. Here, we create a predictive model to estimate the values that will substitute for the missing data. In this case, we divide our data set into two sets: one set with no missing values for the variable and another with missing values. The first set becomes the training data set of the model, while the second (with missing values) is the test data set, and the variable with missing values is treated as the target variable. Next, we create a model to predict the target variable based on the other attributes of the training data set, and populate the missing values of the test data set. We can use regression, ANOVA, logistic regression and various other modeling techniques to perform this. There are 2 drawbacks to this approach:

  • The model estimated values are usually more well-behaved than the true values

  • If there are no relationships between the other attributes in the data set and the attribute with missing values, then the model will not be precise in estimating the missing values.

  4. KNN Imputation: In this method of imputation, the missing values of an attribute are imputed using the given number of observations that are most similar to the observation whose values are missing. The similarity of two observations is determined using a distance function. The method has certain advantages and disadvantages.

  • Advantages:

  • k-nearest neighbour can predict both qualitative & quantitative attributes

  • Creation of predictive model for each attribute with missing data is not required

  • Attributes with multiple missing values can be easily treated

  • Correlation structure of the data is taken into consideration

  • Disadvantages:

  • The KNN algorithm is very time-consuming when analyzing large databases; it searches through the entire dataset looking for the most similar instances.

  • The choice of k is very critical. A higher value of k would include neighbours that are significantly different from the observation we are imputing, whereas a lower value of k means missing out on significant neighbours. One possible implementation is sketched below.
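
As one possible implementation (an assumption on my part; the article does not prescribe a package), the VIM package provides a ready-made kNN() imputation:

# a sketch using the VIM package; install.packages("VIM") if needed
library(VIM)
df <- data.frame(age    = c(23, 31, NA, 45, 29),
                 income = c(40, 55, 48, 70, NA))
imputed <- kNN(df, k = 3)   # fill each NA from the 3 most similar rows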

How to detect Outliers in your dataset and treat them?

In the last two articles of this series (data exploration & preparation), we looked at variable identification, univariate and bi-variate analysis, and missing values treatment. In this article, we will look at the next step of data preparation: outlier detection and treatment.

What is an Outlier?

Outlier is a commonly used term among analysts and data scientists, as outliers need close attention, else they can result in wildly wrong estimations. Simply speaking, an outlier is an observation that appears far away from, and diverges from, the overall pattern in a sample.

Let’s take an example: we do customer profiling and find that the average annual income of customers is $0.8 million. But there are two customers with annual incomes of $4 million and $4.2 million. These two customers’ annual incomes are much higher than the rest of the population’s, so these two observations will be seen as outliers.

What are the types of Outliers?

Outliers can be of two types: univariate and multivariate. Above, we discussed an example of a univariate outlier. These outliers can be found when we look at the distribution of a single variable. Multivariate outliers are outliers in an n-dimensional space; in order to find them, you have to look at distributions in multiple dimensions.

Let us understand this with an example. Say we are studying the relationship between height and weight. Below, we have the univariate and bivariate distributions for height and weight. Take a look at the box plot: we do not have any outlier (above or below 1.5 * IQR, the most common method). Now look at the scatter plot: here, we have two values below and one above the average in a specific segment of weight and height.

What causes Outliers?

Whenever we come across outliers, the ideal way to tackle them is to find out the reason for having them; the method of dealing with them then depends on the reason for their occurrence. Causes of outliers can be classified into two broad categories:

  1. Artificial (Error) / Non-natural

  2. Natural.

Let’s understand various types of outliers in more detail:

  • Data Entry Errors:- Human errors such as errors caused during data collection, recording, or entry can cause outliers in data. For example: Annual income of a customer is $100,000. Accidentally, the data entry operator puts an additional zero in the figure. Now the income becomes $1,000,000 which is 10 times higher. Evidently, this will be the outlier value when compared with rest of the population.

  • Measurement Error: It is the most common source of outliers. This is caused when the measurement instrument used turns out to be faulty. For example: There are 10 weighing machines. 9 of them are correct, 1 is faulty. Weight measured by people on the faulty machine will be higher / lower than the rest of people in the group. The weights measured on faulty machine can lead to outliers.

  • Experimental Error: Another cause of outliers is experimental error. For example: in a 100m sprint of 7 runners, one runner missed concentrating on the ‘Go’ call, which caused him to start late and made his run time longer than the other runners’. His total run time can be an outlier.

  • Intentional Outlier: This is commonly found in self-reported measures that involve sensitive data. For example: teens would typically under-report the amount of alcohol they consume, and only a fraction of them would report the actual value. Here, the actual values might look like outliers because the rest of the teens are under-reporting their consumption.

  • Data Processing Error: Whenever we perform data mining, we extract data from multiple sources. It is possible that some manipulation or extraction errors may lead to outliers in the dataset.

  • Sampling error: For instance, we have to measure the height of athletes. By mistake, we include a few basketball players in the sample. This inclusion is likely to cause outliers in the dataset.

  • Natural Outlier: When an outlier is not artificial (due to error), it is a natural outlier. For instance, in my last assignment with a renowned insurance company, I noticed that the performance of the top 50 financial advisors was far higher than that of the rest of the population. Surprisingly, it was not due to any error. Hence, whenever we performed any data mining activity with advisors, we treated this segment separately.

What is the Impact of Outliers on a dataset?

Outliers can drastically change the results of data analysis and statistical modeling. There are numerous unfavourable impacts of outliers in a data set:

  • They increase the error variance and reduce the power of statistical tests

  • If the outliers are non-randomly distributed, they can decrease normality

  • They can bias or influence estimates that may be of substantive interest

  • They can also violate the basic assumptions of regression, ANOVA and other statistical models

To understand the impact deeply, let’s take an example to check what happens to a data set with and without outliers in the data set.

Example:

As you can see, the data set with the outlier has a significantly different mean and standard deviation. In the first scenario, we would say that the average is 5.45; with the outlier, the average soars to 30. This would change the estimate completely.

How to detect Outliers?

The most commonly used method to detect outliers is visualization. We use various visualization methods, like box plot, histogram and scatter plot (above, we used the box plot and scatter plot). Some analysts also use various rules of thumb to detect outliers. Some of them are:

  • Any value which falls below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR (where Q1 and Q3 are the first and third quartiles)

  • Use capping methods: any value outside the range of the 5th and 95th percentiles can be considered an outlier

  • Data points three or more standard deviations away from the mean are considered outliers

  • Outlier detection is merely a special case of the examination of data for influential data points and it also depends on the business understanding

  • Bivariate and multivariate outliers are typically measured using either an index of influence or leverage, or distance. Popular indices such as Mahalanobis’ distance and Cook’s D are frequently used to detect outliers.

  • In SAS, we can use PROC UNIVARIATE and PROC SGPLOT. To identify outliers and influential observations, we also look at statistical measures like STUDENT, COOKD, RSTUDENT and others. The IQR and percentile rules are sketched in R below.
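
As a minimal R sketch of the IQR and percentile rules above (with a made-up vector containing one extreme value):

# hypothetical variable with one extreme value
x <- c(5.2, 5.8, 4.9, 6.1, 5.5, 30)

# 1.5 x IQR rule
q   <- quantile(x, c(0.25, 0.75))
iqr <- IQR(x)
x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]   # flagged outliers

# 5th / 95th percentile rule, and capping at those percentiles
p <- quantile(x, c(0.05, 0.95))
x[x < p[1] | x > p[2]]        # flagged outliers
pmin(pmax(x, p[1]), p[2])     # capped (winsorized) values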

How to remove Outliers?

Most of the ways to deal with outliers are similar to the methods for missing values: deleting observations, transforming them, binning them, treating them as a separate group, imputing values and other statistical methods. Here, we will discuss the common techniques used to deal with outliers:

Deleting observations: We delete outlier values if they are due to data entry or data processing errors, or if the outlier observations are very small in number. We can also use trimming at both ends to remove outliers.

Transforming and binning values: Transforming variables can also eliminate outliers. Taking the natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation. The decision tree algorithm deals with outliers well because it bins variables. We can also use the process of assigning weights to different observations.

Imputing: As with missing values, we can also impute outliers, using mean, median or mode imputation methods. Before imputing values, we should analyse whether an outlier is natural or artificial. If it is artificial, we can go ahead with imputing values. We can also use a statistical model to predict the values of outlier observations and then impute the predicted values.

Treat separately: If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat both groups as two different groups, build an individual model for each group and then combine the output.

Feature Engineering: How to transform variables and create new ones?

One common piece of advice machine learning experts have for beginners is: focus on feature engineering. Be it a beginner building his first model or someone who has won Kaggle competitions, following this advice works wonders for everyone!

I have personally seen the predictive power of several models improve significantly with the application of feature engineering.

What is Feature Engineering?

Feature engineering is the science (and art) of extracting more information from existing data. You are not adding any new data here, but you are actually making the data you already have more useful.

For example, let’s say you are trying to predict footfall in a shopping mall based on dates. If you try to use the dates directly, you may not be able to extract meaningful insights from the data. This is because footfall is less affected by the day of the month than by the day of the week. This information about the day of the week is implicit in your data; you need to bring it out to make your model better.

This exercise of bringing out information from data is known as feature engineering.

Process of Feature Engineering:

You perform feature engineering once you have completed the first five steps of data exploration: variable identification, univariate and bivariate analysis, missing values imputation and outlier treatment. Feature engineering itself can be divided into 2 steps:

  • Variable transformation

  • Variable / Feature creation.

These two techniques are vital in data exploration and have a remarkable impact on the power of prediction. Let’s understand each of these steps in more detail.

What is Variable Transformation?

In data modelling, transformation refers to the replacement of a variable by a function of it. For instance, replacing a variable x by its square / cube root or logarithm is a transformation. In other words, transformation is a process that changes the distribution or relationship of a variable with others.

Let’s look at the situations when variable transformation is useful.

When should we use Variable Transformation?

Below are the situations where variable transformation is needed:

  • When we want to change the scale of a variable or standardize its values for better understanding. While this transformation is a must if you have data on different scales, it does not change the shape of the variable’s distribution.

  • When we want to transform complex non-linear relationships into linear relationships. A linear relationship between variables is easier to comprehend than a non-linear or curved relation, and these transformations also improve prediction. A scatter plot can be used to find the relationship between two continuous variables. Log transformation is one of the commonly used techniques in these situations.

  • Symmetric distributions are preferred over skewed distributions, as they are easier to interpret and generate inferences from. Some modeling techniques require a normal distribution of variables. So, whenever we have a skewed distribution, we can use transformations that reduce skewness. For a right-skewed distribution, we take the square / cube root or logarithm of the variable, and for a left-skewed one, we take the square / cube or exponential of the variable.

  • Variable transformation is also done from an implementation point of view (human involvement). Let’s understand it more clearly. In one of my projects on employee performance, I found that age has a direct correlation with the performance of the employee, i.e. the higher the age, the better the performance. From an implementation standpoint, launching an age-based programme might present implementation challenges. However, categorizing the sales agents into three age buckets of <30 years, 30-45 years and >45 years, and then formulating three different strategies for each group, is a judicious approach. This categorization technique is known as binning of variables.

What are the common methods of Variable Transformation?

There are various methods used to transform variables. As discussed, some of them include square root, cube root, logarithm, binning, reciprocal and many others. Let’s look at these methods in detail, highlighting the pros and cons of each.

  • Logarithm: The log of a variable is a common transformation method used to change the shape of its distribution on a distribution plot, generally to reduce right skewness. However, it cannot be applied to zero or negative values.

  • Square / Cube root: The square and cube root of a variable have a sound effect on its distribution, though not as significant as the logarithmic transformation. The cube root has its own advantage: it can be applied to negative values, including zero. The square root can be applied to positive values, including zero.

  • Binning: It is used to categorize variables, and is performed on original values, percentiles or frequencies. The choice of categorization technique is based on business understanding. For example, we can categorize income into three categories, namely High, Average and Low. We can also perform co-variate binning, which depends on the values of more than one variable. These transformations are sketched below.
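
Here is a minimal R sketch of these three transformations (the income values and bin boundaries are made up for illustration):

# hypothetical right-skewed income values
income <- c(20, 25, 30, 45, 60, 90, 200, 500)

log_income  <- log(income)                        # log: positive values only
sqrt_income <- sqrt(income)                       # square root: non-negative values
cbrt_income <- sign(income) * abs(income)^(1/3)   # cube root: works for negatives too

# binning into Low / Average / High
income_band <- cut(income, breaks = c(-Inf, 30, 90, Inf),
                   labels = c("Low", "Average", "High"))
table(income_band)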

What is Feature / Variable Creation & its Benefits?

Feature / variable creation is the process of generating new variables / features based on existing variable(s). For example, say we have date (dd-mm-yy) as an input variable in a data set. We can generate new variables like day, month, year, week and weekday that may have a better relationship with the target variable. This step is used to highlight the hidden relationships in a variable, as sketched below:
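
A minimal R sketch of this idea (the dd-mm-yy dates are made up):

# hypothetical dd-mm-yy dates
d <- as.Date(c("03-01-21", "15-06-21", "28-12-21"), format = "%d-%m-%y")
day     <- as.numeric(format(d, "%d"))
month   <- as.numeric(format(d, "%m"))
year    <- as.numeric(format(d, "%Y"))
week    <- as.numeric(format(d, "%U"))   # week of the year
weekday <- weekdays(d)                   # often the most useful new feature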

There are various techniques to create new features. Let’s look at some of the commonly used methods:

  • Creating derived variables: This refers to creating new variables from existing variable(s) using a set of functions or different methods. Let’s look at it through the “Titanic” Kaggle competition. In this data set, the variable age has missing values. To predict the missing values, we used the salutation (Master, Mr, Miss, Mrs) from the name as a new variable. How do we decide which variable to create? Honestly, this depends on the business understanding of the analyst, his curiosity and the set of hypotheses he might have about the problem. Methods such as taking the log of variables, binning variables and other variable transformation methods can also be used to create new variables.

  • Creating dummy variables: One of the most common applications of dummy variables is to convert a categorical variable into numerical variables. Dummy variables, also called indicator variables, are useful for taking a categorical variable as a predictor in statistical models. A dummy variable takes the values 0 and 1. Let’s take the variable ‘gender’. We can produce two variables, namely “Var_Male” with values 1 (Male) and 0 (not male), and “Var_Female” with values 1 (Female) and 0 (not female). We can also create dummy variables for a categorical variable with more than two classes, using n or n-1 dummy variables, as sketched below.
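
A minimal R sketch of both the explicit 0/1 indicators described above and R’s automatic n-1 dummy coding:

# hypothetical gender variable
df <- data.frame(gender = factor(c("Male", "Female", "Female", "Male")))

# explicit 0/1 indicator variables
df$Var_Male   <- ifelse(df$gender == "Male", 1, 0)
df$Var_Female <- ifelse(df$gender == "Female", 1, 0)

# model.matrix creates n-1 dummies automatically (reference coding)
model.matrix(~ gender, data = df)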

For further reading, here is a list of transformation / creation ideas which can be applied to your data.

5 Simple manipulations to extract maximum information out of your data

How would you distinguish the best, good and worst analysts in a group of analysts?

I would simply give them the same problem and data set and ask them to create a predictive model out of it.

Why is this a sure shot test?

This is the core of any analysis or data mining activity. The degree to which an analyst is able to bring out insights and trends is what differentiates a good analyst from a bad one. This is the sole idea on which competitions such as the ones running on Kaggle are built. In the larger scheme of things, it is this ability to extract information from data which differentiates companies like Google, Amazon and Capital One from their competitors.

There are 2 key steps involved every time you try to create this competitive advantage:

  1. Cleansing and transforming datasets to create a comprehensive data set which captures all possible hypotheses

  2. Choosing the right set of predictive modeling tools and techniques to work on the data set created in the last step and bring out insights

Both the steps in this process are equally important. We’ll focus on the first step in this article and discuss step 2 at a later date.

Building a comprehensive data set takes time and effort. The more time you can spend in transforming variables and extracting information, the better your model would be expected to perform. You cannot and should not rely entirely on your tools to identify trends, insights or relationships.

This article addresses a situation which we face on a daily basis as analysts: how do we extract maximum information from a given data set? By no means is this an exhaustive list of every possible manipulation, but it is good enough to get you thinking on the right track. If you have any other thoughts or practices which you use to extract more information about variables, please share them in the comments.

1. Create variables for difference in date, time and addresses

While you might be using date and time values on their own, you can create new variables by considering differences in dates and times. Here is an example hypothesis: an applicant who takes days to fill in an application form is likely to be less interested in / motivated by the product than someone who fills in the same application within 30 minutes. Similarly, for a bank, the time elapsed between the dispatch of login details for the online portal and the customer logging in might show the customer’s willingness to use the portal.

Another example is that a customer living closer to a bank branch is more likely to have higher engagement than a customer living far away.
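
A minimal R sketch of a date/time-difference feature (the timestamps are made up):

# hypothetical application start / submit timestamps
started   <- as.POSIXct("2021-03-01 10:00")
submitted <- as.POSIXct("2021-03-04 16:30")
hours_to_submit <- as.numeric(difftime(submitted, started, units = "hours"))
hours_to_submit   # elapsed time as a new predictor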

2. Create new ratios and proportions

Instead of just keeping past inputs and outputs in your dataset, creating ratios out of them might add a lot of value. Some of the ratios I have used in the past are input / output (past performance), productivity, efficiency and percentages. For example, in order to predict the future credit card sales performance of a branch, ratios like credit card sales / salesperson or credit card sales / marketing spend would be more powerful than just the absolute number of cards sold in the branch.

3. Apply standard transformations

By looking at variations and plots of a variable along with the output, you can see whether applying basic transformations creates a better relationship. The most commonly used transformations include log, exponential, quadratic and trigonometric variations. For example, log(Marketing spend) might have a more interpretable relationship with Sales than absolute marketing spend.

4. Include effect of influencer

Influencers can significantly impact the behaviour you are studying. Influencers come in various forms and sizes: an employee of your organization, an agent of your organization or a customer. Bringing in the impact of these related entities can improve models significantly. For example, a loan initiated by a sub-set of brokers (and not all brokers) might be more likely to be transferred to a different entity after the lock-in period. Similarly, there might be a sub-set of sales personnel who cross-sell more to their customers.

5. Check variables for seasonality and create the model for right period

A lot of businesses face some kind of seasonality. It could be driven by tax benefits, festive season or weather. If this is the case, you need to make sure that the data and variables are chosen for the right period. This article can give you more details on how to tackle effects of seasonality while modeling.

What do you think of these manipulations? Are there other transformations which you use while creating a predictive model? If so, please add them in the comments below.

Comprehensive guide for Data Exploration in R

Introduction

So far, we have covered detailed tutorials on data exploration using SAS and Python. What is the one piece missing to complete this series? I am sure you guessed it right. In this article, I will give a detailed tutorial on data exploration using R. For the reader’s ease, I will follow a format very similar to the Python tutorial, simply because of the sheer resemblance between the two languages.

Here are the operations I’ll cover in this article (refer to this article for similar operations in SAS):

1. How to load data file(s)?
2. How to convert a variable to a different data type?
3. How to transpose a table?
4. How to sort data?
5. How to create plots (histogram, scatter, box plot)?
6. How to generate frequency tables?
7. How to sample a data set?
8. How to remove duplicate values of a variable?
9. How to group variables to calculate count, average, sum?
10. How to recognize and treat missing values and outliers?
11. How to merge / join data sets effectively?

Part 1: How to load data file(s)?

Input data sets can be in various formats (.XLS, .TXT, .CSV, JSON). In R, it is easy to load data from any source, due to its simple syntax and the availability of predefined libraries. Here, I will take the examples of reading a CSV file and a tab separated file. read.table is also an alternative; however, read.csv is my preference given its simplicity.

Code:
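
(The file names below are placeholders; substitute your own paths.)

# reading a CSV file
mydata <- read.csv("train.csv", header = TRUE)

# reading a tab separated file
mydata2 <- read.delim("train.txt", header = TRUE, sep = "\t")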

All other Read commands are similar to the one mentioned above.

Part 2: How to convert a variable to a different data type?

Type conversions in R work as you would expect. For example, combining a character string with a numeric vector converts all the elements of the vector to character.

Use is.xyz to test for data type xyz; it returns TRUE or FALSE. Use as.xyz to explicitly convert to it. For example:
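
(All base R; x is a throwaway example vector.)

x <- c(1, 2, 3)
is.numeric(x)      # TRUE
as.character(x)    # "1" "2" "3"

# combining a character with numerics coerces the whole vector
c(x, "a")          # "1" "2" "3" "a"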

However, conversion between data structures is more critical than format conversion. Here is a grid which will guide you with these conversions:

Part 3: How to transpose a Data set?

It is also sometimes required to transpose a dataset from a wide structure to a narrow one. Here is the code you can use to do the same:

Code
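
(A minimal sketch; mydata and the score columns are placeholders. t() gives a pure transpose, while base reshape() handles wide-to-long restructuring.)

# pure transpose: rows become columns (returns a matrix)
mydata_t <- t(mydata)

# wide-to-long reshaping with base R
long <- reshape(mydata, direction = "long",
                varying = c("score1", "score2"),
                v.names = "score", timevar = "test")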

Part 4: How to sort a data frame?

Sorting of data can be done by using order() on a variable as an index. Sorting can be based on multiple variables, in ascending or descending order. Code:
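
(A minimal sketch; mydata, Age and Score are placeholder names. The minus sign reverses the order for numeric columns.)

# ascending by Age, then descending by Score
sorted <- mydata[order(mydata$Age, -mydata$Score), ]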

Part 5: How to create plots (Histogram)?

Data visualization in R is very easy and creates extremely pretty graphs. Here, I will create a distribution of scores in a class and then plot histograms with several variations.

score <- rnorm(n = 1000, mean = 80, sd = 20)

hist(score)

Let’s try to find the assumptions R takes to plot this histogram, and then modify a few of those assumptions.

histinfo <- hist(score)

histinfo

$breaks

[1] 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150

$counts

[1] 2 5 19 52 84 141 195 201 152 81 39 25 3 1

$density

[1] 0.0002 0.0005 0.0019 0.0052 0.0084 0.0141 0.0195 0.0201 0.0152

[10] 0.0081 0.0039 0.0025 0.0003 0.0001

$mids

[1] 15 25 35 45 55 65 75 85 95 105 115 125 135 145

$xname

[1] "score"

$equidist

[1] TRUE

attr(,"class")

[1] "histogram"

As you can see, the breaks are applied at multiple points. We can restrict the number of break points or vary the density. Over and above this, we can colour the bars and overlay a normal distribution curve. Here is how you can do all this:

hist(score, freq=FALSE, xlab="Score", main="Distribution of score", col="lightgreen", xlim=c(0,150), ylim=c(0, 0.02))

curve(dnorm(x, mean=mean(score), sd=sd(score)), add=TRUE, col="darkblue", lwd=2)

Part 6: How to generate frequency tables with R?

Frequency tables are the most basic and effective way to understand distribution across categories.

Here is a simple example of calculating a one-way frequency:

attach(iris)

table(iris$Species)

Here is code which can find the cross tab between two categorical variables:

# 2-Way Cross Tabulation

library(gmodels)

CrossTable(mydata$myrowvar, mydata$mycolvar)

Part 7: How to sample Data set in R?

To sample a dataset in R, we first need to find a few random indices. Here is how you can draw a random sample:

mysample <- mydata[sample(1:nrow(mydata), 100,replace=FALSE),]

This code will simply take out a random sample of 100 observations from the table mydata.

Part 8: How to remove duplicate values of a variable?

Removing duplicates in R is extremely simple. Here is how you do it:

> set.seed(150)

> x <- round(rnorm(20, 10, 5))

> x

[1] 2 10 6 8 9 11 14 12 11 6 10 0 10 7 7 20 11 17 12 -1

> unique(x)

[1] 2 10 6 8 9 11 14 12 0 7 20 17 -1

Part 9: How to find class-level count, average and sum in R?

We generally use the apply family of functions for these jobs.

> tapply(iris$Sepal.Length,iris$Species,sum)

setosa versicolor virginica

250.3 296.8 329.4

> tapply(iris$Sepal.Length,iris$Species,mean)

setosa versicolor virginica

5.006 5.936 6.588

Part 10: How to recognize and treat missing values and outliers?

Identifying missing values can be done as follows:

> y <- c(4,5,6,NA)

> is.na(y)

[1] FALSE FALSE FALSE TRUE

And here is a quick fix for the same:

y[is.na(y)] <- mean(y,na.rm=TRUE)

y

[1] 4 5 6 5

As you can see, the missing value has been imputed with the mean of the other numbers. Similarly, we can impute missing values with whatever best value is available.

Part 11: How to merge / join data sets?

This is yet another operation which we use in our daily life.

To merge two data frames (datasets) horizontally, use the merge function. In most cases, you join two data frames by one or more common key variables (i.e., an inner join).

# merge two data frames by ID

total <- merge(data frameA,data frameB,by="ID")

# merge two data frames by ID and Country

total <- merge(dataframeA, dataframeB, by = c("ID", "Country"))

Appending datasets is another very frequently used operation. To join two data frames (datasets) vertically, use the rbind function. The two data frames must have the same variables, but they do not have to be in the same order.

total <- rbind(dataframeA, dataframeB)

Ref From Analytic's Vidya
