Business Analytics using R: 4 Predictive Modeling Techniques

4.1 Predictive Modeling Techniques

Objectives

Understand regression analysis and types of regression models
Know and build a simple linear regression model
Understand and develop a logistic regression model
Learn cluster analysis, types and methods to form clusters
Know time series and its components
Decompose Seasonal and non-seasonal time series
Understand different exponential smoothing methods
Know the advantages and disadvantages of exponential smoothing
Understand the concept of white noise and correlogram
Apply different time series analysis methods such as Box-Jenkins, AR, MA, ARMA etc.
Understand all the analysis techniques with case studies

4.2 Regression Analysis

Regression analysis focuses on finding a relationship between a dependent variable and one or more independent variables.

It is used to predict the value of a dependent variable based on one or more independent variables.

Each coefficient explains the impact of a change in an independent variable on the dependent variable.

Regression models are generally denoted by this equation:

Y = f(X, beta)

Where, Y is the dependent variable

X is the independent variable

beta is the unknown coefficient

Regression is widely used in prediction and forecasting.

Types of regression models

Regression Models
|-------Uni-variate
|       |-------Linear
|       |       |------Simple
|       |       \------Multiple
|       \-------Non Linear
\-------Multivariate
        |-------Linear
        \-------Non Linear

In Uni-variate Models there is a single response (dependent) variable. It is the simplest form of statistical analysis.
Correspondingly, Multivariate Models refer to models with more than one response variable.
The Uni-variate and Multivariate Models can each be further classified as Linear and Non-Linear Models.
In a linear model the relationship is fitted with a straight line (the model is linear in its coefficients); otherwise the model is considered non-linear.
The Uni-variate Linear Model is further divided into Simple and Multiple. Usually more than one independent variable influences the dependent variable. A regression that uses one independent variable is called a simple regression. A regression that uses two or more independent variables is called a multiple regression.
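
As a small illustration of the simple versus multiple distinction, the sketch below fits both kinds of model. It assumes the mtcars data set that ships with R, chosen here only as a stand-in; the variable names mpg, wt and hp belong to that data set, not to this chapter.

```r
# mtcars is a data set that ships with R; it is used here only as a stand-in.
simple_fit   <- lm(mpg ~ wt, data = mtcars)        # one independent variable
multiple_fit <- lm(mpg ~ wt + hp, data = mtcars)   # two independent variables

coef(simple_fit)     # intercept plus one slope
coef(multiple_fit)   # intercept plus one slope per predictor
```

Both models have a single response variable (mpg); only the number of predictors differs.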

4.3 Simple Linear Regression

It is a common technique to determine how one variable of interest is affected by another.
It is used for three main purposes:

To describe the linear dependence of one variable on the other.
To predict values of one variable from values of the other, for which more data are available.
To correct for the linear dependence of one variable on the other.

A line is fitted through the group of plotted data.
The vertical distance of a plotted point from the line gives its residual value.
The residual value is the discrepancy between the actual and the predicted value.
The procedure to find the best fit is called the least-squares method.

Linear Regression Model

The equation that represents how a dependent variable is related to an independent variable and an error term is called the regression model.

y = B0 + B1 x + e

Where B0 and B1 are called the parameters of the model, and e is a random variable called the error term.

Getting the estimates of B0 and B1, i.e. of E(Y|X) = B0 + B1 x, means finding the best straight line that can be drawn through the scatter plot of Y vs X. This is done with Least Squares (LS) estimates.
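
The least-squares estimates have a simple closed form, which can be sketched as follows. The example assumes the built-in cars data set (speed and stopping distance), picked purely for illustration, and uses the standard identities B1 = cov(x, y) / var(x) and B0 = mean(y) - B1 * mean(x).

```r
# cars (speed, dist) ships with R and is used purely for illustration.
x <- cars$speed
y <- cars$dist

# Least-squares estimates for y = B0 + B1 x + e
b1 <- cov(x, y) / var(x)          # slope
b0 <- mean(y) - b1 * mean(x)      # intercept

# The same estimates obtained from lm()
fit <- lm(y ~ x)
c(b0 = b0, b1 = b1)
coef(fit)
```

The two printed vectors should agree up to floating-point rounding.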

Simple Linear Regression - Graphical understanding.

The diagram here depicts a graphical plot of linear regression.

The points in blue are the observed values of Y for the corresponding x values.
The straight line is the linear model defined by the linear regression model equation discussed in the previous section.
B1 is the slope of the equation, i.e. when x changes by one unit, y changes by B1. If B1 is positive, x and y are positively correlated; if it is negative, the two variables are negatively correlated.
B0 is the value of Y at X = 0, i.e. the intercept of the equation.
The point on the straight line at X = X0 is the predicted value of Y at X = X0. The difference between the observed value and the predicted value is the residual or error term.

Process to build a regression model

Identify the target variable.
Identify the predictors.
Collect the data.
Decide the relationship (simple data analysis and a scatter plot are used to identify this).
Fit the model (derive a mathematical equation to help predict the response variable).
Evaluate the model (check the efficiency of the fitted model in predicting the outcomes).

Linear Regression Model Assumptions

The predictor variable x is non-random.
The error term e is random.
The error term follows a normal distribution.
The standard deviation of the error is independent of x.
The observations used to estimate the parameters should be independent of each other.
If any of the above assumptions is violated, the modelling procedure must be modified.
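
Some of these assumptions can be checked roughly from the residuals of a fitted model. The sketch below is only one possible set of checks, run on the built-in cars data set as a stand-in; the choice of the Shapiro-Wilk test and of the simple spread check is an assumption of this example, not a prescribed procedure.

```r
fit <- lm(dist ~ speed, data = cars)   # cars ships with R; illustrative only
res <- residuals(fit)

# With an intercept in the model, least-squares residuals average to zero.
mean(res)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal.
normality <- shapiro.test(res)
normality$p.value

# Rough look at constant spread: correlation between |residual| and fitted value.
spread_check <- cor(abs(res), fitted(fit))
spread_check
```

A spread_check value far from zero hints that the standard deviation of the error changes with x.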

4.4 Coefficient of Determination - R^2

A measure of goodness of fit: how well does your model fit the data?

We will now look at how different values of the correlation coefficient r, and the corresponding R^2, are interpreted.

In the first figure the line is perfectly horizontal and r is 0 (R^2 = 0), which implies no linear relationship.
In the second figure r is -1 (R^2 = 1), which implies a perfect negative linear relationship.
In the last figure r is +1 (R^2 = 1), which implies a perfect positive linear relationship.

4.5 How good is the model?

Based on the R^2 value, we can say how well the model explains the data, i.e. the percentage of the variation that is explained by the model.
The differences between observations that are not explained by the model make up the error term or residual.
Suppose we have a case in which the R^2 value is 0.74. This means that 74% of the variance in the values of the dependent variable is explained by the model, and the remaining 26%, which is not explained, is left in the residual or error term.
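
The R^2 value can be recomputed by hand as 1 minus the ratio of unexplained to total variation. The following sketch assumes the built-in cars data set as a stand-in and compares the manual value with the one R reports.

```r
fit <- lm(dist ~ speed, data = cars)   # cars ships with R; illustrative only

ss_res <- sum(residuals(fit)^2)                    # unexplained variation
ss_tot <- sum((cars$dist - mean(cars$dist))^2)     # total variation
r_squared <- 1 - ss_res / ss_tot

r_squared
summary(fit)$r.squared            # value reported by R
cor(cars$speed, cars$dist)^2      # for simple regression, R^2 equals r squared
```

All three printed values should match, which also shows why R^2 can never be negative in a simple linear regression with an intercept.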

4.6 How to find linear regression equation?

SUBJECT	AGE(X)	GLUCOSE LEVEL(Y)	XY	X^2	Y^2
1	43	99	4257	1849	9801
2	21	65	1365	441	4225
3	25	79	1975	625	6241
4	42	75	3150	1764	5625
5	57	87	4959	3249	7569
6	59	81	4779	3481	6561
Sum	247	486	20485	11409	40022

We will now look at an example of finding a linear regression equation.

In our example we will look at two variables: Age as X and the corresponding Glucose Level as Y.

We are going to see how to construct a linear regression line for these variables.

The general equation for regression analysis is Y' = A + BX.

In order to calculate this equation manually we need to calculate three more columns: the XY, X^2 and Y^2 values. Then we calculate the sigma values by summing up each of these columns.

ΣX = 247

ΣY = 486

Σ(XY) = 20485

ΣX^2 = 11409

ΣY^2 = 40022

Here, in order to derive the regression equation we need to find the intercept A and the coefficient B of the independent variable.

To calculate the value of the intercept A use the formula:

A = [ ΣY.Σ(X^2) - ΣX.Σ(XY) ] / [ N.Σ(X^2) - (ΣX)^2 ]

= (486 * 11409 - 247 * 20485) / (6 * 11409 - 247^2)

= 65.14157152

Then we need to calculate the coefficient B of the independent variable.

To calculate B use the formula:

B = [ N.Σ(XY) - ΣX.ΣY ] / [ N.Σ(X^2) - (ΣX)^2 ]

= (6 * 20485 - 247 * 486) / (6 * 11409 - 247^2)

= 0.385224983

From these values of A and B we can obtain the final regression equation:

Y' = A + BX

= 65.14157152 + 0.385224983 * X

From the above equation we can calculate a future Y value by substituting the future X value.

Also, once B is known, the intercept can be computed as:

A = (ΣY - B.ΣX) / N
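
The hand calculation can be checked in R. This sketch types in the six age and glucose values from the table, applies the sum-based formulas for B and A, and compares the result with lm(); the variable names age and glucose are chosen here for readability.

```r
age     <- c(43, 21, 25, 42, 57, 59)
glucose <- c(99, 65, 79, 75, 87, 81)
n <- length(age)

# Sum-based formulas from the text
B <- (n * sum(age * glucose) - sum(age) * sum(glucose)) /
     (n * sum(age^2) - sum(age)^2)
A <- (sum(glucose) - B * sum(age)) / n

c(A = A, B = B)

# lm() should give the same intercept and slope
fit <- lm(glucose ~ age)
coef(fit)
```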

4.7 Commands to perform linear regression in R

R provides comprehensive support for linear regression. To fit a linear regression model in R we use the lm() function. We will use R's built-in help to check the functionality of lm().

Refer: http://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html

lm is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (although aov may provide a more convenient interface for these).

The lm function takes various arguments.

lm(formula, data, subset, weights, na.action,
   method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
   singular.ok = TRUE, contrasts = NULL, offset, ...)

But we are going to look at the most frequently used arguments in regression analysis.

formula

an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’.

data

an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called.

The formula in the lm model is specified using the tilde (~) symbol. The syntax is response ~ terms. The response is the numeric dependent vector and the terms are the independent (predictor) variables for the response. We can also use the plus (+) symbol to include more predictor terms.

An lm model object has various components. Two of the most important components are coefficients and residuals.

coefficients - a named vector of coefficients

residuals - the residuals, that is response minus fitted values.
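
A short sketch of how these two components can be inspected is given here. It again assumes the built-in cars data set as a stand-in, and checks that the stored residuals really are the response minus the fitted values.

```r
fit <- lm(dist ~ speed, data = cars)   # cars ships with R; illustrative only

fit$coefficients          # named vector: (Intercept) and speed
head(fit$residuals)       # first few residuals

# Residuals are the response minus the fitted values
manual_res <- cars$dist - fitted(fit)
all.equal(unname(manual_res), unname(fit$residuals))
```

The accessor functions coef(fit) and residuals(fit) return the same components and are often preferred over the $ form.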

Demo: how to draw the regression line for the given data set in R.

The scatter plot is generally used to plot the quantitative variables and display them as geometric points inside a Cartesian diagram.

To do this let us take the faithful data set, which is built into R.

> head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55

The data set contains two variables, eruptions and waiting. The eruptions column specifies the duration of an eruption in minutes and waiting specifies the waiting time to the next eruption, also in minutes.

First let us make a scatter plot of eruption duration and waiting interval and then try to find out if there is any relationship between the variables.

> attach(faithful)

> plot(eruptions, waiting, xlab="Eruption Duration", ylab="Time Waited")

The resulting plot shows that there is a positive linear relationship between eruption duration and time waited.

This depicts the fact that if the time waited is high then the eruption duration will also be higher.

We can also fit a linear regression model with the lm function and then draw a trend line using the abline function.

> abline(lm(waiting~eruptions))
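
Once the model behind the trend line is stored in a variable, it can also be used for prediction. The sketch below is one way to do that; the 3.5-minute eruption is an arbitrary example value, and passing data = faithful is a choice made here so that the code does not depend on attach().

```r
fit <- lm(waiting ~ eruptions, data = faithful)   # data= avoids the need for attach()
coef(fit)

# Predicted waiting time after a 3.5-minute eruption (arbitrary example value)
new_obs <- data.frame(eruptions = 3.5)
pred <- predict(fit, newdata = new_obs)
pred

# Strength of the linear relationship seen in the scatter plot
r <- cor(faithful$eruptions, faithful$waiting)
r
```

The column name in new_obs must match the predictor name used in the formula, otherwise predict() cannot find the variable.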

4.8 Linear Regression to Predict Sales : Case Study 1

Chip Chops is a global ice-cream manufacturing company specializing in fruit and nut flavored ice-cream. They have a very wide customer base spanning almost all parts of the world. They want to find useful insights into individual and social consumption patterns so that they can make changes to their business that may yield a higher rate of ice-cream consumption by individuals. They hired an R expert, Richard, to work on this and asked him to come up with insights that may help build their profitability and increase the consumption rate. The data was readily available with the company, and the firm wanted Richard to work on a sample of the data and produce insights before trusting him with the project.

So now Richard has a sample data set from the company, and it holds 4 variables. The first variable specifies the consumption of ice-cream per person and it is a numeric variable. The second variable specifies the average family income per week in US dollars. The third variable gives the price of ice-cream per unit. The fourth variable specifies the average temperature experienced in the city, in degrees Fahrenheit. The company's aim is to attract more customers, increase the consumption rate and ultimately increase their sales profits. To achieve this, Richard needs to find the important factors that affect the consumption rate.

In order to find this relationship he wants to perform a linear regression. This regression analysis will help us find the relationship between the factors and predict the future consumption rate. In linear regression analysis, as we know, there is a dependent variable and one or more independent variables. In our case we want to explain the consumption rate, so it is declared as the dependent variable, and the factors it relies on are the independent variables. The remaining three variables, average income, price and temperature, are the independent variables. Now we will perform the linear regression to see how they affect the consumption rate.

In order to perform linear regression in R, the first step is to load your data set into the R workspace.
> data <- read.csv("E:/RWD/SimpliLearn/Video2-Icecream.csv")
> View(data)

We will now perform the linear regression using the lm function, specifying the dependent and independent variables in the formula, and store the result in a new variable. In our example it is given as

> analysis <- lm(cons~income+price+temp,data=data)
> summary(analysis)

Call:
lm(formula = cons ~ income + price + temp, data = data)

Residuals:
Min 1Q Median 3Q Max
-0.065302 -0.011873 0.002737 0.015953 0.078986

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1973151 0.2702162 0.730 0.47179
income 0.0033078 0.0011714 2.824 0.00899 **
price -1.0444140 0.8343573 -1.252 0.22180
temp 0.0034584 0.0004455 7.762 3.1e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.03683 on 26 degrees of freedom
Multiple R-squared: 0.719, Adjusted R-squared: 0.6866
F-statistic: 22.17 on 3 and 26 DF, p-value: 2.451e-07

The results first display the call, i.e. the formula that was used for the regression analysis.
Next come the residuals, summarized by the minimum, first quartile, median, third quartile and maximum values of the residual term.
The most important interpretation is made from the coefficients. This table provides the intercept and the estimated coefficient of each independent variable.
From these results we can construct a regression line, and based on this line we can find the relationship between the dependent and independent variables.
We can also predict future values from this equation.
The accuracy of these predictions depends on the size of the residuals.
Here the residuals are small. The smaller the residuals, the more accurate the predicted values will be.
From the obtained coefficients we can write the regression line as:

cons = 0.1973 + 0.0033 * income - 1.0444 * price + 0.0035 * temp + residuals

From this regression line we can predict future consumption by supplying the income, price and temperature values.
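
Because the CSV file is not reproduced here, the following sketch applies the fitted equation directly, with the coefficients copied from the summary output. The function name predict_cons and the input values are hypothetical and serve only to show the arithmetic.

```r
# Coefficients copied from the summary(analysis) output
b <- c(intercept = 0.1973151, income = 0.0033078,
       price = -1.0444140, temp = 0.0034584)

# Hypothetical helper that evaluates the regression equation
predict_cons <- function(income, price, temp) {
  unname(b["intercept"] + b["income"] * income +
         b["price"] * price + b["temp"] * temp)
}

# Hypothetical inputs: weekly income 80, unit price 0.27, temperature 60 F
predict_cons(income = 80, price = 0.27, temp = 60)

# Effect of a 10-degree rise with income and price unchanged
predict_cons(80, 0.27, 70) - predict_cons(80, 0.27, 60)
```

With the fitted object at hand, predict(analysis, newdata = ...) on a data frame with income, price and temp columns would do the same job without copying coefficients.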

The results also display the p-values. From the low p-values for the income and temp variables, and the significance codes, we can see that these two variables are statistically significant. From this and their positive coefficients we can conclude that when people's income is higher and the temperature of the day is higher, more ice-cream is sold and the consumption rate of individuals increases. The coefficient of price is negative, which suggests that if the price of ice-cream goes down the demand goes up and people tend to buy more of it. Note, however, that the p-value for price (0.22) is above 0.05, so this sample does not establish the price effect as statistically significant.

The residual standard error is about 0.037, i.e. predictions typically deviate from the observed consumption by about that amount.
The results also provide the Multiple R-squared and Adjusted R-squared values.
The Multiple R-squared value is about 0.72, i.e. about 72% of the variance in the dependent variable is explained by the three given factors.
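
The Adjusted R-squared in the output can be reproduced from the Multiple R-squared with the usual adjustment for the number of predictors. This is a sketch using only numbers read off the summary; the sample size of 30 is inferred here from the 26 residual degrees of freedom plus 4 estimated coefficients.

```r
r2 <- 0.719    # Multiple R-squared from the output
n  <- 30       # observations: 26 residual degrees of freedom + 4 coefficients
p  <- 3        # predictors: income, price, temp

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2         # close to the reported 0.6866
```

The adjusted value is lower because it penalizes the model for each extra predictor.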

Finally the F-statistic and its p-value are displayed. The p-value obtained is less than 0.05, so we can reject the null hypothesis and conclude that there is some linear relationship between the dependent and independent variables.
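
The F-statistic and its p-value can be cross-checked from the numbers in the summary. This is a minimal sketch that uses the rounded values printed above, so small differences from the output are expected.

```r
r2  <- 0.719   # Multiple R-squared from the output
df1 <- 3       # number of predictors
df2 <- 26      # residual degrees of freedom

# F-statistic recomputed from R-squared (differs slightly because R-squared is rounded)
f_from_r2 <- (r2 / df1) / ((1 - r2) / df2)
f_from_r2

# p-value for the reported F-statistic of 22.17
p_value <- pf(22.17, df1, df2, lower.tail = FALSE)
p_value
p_value < 0.05   # TRUE means the null hypothesis is rejected at the 5% level
```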

We can summarize the results as follows:
If the price of ice-cream is lowered, the consumption rate of individuals tends to increase, although this effect is not statistically significant in this sample.
If the income of people and the temperature of a particular locality are higher, the demand for ice-cream increases and the consumption rate per individual goes up.
These insights will help the company make decisions on pricing and thereby extend its profitability.

4.9 Linear Regression: Case Study 2

An analytics startup is involved in analytics for reverse logistics. The people employed by this company have different amounts of working experience and draw different salaries. The company decided to monitor the working skills of each employee, so it introduced a format called score board rating. In this format each employee is awarded a score based on his or her working performance. The employee is awarded a higher score if the working performance for that period has been good. The company wants to analyse and confirm whether more experience and a higher score card rating lead to a higher salary.

In order to analyze this we need data on the salary, experience and score rating of individual employees. Since it is a relatively new startup they have only 20 employees, and we have the details of all of them. The first variable specifies the years of relevant work experience of the employee. The second variable specifies the score rating that the employee earned through performance. The third variable specifies the amount that the employee draws as salary, in thousands of dollars. In order to find the relationship between these variables we need to fit a linear regression model in R. With it we can establish the relationship between the variables and also predict future values of the dependent variable by constructing a regression line.

Before performing regression analysis in R, we need to load the data set in R for which we are going to perform analysis:

> mydata <- read.csv("E:/RWD/SimpliLearn/Video3-rating.csv")

> head(mydata)
  experience score salary
1          4    78   24.0
2          7   100   43.0
3          1    86   23.7
4          5    82   34.3
5          8    86   35.8
6         10    84   38.0

Here in our example we need to check whether the salary depends on the score card rating and work experience. So the dependent variable is salary and the independent variables are score and experience.

> fit <- lm(salary~score+experience,data=mydata)
> summary(fit)

Call:
lm(formula = salary ~ score + experience, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max
-4.3586 -1.4581 -0.0341  1.1862  4.9102

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.17394    6.15607   0.516  0.61279
score        0.25089    0.07735   3.243  0.00478 **
experience   1.40390    0.19857   7.070 1.88e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.419 on 17 degrees of freedom
Multiple R-squared: 0.8342, Adjusted R-squared: 0.8147
F-statistic: 42.76 on 2 and 17 DF, p-value: 2.328e-07

The results first show the call, which gives the formula that was specified for the regression analysis.

Next come the residuals, summarized by the minimum, first quartile, median, third quartile and maximum values of the residual term.

The most important interpretation is made from the coefficients.
This table provides the intercept together with the estimates, standard errors, t-statistics and p-values of both experience and score (the independent variables).
From these results we can construct a regression line, and based on this line we can express the relationship between the dependent and independent variables as

Salary = 3.174 + 1.404 * experience + 0.251 * score + residuals

Using this regression line we can predict future values of salary by supplying the experience and score rating of each employee.
The residual values should be small, as otherwise the predictions would be unreliable.
The smaller the residuals, the more accurate the predicted values will be.
Since both coefficients are positive we can also conclude from the given data that if both experience and score rating are high then the salary drawn by the employee is also higher.
The residual standard error is about 2.4 (in thousands of dollars), so the typical deviation between predicted and actual salary is modest.

This means that the model fits this data reasonably well.

The t-values obtained for both experience and score are high and their p-values are very low, below 0.05.

This shows that we can reject the null hypothesis and conclude that there is some relationship between the dependent and independent variables. The significance codes also suggest the same.
Next the residual standard error and its degrees of freedom are displayed.
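
The t values and p-values in the coefficient table follow directly from the estimates and their standard errors. The sketch below recomputes them from the rounded numbers printed in the summary, so the last digits may differ slightly from the output.

```r
est <- c(score = 0.25089, experience = 1.40390)   # estimates from the output
se  <- c(score = 0.07735, experience = 0.19857)   # standard errors from the output
df  <- 17                                         # residual degrees of freedom

t_val <- est / se                     # t value = estimate / standard error
p_val <- 2 * pt(-abs(t_val), df)      # two-sided p-value

round(t_val, 3)
signif(p_val, 3)
```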

The results also contain the Multiple R-squared and Adjusted R-squared values.

The Multiple R-squared value of about 0.83 (adjusted: 0.81) suggests a strong relationship between the dependent and independent factors. In other words, about 83% of the variation in salary can be explained by the given experience and rating factors. Because both coefficients are positive, when the independent variables increase the predicted salary increases as well.

In terms of the regression equation we can draw conclusions such as: salary is expected to increase by about $1,404 for each additional year of experience when score is held at the same level.
The insight we draw from this analysis is that in this analytics company an employee with more working experience and a higher score rating draws a higher salary. Correspondingly, an employee whose experience and score rating are low draws a lower salary.

This confirms our earlier assumptions.

From this analysis we can also predict what salary should be offered to a newly joined employee of this company based on his or her individual working experience and score rating.
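
As a rough sketch of such a prediction, the fitted equation can be applied to new employees. The coefficients are copied from the summary output; the two new hires and their experience and score values are invented for illustration.

```r
# Coefficients copied from the summary(fit) output; salary is in thousands of dollars
intercept <- 3.17394
b_score   <- 0.25089
b_exp     <- 1.40390

# Hypothetical new hires (values invented for illustration)
new_hires <- data.frame(experience = c(2, 6), score = c(80, 90))
new_hires$predicted_salary <- intercept +
  b_score * new_hires$score + b_exp * new_hires$experience
new_hires
```

With the fitted object available, predict(fit, newdata = new_hires) would give the same predictions without copying the coefficients by hand.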

4.10 Case Study - Classification using Linear Regression

By now we know several uses of linear regression and why it is one of the most widely used models around. This case study will show how linear regression can also be used to do classification. Surprised? Let's see how. We will first install the "mlbench" package in R.

install.packages("mlbench")
library(mlbench)
data(PimaIndiansDiabetes2)
head(PimaIndiansDiabetes2)

We already know how to install a package. After installing the package, it needs to be loaded with the library() command. We will be using a data set called "PimaIndiansDiabetes2" from the package. After removing incomplete records, the data consist of 392 women with Pima Indian heritage who live in the area of Phoenix, Arizona. They were tested for diabetes. The goal is to get a classification rule for the diagnosis of diabetes.
The variables are:
pregnant - number of times pregnant
glucose - plasma glucose concentration (glucose tolerance test)
pressure - diastolic blood pressure in mm Hg
triceps - triceps skin fold thickness in mm
insulin - 2-hour serum insulin in mu U/ml
mass - body mass index, i.e. weight in kg / (height in m)^2
pedigree - diabetes pedigree function
age - age in years
diabetes - positive or negative

Here we need to classify the data into positive or negative.
Let us look at the complete data. We see that there are many missing values (NA), which can either be removed or imputed.
For the sake of simplicity we will remove the rows that contain them.

For that we use the na.omit() function.

pidna <- na.omit(PimaIndiansDiabetes2)

This reduces our dataset to 392 rows. Now for linear regression we need numerical values. In our case all the columns have numerical values except the diabetes column, which has categorical values. To convert it we will use the as.numeric() function. R gives a value of 2 for positive and 1 for negative, so we subtract 1 to get 0 for negative and 1 for positive. We then use the head() function to check that the diabetes column has been converted to 0 or 1.

pidlm <- pidna
pidlm$diabetes <- as.numeric(pidna$diabetes)-1
head(pidlm)
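
The recoding step can be seen in isolation on a toy factor. The vector below is made up for illustration and only mimics a two-level column; it does not come from the Pima data.

```r
# Toy factor with two levels, ordered so that "neg" is the first level
status <- factor(c("neg", "pos", "pos", "neg"), levels = c("neg", "pos"))

as.numeric(status)        # internal codes: neg = 1, pos = 2
as.numeric(status) - 1    # recoded: neg = 0, pos = 1
```

as.numeric() on a factor returns the position of each value in the level order, which is why subtracting 1 gives a 0/1 variable.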

Now the data set can be used to apply linear regression.
From this data set we create a training set and a test set. The first 300 records go into the variable called train and the remaining 92 records are kept in the variable called test.

test=pidlm[(301:392),]
train=pidlm[(1:300),]
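
One common way to turn a linear regression on a 0/1 outcome into a classifier is to threshold the fitted values. The sketch below shows the idea on simulated data rather than on the Pima data, so that it runs without the mlbench package; the 0.5 cutoff, the variable names and the simulated data are all assumptions of this example.

```r
set.seed(1)                                   # reproducible simulated data
n <- 200
x <- rnorm(n)
y <- as.numeric(x + rnorm(n, sd = 0.5) > 0)   # 0/1 outcome that depends on x
sim <- data.frame(x = x, y = y)

sim_train <- sim[1:150, ]                     # first 150 rows for fitting
sim_test  <- sim[151:200, ]                   # remaining 50 rows for evaluation

fit <- lm(y ~ x, data = sim_train)            # linear regression on the 0/1 outcome
score <- predict(fit, newdata = sim_test)     # fitted values, one per test row
pred_class <- as.numeric(score > 0.5)         # threshold at 0.5 (an assumed cutoff)

accuracy <- mean(pred_class == sim_test$y)    # share of correctly classified rows
accuracy
```

The same pattern would apply to the train and test sets created above, with diabetes as the response.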

Sunday, March 13, 2016
