Business Analytics using R: 2016

Saturday, March 26, 2016

2 Statistical Concepts and their application in business

Objectives

Statistical Methods
Population and Samples
Sampling Plan and Sampling Methods
Descriptive Statistics and components
Probability Theory and Distributions
Confidence Interval
Concept of Tests of Significance
One Sided and Two Sided Hypothesis Testing
Various Tests of Significance
Non-Parametric Testing

2.1 Statistical Methods

Statistics is a branch of applied mathematics in which we collect, organize, analyze, and interpret numerical facts.

Descriptive Statistics

Measure of Central Tendency
Measure of dispersion
Sample

Inferential Statistics

Estimation
Hypothesis Testing
Population

Population and Samples

Population is any entire collection of objects or observations from which we may collect data. It is the entire group we are interested in, which we wish to describe or draw conclusions about.
For each population there are many possible samples
It is important that the investigator carefully and completely defines the population before collecting the sample, including a description of the members to be included.
A sample is a group of units selected from a larger group (the population). By studying the sample, we hope to draw valid conclusions about the larger group.
A sample is generally selected for study because the population is too large to study in its entirety. The sample should be representative of the general population. This is often best achieved by random sampling.

Developing a sampling plan

Define the target population - in terms of number of elements, sampling unit, extent and time.
Select a sampling method - probability or non-probability sampling.
Obtain the sampling frame - must list all potential sampling units.
Determination of sample size - for desired level of accuracy.
Choose data collection method - procedure to obtain the data.
Develop operational plan - which technique fits the best.
Execute operational plan - verification of specified procedure.

Sampling Techniques

Sampling

Probability

Simple Random - Purest form, where every member has equal probability of participating.
Systematic - Selection of elements from an ordered sampling frame.
Stratified - Dividing the members of the population into homogeneous subgroups (strata) before sampling.
Cluster - When natural but relatively homogeneous groupings are evident in a population.

Non-Probability

Convenience - Sample that is convenient to collect.
Judgemental - Sample is selected with a specific attribute based on the judgement of the researcher.
Quota - Two stages of judgemental sampling: first develop control categories or quotas, then collect the sample based on convenience or judgement to fill the quotas.
Snowball - Initial group of respondents is selected, usually at random or from contacts of existing customers; further respondents are then found through referrals.
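
The three most common probability techniques above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sampling frame of 100 customers, the "region" stratum and the helper names are all hypothetical and do not come from the course material.

```python
import random

# Hypothetical sampling frame: 100 customer IDs, each tagged with a region (stratum).
random.seed(42)  # fixed seed so the sketch is reproducible
frame = [{"id": i, "region": "North" if i < 60 else "South"} for i in range(100)]

def simple_random_sample(units, k):
    """Simple random: every unit has the same chance of being selected."""
    return random.sample(units, k)

def systematic_sample(units, k):
    """Systematic: random start, then every step-th unit of the ordered frame."""
    step = len(units) // k
    start = random.randrange(step)
    return units[start::step][:k]

def stratified_sample(units, key, fraction):
    """Stratified: split into strata by `key`, sample the same fraction from each."""
    strata = {}
    for unit in units:
        strata.setdefault(unit[key], []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, round(len(members) * fraction)))
    return sample

srs = simple_random_sample(frame, 10)
systematic = systematic_sample(frame, 10)
stratified = stratified_sample(frame, "region", 0.10)
print(len(srs), len(systematic), len(stratified))
```

The stratified sample keeps the 60/40 split of the frame, which a simple random sample of the same size only matches on average.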

Descriptive Statistics

Analyze data to extract meaningful information

Non-conclusive, as it is limited to the data being analyzed

Score Range	Number of Students
Below 40	20
40-50	22
50-60	33
60-70	21
70-80	13
>80	5
Total	114

Histograms are used to graphically represent the data.

Measures of Central Tendencies

A Measure of Central Tendency is a method of descriptive statistics which:

Identifies a data set with a single, typical value

Is also called a measure of central location

Falls into the category of summary statistics

Mean
Median
Mode

Mean

Mean is the average of the numbers

A calculated "central" value of a set of numbers

Mean = (Sum of all numbers)/(Count of all numbers)

2 2 6 10

Mean = (2+2+6+10)/4 = 20/4 = 5

Median

Median is the number in the middle

Number of values above and below median are same

3 3 4 5 7 8

3 3 (4 5) 7 8

Median = (4+5)/2 = 4.5

Mode

Calculate the frequency of occurrence of each value

Mode is the value that occurs most often

Let us look at the below histogram (x=data, y=frequency)

Data	00	01	02	03	04	05	06
Frequency	20	23	33	22	17	06	02

It is clearly visible that number 02 has highest frequency in the data set.
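
The three worked examples above can be checked with Python's standard `statistics` module. This is a small sketch rather than part of the original notes; the variable names are illustrative, and the raw observations for the mode are rebuilt from the frequency table.

```python
from collections import Counter
from statistics import mean, median, mode

mean_value = mean([2, 2, 6, 10])            # (2 + 2 + 6 + 10) / 4
median_value = median([3, 3, 4, 5, 7, 8])   # average of the two middle values, 4 and 5

# Rebuild the raw observations from the frequency table (value -> frequency).
frequencies = {0: 20, 1: 23, 2: 33, 3: 22, 4: 17, 5: 6, 6: 2}
observations = list(Counter(frequencies).elements())
mode_value = mode(observations)             # the value with the highest frequency

print(mean_value, median_value, mode_value)
```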

When to use what?

Mean

The average is required for statistical analysis
The variable is continuous/ discrete
Mean is commonly used in case of quantitative variables

Median

The variable is discrete
There are abnormal extreme values/Non-normal data
The characteristic under study is qualitative

Mode

Least commonly used
The variable is discrete
There are abnormal extreme values
The characteristic under study is qualitative

Measure of Dispersion

Describe the amount of heterogeneity or variation within a distribution of scores
The spread or dispersion of a set of scores around some central value
Measure of Dispersion include:

Variance
Standard Deviation

Variance and Standard Deviation

Variance is the average of squared deviations about the mean. For a sample, the sum of squared deviations is divided by n - 1:

s² = Σ(x - x̄)² / (n - 1)

Standard Deviation is the square root of variance

Std. Dev. (s) = √( Σ(x - x̄)² / (n - 1) )

Example data: 2, 5, 5, 4, 6, 8

n = 6
Mean = ( 2 + 5 + 5 + 4 + 6 + 8 ) / 6 = 30 / 6 = 5
Variance = ( (2-5)² + (5-5)² + (5-5)² + (4-5)² + (6-5)² + (8-5)² ) / (6 - 1)
= ( (-3)² + (0)² + (0)² + (-1)² + (1)² + (3)² ) / 5
= ( 9 + 0 + 0 + 1 + 1 + 9 ) / 5 = 20/5 = 4
Std. Deviation (s) = √ Variance = √ 4 = 2
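
The worked example can be reproduced with the standard `statistics` module. This is a sketch, not part of the original notes: `variance` and `stdev` divide by n - 1 (the sample formulas used in the example), while `pvariance` divides by n and is shown only for contrast.

```python
from statistics import pvariance, stdev, variance

data = [2, 5, 5, 4, 6, 8]

sample_variance = variance(data)        # sum of squared deviations / (n - 1)
sample_std_dev = stdev(data)            # square root of the sample variance
population_variance = pvariance(data)   # sum of squared deviations / n

print(sample_variance, sample_std_dev, population_variance)
```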

Case Study - Descriptive Statistics

Business Case: A telecommunications company maintains a customer database that includes, among other things, information on how much each customer spent on long distance, toll-free, equipment rental, calling card, and wireless services in the previous month.

The telecom company surveyed 1000 of its customers on all the above services.

Use descriptive analysis to study customer spending and determine which services are most profitable.

	N	Valid N	Min	Max	Mean	Std. Dev.
Long Distance last month	1000.00	1000.00	0.90	99.95	11.72	10.36
Toll free last month	1000.00	475.00	0.00	173.00	13.27	16.90
Equipment last month	1000.00	386.00	0.00	77.70	14.21	19.07
Calling card last month	1000.00	678.00	0.00	109.25	13.78	14.08
Wireless last month	1000.00	296.00	0.00	111.95	11.58	19.72

From the above table we can draw the following insights:

On average, customers spend the most on equipment rental, but there is a lot of variation in the amount spent.
Customers with calling card service spend only slightly less, on average, than equipment rental customers, and there is much less variation in the values.
The real problem here is that most customers don't have every service, so a lot of 0's are being counted. One solution to this problem is to treat 0's as missing values so that the analysis for each service becomes conditional on having that service.
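
The effect of treating 0's as missing can be seen on a tiny example. The spend figures below are hypothetical and are not taken from the telecom data set; the sketch only shows how the mean changes once it is made conditional on having the service.

```python
from statistics import mean

# Hypothetical monthly spend on one service; 0 means the customer does not have it.
toll_free_spend = [0.0, 25.5, 0.0, 31.0, 0.0, 0.0, 18.5, 0.0]

mean_all = mean(toll_free_spend)                      # zeros drag the average down
subscribers = [x for x in toll_free_spend if x > 0]   # treat 0 as a missing value
mean_subscribers = mean(subscribers)                  # conditional on having the service

print(mean_all, mean_subscribers)
```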

2.2 Probability Theory

Probability is a branch of mathematics that deals with the uncertainty of an event happening in the future.
Probability values always lie between 0 and 1.

Probability of an event, P(E) = (No. of favorable occurrences)/(No. of possible occurrences)

Let us take a simple example of tossing an unbiased coin.
Probability of getting a head is 1/2, and the probability of getting a tail is also 1/2.

Assigning Probabilities

Classic Method - based on equally likely outcomes.
E.g. Rolling a die.
The probability of each number 1, 2, 3, 4, 5, 6 occurring out of total 6 outcomes is 1/6.

Relative frequency method - based on experimentation or historical data.
E.g. A car agency has 5 cars. Its record of how many cars were used on each of the past 60 days is shown in the table.

No. of Cars Used	No. of days	Probability
0	3	(3/60) = 0.05
1	10	(10/60) = 0.17
2	16	(16/60) = 0.27
3	15	(15/60) = 0.25
4	9	(9/60) = 0.15
5	7	(7/60) = 0.12
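
The relative frequency method amounts to dividing each count by the total. A short sketch using the car agency figures from the table (the variable names are illustrative):

```python
# Number of cars used -> number of days on which that happened (from the table).
days_by_cars_used = {0: 3, 1: 10, 2: 16, 3: 15, 4: 9, 5: 7}

total_days = sum(days_by_cars_used.values())
probabilities = {cars: days / total_days for cars, days in days_by_cars_used.items()}

for cars, probability in probabilities.items():
    print(cars, round(probability, 2))
```

The probabilities add up to 1 because the six outcomes cover every day in the record.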

Subjective method - based on judgement.

E.g.: 75% chance that England will adopt the Euro currency by 2020.

Probability Distribution

Probability distribution for a random variable gives information about how the probabilities are distributed over the values of that random variable.

It's defined by f(x) which gives probability of each value.

E.g. Suppose we have AC sales data for the last 300 days.

Unit Sold	No. of days	Probability	f(x)
0	10	(10/300) =	0.03
1	55	(55/300) =	0.18
2	150	(150/300) =	0.50
3	55	(55/300) =	0.18
4	25	(25/300) =	0.08
5	5	(5/300) =	0.02

Binomial Distribution

Binomial Distribution satisfies:

A fixed number of trials
Each trial has only two possible outcomes (success or failure)
Each trial is independent of the others
The probability of each outcome remains constant from trial to trial.

Example of binomial experiments

Tossing a coin 20 times, what is the probability of getting head 5 times?

Getting a diamond king from a pack of 52 cards.

Case Study - Binomial distribution

Example of binomial distribution: Amir buys a chocolate bar every day during a promotion that says one out of six chocolate bars has a gift coupon within.
Answer the following questions:

What is the distribution of the number of chocolates with gift coupons in seven days?
What is the probability that Amir gets no chocolates with gift coupons in seven days?
Amir gets no gift coupons for the first six days of the week. What is the chance that he will get one on the seventh day?
Amir buys a bar every day for six weeks. What is the probability that he gets at least three gift coupons?
How many days of purchase are required so that Amir's chance of getting at least one gift coupon is 0.95 or greater?

(Assume that the conditions of binomial distributions apply: the outcomes for Amir's purchases are independent, and the population of chocolate bars is effectively infinite.)

Steps:
Formula: P(X = r) = ⁿC_r p^r q^(n-r)
Where,
n is the no. of trials
r is the number of successful outcomes
p is the probability of success
q is the probability of failure

Other important formula include

p+q=1

q=1-p

Thus,

p=1/6

q=1-(1/6)=5/6

1. Distribution of number of chocolates with gift coupons in 7 days:

P(X = r) = ⁷C_r (1/6)^r (5/6)^(7-r)

2. Probability of getting no coupon in 7 days:

P(X=0) = ⁷C₀ (1/6)⁰ (5/6)^(7-0) = (5/6)⁷ = 0.279

3. Probability of winning a coupon on 7th day: 1/6

4. Probability of winning at least 3 coupons in six weeks (n = 42 days):
P(X>=3) = 1 - P (X<=2)
= 1 - (P(X=0)+P(X=1)+P(X=2))
= 1 - (0.0005 + 0.0040 + 0.0163)
= 0.979
5. Number of purchase days required so that the probability of at least one coupon is 0.95 or greater:
P(X>=1) >= 0.95 (As per Binomial Distribution)
>> P(X=1) + P(X=2) + ... + P(X=n) >= 0.95 but since the sum of P(X=r) over all r is 1,
>> 1 - P(X=0) >= 0.95
>> P(X=0) <= 0.05
>> (5/6)^n <= 0.05; take the log of both sides to solve this exponential equation
>> n log(5/6) <= log(0.05)
>> Since log(5/6) is negative, dividing by it reverses the inequality: n >= log(0.05)/log(5/6) = 16.43
>> that is n=17 days minimum.
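
The answers to questions 2, 4 and 5 can be checked numerically. The sketch below uses only `math.comb` from the standard library; the helper name `binomial_pmf` is an illustrative choice, not something from the course material.

```python
from math import comb

def binomial_pmf(r, n, p):
    """P(X = r) for n independent trials with success probability p."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

p = 1 / 6  # one bar in six carries a coupon

# Question 2: no coupon in seven days.
p_none_in_7 = binomial_pmf(0, 7, p)

# Question 4: at least three coupons in six weeks (42 days).
p_at_least_3_in_42 = 1 - sum(binomial_pmf(r, 42, p) for r in range(3))

# Question 5: smallest number of days with P(at least one coupon) >= 0.95.
min_days = 1
while 1 - binomial_pmf(0, min_days, p) < 0.95:
    min_days += 1

print(round(p_none_in_7, 3), round(p_at_least_3_in_42, 3), min_days)
```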

Normal Distribution

Normal distribution

A Normal distribution is a theoretical model of the whole population.
It is perfectly symmetrical about its central value, the mean mu.
It is also called the bell curve.
The standard normal distribution has a mean of 0 and a std. dev. of 1.
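
The symmetry of the bell curve can be illustrated with `statistics.NormalDist` from the Python standard library. This is a sketch using the standard normal (mean 0, std. dev. 1); the cut-off values 1.5 and 1 are arbitrary illustrative choices.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Half of the probability lies below the mean.
p_below_mean = standard_normal.cdf(0)

# Symmetry: the tail below -1.5 is as large as the tail above +1.5.
p_left_tail = standard_normal.cdf(-1.5)
p_right_tail = 1 - standard_normal.cdf(1.5)

# Probability of falling within one standard deviation of the mean.
p_within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)

print(p_below_mean, p_left_tail, p_right_tail, p_within_one_sd)
```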

Poisson distribution

Discrete probability distribution for events that happen randomly in time
Following conditions need to be satisfied:

Each event results in a success or a failure
The average number of successes, mu is known
Probability of success is proportional to the region/time.
Probability of success in an extremely small region/time is almost zero.
Properties: The mean and the variance are equal, and both are denoted by mu.

Examples:

Average number of houses sold by a company is 5 per day. What is the probability that exactly 4 houses will be sold tomorrow?
Average number of births in a hospital is 2.1 births per hour. What is the probability that there will be exactly 6 births in the next 2 hours.
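
Both examples can be worked out with the Poisson probability formula P(X = k) = e^(-mu) mu^k / k!. The sketch below is an illustration with a hypothetical helper name; for the second example the hourly rate of 2.1 is scaled to the 2-hour window, giving mu = 4.2.

```python
from math import exp, factorial

def poisson_pmf(k, mu):
    """P(X = k) when events occur at an average rate of mu per interval."""
    return exp(-mu) * mu**k / factorial(k)

# Example 1: average of 5 houses per day, exactly 4 sold tomorrow.
p_four_houses = poisson_pmf(4, 5)

# Example 2: 2.1 births per hour, so mu = 2.1 * 2 for a 2-hour window; exactly 6 births.
p_six_births = poisson_pmf(6, 2.1 * 2)

print(round(p_four_houses, 4), round(p_six_births, 4))
```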

Skewness and Kurtosis

Skewness: Measure of deviation from symmetry

In a skewed distribution the mean, median and mode no longer coincide
Right or Left skewed
Skewness negative - longer tail towards the lower values (Left Skewed)
Skewness positive - longer tail towards the higher values (Right Skewed)

Kurtosis: measure of peakedness of the distribution

High Kurtosis - Tall peak, rapid decline in the tails.
Low Kurtosis - flat peaks, gradual decline in the tails.
Extreme Case - Uniform distribution.
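
One common way to compute the two measures is from the central moments of the data. The sketch below uses the simple moment formulas (third and fourth standardized moments, with 3 subtracted so that a normal distribution has kurtosis 0); statistical packages often apply small-sample corrections, so their figures can differ slightly. The two data sets are hypothetical.

```python
from statistics import fmean

def central_moment(data, order):
    """Average of (x - mean) ** order over the data."""
    m = fmean(data)
    return sum((x - m) ** order for x in data) / len(data)

def skewness(data):
    """Third standardized moment: 0 for symmetric data, > 0 for a right tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]            # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]     # hypothetical data with one large value

print(skewness(symmetric), skewness(right_skewed), excess_kurtosis(symmetric))
```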

Case Study: Skewness and Kurtosis

		Skewness		Kurtosis
	N	Statistic	Std. Error	Statistic	Std. Error
Long Distance last month	1000	2.966	0.077	14.012	0.155
Toll free last month	475	3.465	0.112	26.735	0.224
Equipment last month	386	0.756	0.124	0.641	0.248
Calling card last month	678	2.15	0.094	7.572	0.187
Wireless last month	296	1.359	0.142	3.079	0.282

Equipment last month is the closest to a normal distribution: its skewness (0.756) and kurtosis (0.641) are the lowest of the five measures.

Confidence interval

A confidence interval is an interval, calculated from sample information by a fixed rule, that is likely to include a population parameter.
Suppose random samples are taken repeatedly from the population and an interval is calculated from each sample; a certain percentage of these intervals will contain the unknown parameter.
If the intervals are calculated in such a way that 95% of them contain the true value of the unknown parameter, each one is called a 95% confidence interval.
We are then said to be 95% confident that the interval contains the population parameter.
The upper and lower limits of the 95% confidence interval are the confidence limits.
The confidence level is the probability value that is associated with a confidence interval.
The probability value is (1 - alpha). This value is often represented as a percentage.
Say for a value of alpha = 0.05 the confidence level would be 0.95. This is a 95% confidence level.

Data Requirement:

Confidence level
Statistic
Margin of Error
Range of the confidence interval = sample statistic +/- margin of error.
The uncertainty associated with the confidence interval is specified by the confidence level.

How to construct a Confidence Interval

Identify a sample statistic - Choose the statistic that will be used to estimate a population parameter.
The statistic is generally the mean or the median or the mode in some cases.
Select the confidence level - It describes the uncertainty of sampling method.
Find the margin of error = Critical Value * Standard error of statistic.
Specify the confidence interval - The range of the confidence interval is defined by the following equation.
Confidence interval = sample statistic +/- Margin of error.
e.g. Margin of error = 1.86 and Sample statistic = 150
Confidence interval = (150 - 1.86) to (150 + 1.86)
Confidence interval = 148.14 to 151.86
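
The construction steps can be sketched in Python. The first part reproduces the example above directly. The second part shows where a margin of error could come from, assuming a normal critical value; the sample standard deviation of 9.5 and the sample size of 100 are hypothetical values chosen for illustration and are not given in the notes.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(statistic, margin_of_error):
    """Confidence interval = sample statistic +/- margin of error."""
    return statistic - margin_of_error, statistic + margin_of_error

# The example from the text.
lower, upper = confidence_interval(150, 1.86)

# Margin of error = critical value * standard error (hypothetical inputs).
confidence_level = 0.95
alpha = 1 - confidence_level
critical_value = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided z critical value
standard_error = 9.5 / sqrt(100)                       # assumed std. dev. and sample size
margin = critical_value * standard_error

print(round(lower, 2), round(upper, 2), round(critical_value, 2), round(margin, 2))
```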

2.3 Tests of Significance

Tests used in assessing the evidence in favor of or against a given assumption
Begins with a Null Hypothesis, Ho
Tests either fail to reject the null hypothesis, or reject it in favor of an Alternate Hypothesis, Ha
Two types of tests:

One sided tests
Two sided tests

Results decided by calculating the "p - value"
The p-value is the probability that the calculated test statistic takes a value at least as extreme as the observed value, given that the null hypothesis is true.
Interpretation:

If p-value is less than the significance level alpha, reject the null hypothesis.
General values of alpha are 0.05, 0.01.

General Assumptions:

The distribution is almost normal
The samples have almost equal variances.

One sided hypothesis testing

Muo = null value
Null hypothesis Mu = Muo
Alternative hypothesis: Mu < Muo or Mu > Muo

Example: Given a sample of heights of 100 males in New York, decide whether the height has increased in general from a given average height of 5 feet 9 inches.

Null Value: Muo = 5 feet 9 inches (5.75 feet)
Null Hypothesis: Mu = 5.75
Alternative Hypothesis: Mu > 5.75

Using one of various hypothesis tests, calculate "p-value" and reject null hypothesis if p-value is less than 0.05.

Two sided hypothesis testing

Muo = null value
Null Hypothesis: Mu = Muo
Alternative hypothesis: Mu <> Muo

Example: Given a sample of heights of 100 males in New York, decide whether the height has changed (increased or decreased) in general from a given average height of 5 feet 9 inches.

Null Value: Muo = 5 feet 9 inches (5.75 feet)
Null Hypothesis: Mu = 5.75
Alternative Hypothesis: Mu <> 5.75

Using one of various hypothesis tests, calculate p-value and reject null hypothesis if p-value is less than 0.05.
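
The difference between the two kinds of test shows up in how the p-value is calculated. The sketch below uses a z test on the height example, working in inches (5 feet 9 inches = 69 inches). The sample mean of 69.5 and the known standard deviation of 3 are hypothetical values, chosen so that the one-sided and two-sided tests reach different conclusions.

```python
from math import sqrt
from statistics import NormalDist

null_mean = 69.0      # 5 feet 9 inches, in inches
sample_mean = 69.5    # hypothetical sample mean of the 100 heights
sigma = 3.0           # hypothetical standard deviation, assumed known
n = 100
alpha = 0.05

z = (sample_mean - null_mean) / (sigma / sqrt(n))

# One sided (Ha: Mu > Muo): only the upper tail counts as extreme.
p_one_sided = 1 - NormalDist().cdf(z)

# Two sided (Ha: Mu <> Muo): both tails count as extreme.
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

reject_one_sided = p_one_sided < alpha
reject_two_sided = p_two_sided < alpha
print(round(z, 3), reject_one_sided, reject_two_sided)
```

With these assumed numbers the same sample rejects the null hypothesis in the one-sided test but not in the two-sided test, because the two-sided p-value is twice as large.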

2.4 Various Tests of Significance

One Sample z test - The Z test is used to compare the mean with the given standard.
Two Sample z test - The Z test is used to compare the means of two groups.
The population standard deviation must be known to calculate the Z statistic.
The Z test is generally used when the sample size is greater than 30.
T test
The t test is also used with mean statistics, but the t statistic is calculated from the sample standard deviation, so the population standard deviation need not be known. The test is preferred if the sample size is less than 30. As with the Z test, the t test can be a one sample, two sample or paired t test.
One Sample t test - To compare the mean of one group with a given standard.
Two Sample t test - When the compared groups are independent. e.g. To compare the marks of students of two different schools.
Paired t test - When the compared groups are paired. e.g. To compare the marks of students of the same school before and after a training class.
Chi-Squared test - The test for goodness of fit is used to test if there is a difference between the observed values and the expected values according to a particular hypothesis.
F test - Analysis of Variance (ANOVA) - To compare the means of two or more groups by analyzing their variances. The most commonly used F test is ANOVA.
F test - Regression - A less common use of the F test is in regression analysis.

In all these tests the null hypothesis states that there is no difference between the means or variances, and the alternative hypothesis suggests otherwise.

Chi-Squared Tests

Compare the observed results against an expected result based on a hypothesis
Steps:

State the null hypothesis
Prepare the contingency table for the variable
Determine the expected results
Calculate the chi-squared values
Calculate the degree of freedom
Based on the above, calculate the p-value
If p-value <0.05, reject the null hypothesis

Test of independence:

Verify if two variables are independent
Same steps as above

Case Study - Chi Squared Test

A city has a newly opened nuclear plant, and there are families staying dangerously close to the plant. A health safety officer wants to take this case up to provide relocation for the families that live in the surrounding area. To make a strong case, he wants to prove with numbers that exposure to radiation is leading to an increase in the diseased population. He formulates a contingency table of exposure and disease.

Exposure	Disease Yes	Disease No	Total
Yes	37	13	50
No	17	53	70
Total	54	66	120

Does the data suggest an association between the disease and exposure?

Steps:

Calculate the number of individuals of exposed and unexposed groups expected in each disease category (yes or no) if the probabilities were the same.
If there were no effect of exposure, the probabilities should be same and the chi-squared statistics would have a very low value.

Proportion of population exposed = (50/120) = 0.42
Proportion of population not exposed = (70/120) = 0.58

Thus, expected values (calculated with the unrounded proportions):
Population with disease = 54
Exposure Yes: 54 * (50/120) = 22.5
Exposure No: 54 * (70/120) = 31.5

Population without disease = 66
Exposure Yes: 66 * (50/120) = 27.5
Exposure No: 66 * (70/120) = 38.5

Exposure	Disease Yes	Disease No	Total	Total Proportion
Yes Actual	37	13	50	50/120 = 0.42
Yes Expected	54 * (50/120) = 22.5	66 * (50/120) = 27.5
No Actual	17	53	70	70/120 = 0.58
No Expected	54 * (70/120) = 31.5	66 * (70/120) = 38.5
Total	54	66	120

Calculate the Chi-Squared statistic

X^2 = Summation of [(Observed Freq. - Expected Freq.)^2/ Expected Freq]
= ((37-22.5)^2 / 22.5) + ((13-27.5)^2 / 27.5) + ((17-31.5)^2 / 31.5) + ((53-38.5)^2 / 38.5)
= 29.1

Calculate the degree of freedom:

df = (Number of rows -1) x (Number of columns -1)
df = (2-1) x (2-1)
df = 1

Calculate the p-value from the chi-squared table (available online).
For Chi-Squared value 29.1 and degree of freedom = 1, from the table, p-value is < 0.001
Interpretation: There is less than a 0.001 chance of obtaining such a discrepancy between expected and observed values if there is no association.
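
The whole calculation can be reproduced in a few lines of Python. This is a sketch with illustrative variable names. Instead of a table lookup it uses the fact that, for 1 degree of freedom only, the chi-squared p-value equals erfc(sqrt(X^2 / 2)); tables with more rows or columns need a general chi-squared routine.

```python
from math import erfc, sqrt

# Observed counts: rows = exposure (yes, no), columns = disease (yes, no).
observed = [[37, 13], [17, 53]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_squared = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2)
    for j in range(2)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Valid for df = 1 only: a chi-squared variable with 1 df is a squared standard normal.
p_value = erfc(sqrt(chi_squared / 2))

print(expected, round(chi_squared, 1), df, p_value < 0.001)
```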

ANOVA

Analysis of Variance - used to compare more than two groups
Extension of the independent t-tests
Factor variable - variable defining the groups
Response variable - variable being compared
One way ANOVA

Groups of a single variable
E.g.: Is there a difference in students' marks based on the row they are seated in - front / middle / back?

Two way ANOVA

Two independent variables
E.g.: Do race and gender affect a person's yearly income?

Case Study - One way ANOVA

Marks obtained in the same subject by three students from each of three different schools are given below.
Does the data suggest any association between school and marks?

School	A	B	C
Marks 1	82	83	38
Marks 2	83	78	59
Marks 3	97	68	55

The basic idea in ANOVA: Partition the total variation in the data into the variation between groups and the variation within groups.
Steps:

Calculate the means

School A: mean(82, 83, 97) = 87.3
School B: mean(83, 78, 68) = 76.3
School C: mean(38, 59, 55) = 50.7

Calculate the grand mean

Grand: mean(82, 83, 97, 83, 78, 68, 38, 59, 55) = 71.4

Calculating the variations

Sum of Squared Deviations about the grand mean, across all observed values: SStotal = 2630.2
Sum of Squared Deviations of group mean about the grand mean - three group mean against the grand mean: SSbetween=2124.2
Sum of Squared Deviations of observations within a group about their group mean; added across all groups: SSwithin=506

Calculate the degree of freedom for every variance:

df_total = number of observations -1 = 9-1 = 8
df_between = number of groups -1 = 3 -1 = 2
df_within = number of observations - number of groups = 6

Calculate the Mean Squared Variances

Mean Squared variance between groups MS_between = SS_between / df_between = 2124.2/2 = 1062.1
Mean Squared variance within groups MS_within = SS_within / df_within = 506/6 = 84.3

Calculate the F-statistic

F-value = MS_between/MS_within = 1062.1/84.3 = 12.59

Calculate the p-value from the F-table

P-value for given f-value 12.59 and degree of freedom 2 and 6 is 0.007

Conclusion: since the p-value is less than alpha, we can conclude by rejecting the null hypothesis, that there is a difference in the marks obtained by students belonging to different groups.
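
The steps of the case study can be reproduced in Python. This is a sketch with illustrative variable names. The p-value line relies on a special case: when df_between is exactly 2, the upper tail of the F distribution has the closed form (1 + 2F/df_within)^(-df_within/2). For any other number of groups an F table or a statistics library is needed.

```python
groups = {"A": [82, 83, 97], "B": [83, 78, 68], "C": [38, 59, 55]}

all_marks = [m for marks in groups.values() for m in marks]
grand_mean = sum(all_marks) / len(all_marks)
group_means = {school: sum(marks) / len(marks) for school, marks in groups.items()}

# Partition the total variation into between-group and within-group parts.
ss_total = sum((m - grand_mean) ** 2 for m in all_marks)
ss_between = sum(
    len(marks) * (group_means[school] - grand_mean) ** 2
    for school, marks in groups.items()
)
ss_within = sum(
    (m - group_means[school]) ** 2
    for school, marks in groups.items()
    for m in marks
)

df_between = len(groups) - 1
df_within = len(all_marks) - len(groups)

f_value = (ss_between / df_between) / (ss_within / df_within)

# Closed form for the upper tail of F, valid only because df_between == 2.
p_value = (1 + 2 * f_value / df_within) ** (-df_within / 2)

print(round(ss_total, 1), round(ss_between, 1), round(ss_within, 1))
print(round(f_value, 2), round(p_value, 3))
```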

2.5 Non Parametric Testing

Referred to as "distribution free", as they don't involve making assumptions about the distribution of the data.
They have lower power than the parametric tests and hence are generally given second preference after the parametric tests.
These tests are typically focused on the median rather than the mean.
They involve straightforward procedures like counting and ordering.
There is at least one non-parametric test for each parametric test, and they are classified into the following categories.

Tests of differences between groups (independent samples)
Tests of differences between variables (dependent samples)
Tests of relationship between variables

One usually computes the correlation coefficient.

Non-parametric equivalents to the standard correlation coefficient are

Spearman's R
Kendall's Tau
Coefficient Gamma

When the two variables are categorical, appropriate non-parametric statistics for testing the relationship between them are the chi-squared test, the phi coefficient and Fisher's exact test. In addition, a simultaneous test for relationships between multiple cases is available: Kendall's coefficient of concordance. This test is often used to express inter-rater agreement among independent judges who are rating (ranking) the same stimuli.

Non Parametric Tests

Tests	Parametric	Non Parametric
One Quantitative Response Variable	One Sample t-test	Sign Test
One Quantitative Response Variable - Two Values from Paired Samples	Paired Sample t-test	Wilcoxon Signed Rank Test
One Quantitative Response Variable - One Qualitative Independent Variable with Two Groups	Two Independent Sample t-test	Wilcoxon Rank Sum or Mann Whitney Test
One Quantitative Response Variable - One Qualitative Independent Variable with Three or more Groups	ANOVA	Kruskal-Wallis
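
The sign test from the first row of the table is a good example of the "counting" nature of these tests. The sketch below applies it to paired before/after scores: it counts how many differences are positive and asks how likely such a split is if positive and negative differences were equally likely. The scores are hypothetical, and dropping tied pairs is one common convention rather than the only one.

```python
from math import comb

# Hypothetical paired scores for ten students before and after a training class.
before = [52, 60, 45, 70, 58, 63, 49, 55, 66, 61]
after = [58, 64, 50, 69, 65, 70, 55, 60, 66, 68]

# Keep only the non-zero differences (tied pairs carry no sign).
differences = [a - b for a, b in zip(after, before) if a != b]
n = len(differences)
positives = sum(d > 0 for d in differences)

# Two-sided p-value: under the null hypothesis the signs follow Binomial(n, 0.5).
k = min(positives, n - positives)
p_value = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n)

print(n, positives, p_value)
```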

Correlation

Measure of association between variables

Positive and negative correlation, ranging between +1 and -1

A value of +1, or any positive correlation, implies that if the value of the independent variable increases, the value of the response variable also increases.

Similarly, a value of -1, or any negative correlation, implies that if the value of the independent variable increases, the value of the response variable decreases.

Positive Correlation Example:

Earning and expenditure - more a person earns more he/she spends.

Negative Correlation Example:

Speed and time - As the speed of the vehicle increases the time taken to cover a given distance decreases.

Parametric - normal distribution and homogeneous variance.

Pearson correlation

Non Parametric - no distributional assumption, ordinal (ranked) variables

Spearman correlation

Correlation Coefficient

r: correlation coefficient
-1: Perfectly Negative
+1: Perfectly Positive
0 - 0.2 : No or very weak association
0.2 - 0.4 : Weak association
0.4 - 0.6 : Moderate association
0.6 - 0.8 : Strong association
0.8 - 1 : Very strong to perfect association
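
Both coefficients can be computed from first principles. In this sketch Pearson's r is the usual sum of cross products divided by the square root of the product of the sums of squares, and Spearman's coefficient is Pearson's r applied to the ranks. The ranking helper assumes there are no tied values, and the earnings, expenditure, speed and distance figures are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest) upwards; assumes no tied values."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data for the two examples in the text.
earnings = [30, 40, 50, 60, 70]
expenditure = [22, 30, 33, 45, 50]
speed = [20, 40, 60, 80, 100]
time_taken = [120 / s for s in speed]   # time to cover a fixed distance of 120

print(round(pearson(earnings, expenditure), 3))
print(round(pearson(speed, time_taken), 3), round(spearman(speed, time_taken), 3))
```

Speed and time are perfectly but not linearly related, so Spearman's coefficient reaches -1 while Pearson's r stays above -1.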

Summary

Overview of Statistical Methods
Population, Samples & Sampling Plan and Sampling Methods
Descriptive Statistics - Measure of Central Tendency and Measure of Dispersion
Probability Theory and Distributions
Confidence Interval
What are Tests of Significance
The process flow of hypothesis testing
One Sided and Two Sided Hypothesis Testing
Various Tests used in calculating p-value
What is Non-Parametric Testing and why it is used.
Non-parametric alternatives for the usual tests of significance

Friday, March 25, 2016

1 Business Analytics

Data Analytics

Analytics is the science of analysis, where statistics, data mining, computer technology, etc. are used for analysis.
Analysis is the process of breaking down a complex object or data into simpler forms or more compact or better data for understanding.

Analytics is the science of wisely acquiring meaningful results from given data using various methods and technologies.
Aims at discovering patterns of variation from the given data.
It helps to understand the future from past data and uncertainty related to business.
It's a sophisticated process that uses statistical, mathematical and economic models to predict the future and prescribe strategies.
The processes include:

Gather Data
Organize Data
Analyze Data

Stages of Analytics:

Descriptive Analytics (Information) ~ How many students dropped out last year?
Diagnostic Analytics (Insight) ~ Why has the drop-out rate increased in the last one year?
Predictive Analytics (Insight) ~ Which students are more likely to drop-out?
Prescriptive Analytics (Decision) ~ Which student should I target to keep from dropping out?

Popular Tools used in Analytics

R
Revolution R
R Studio
Tableau
SAP HANA
Weka
KXEN
SAS

Role of a data scientist:

Inquisitive, can look at data and spot trends.
Come out with unrevealed stories hidden in data that helps in creating more useful insights and help solving business problems.
Work in sync with application developer to grant relevant data for analysis.
Make an analytical plan in such a way that the results satisfy the business needs.
Come up with an effective data mining architecture and prepare suitable models
Respond to and resolve data mining performance issues
Generate reports that are affordable from business perspective

Data Analytics Methodology

Discovery
Data Preparing
Model Planning
Deliver Results
Put into use

Problem Definition:

What is the problem?
What is it not?
We have this problem because?
We don't have a solution because?

Defining a problem:

State the problem in a general way.
Understand the nature of the problem
Survey the available literature
Go for discussions for developing ideas
Rephrase the problem into a working proposition.

Types of Data

Qualitative Data

Data expressed as groups or categories
Descriptive Data
e.g. Dividing a population into high medium and low height groups.

Quantitative Data

Data expressed as numbers
Definitive Data
e.g. The height of a person

Summarizing Data

Summarizing is the process of converting huge amounts of raw data into a format that can be easily analyzed.
Summaries differ on type of data; and can be descriptive or graphical
Numeric Data - Descriptive

Mean
Median
Mode

Numeric Data - Graphical

Box Plots
Histogram

Categorical Data - Descriptive

Frequency distribution tables

Categorical Data - Graphical

Bar Charts

Data Collection

Collect Relevant Data:
Process of collecting relevant data that aids in solving the problem statement
Categorize the Data:
Data Collection process needs to be defined and systematic.
Organize the Data:
Observations need to be recorded and organized for optimal usefulness.

Data Collection Methods

Data Collection Methods fall broadly into two categories: Primary and Secondary.

Primary

Observation - Measuring the data and various attributes
Experiment - Subjects are divided into groups
Surveys - Questions and interviews help in reporting feedback and in studying characteristics of the population.

Secondary

Data which has already been gathered before the study and is available as already published facts and reports.

Data Dictionary

A Data Dictionary is a file that describes the structure of the database itself.
It includes details like:

Number of records
Name of each field
Characteristics of each field
Description of each field
Relationship between different fields

It helps in analyzing different data variables and their relationships with each other.

Outliers and their treatment

An outlier is a point or an observation that deviates significantly from the other observations.
Occurs due to experimental errors or "special circumstances"
Outlier detection tests to check for outliers
Outlier treatment

Retention
Exclusion
Other treatment methods
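
The notes do not name a particular detection test, so the sketch below uses one widely used rule as an example: the 1.5 x IQR rule, which flags any point more than 1.5 interquartile ranges below the first quartile or above the third. The data values are hypothetical, and exclusion is shown as just one of the treatment options listed above.

```python
from statistics import quantiles

data = [10, 12, 11, 13, 12, 11, 95]   # hypothetical observations

q1, _, q3 = quantiles(data, n=4)      # first quartile, median, third quartile
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
retained = [x for x in data if lower_fence <= x <= upper_fence]   # exclusion treatment

print(outliers, retained)
```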

Sunday, March 13, 2016

4 Predictive Modeling Techniques

4.1 Predictive Modeling Techniques

Objectives

Understand regression analysis and types of regression models
Know and build a simple linear regression model
Understand and develop a logistic regression model
Learn cluster analysis, types and methods to form clusters
Know time series and its components
Decompose Seasonal and non-seasonal time series
Understand different exponential smoothing methods
Know the advantages and disadvantages of exponential smoothing
Understand the concept of white noise and correlogram
Apply different time series analysis like Box Jenkins, AR, MA, ARMA etc.
Understand all the analysis techniques with case studies

4.2 Regression Analysis

Regression analysis mainly focuses on finding a relationship between a dependent variable and one or more independent variables.

Predict the value of a dependent variable based on one or more independent variables.

Coefficient explains the impact of changes in an independent variable on the dependent variable.

Y = f(X, beta) Regression Models are generally denoted by this equation.

Where, Y is the dependent variable

X is the independent variable

beta is the unknown coefficient

Widely used in prediction and forecasting

Types of regression models

Regression Models

|-------Uni-variate
|       |-------Linear
|       |       |------Simple
|       |       \------Multiple
|       \-------Non Linear
\-------Multivariate
        |-------Linear
        \-------Non Linear

In Uni-variate Models there is a single response variable. It is the simplest form of statistical analysis.
Correspondingly, Multivariate Models refer to models with more than one response variable.
The Uni-variate and Multivariate Models can be further classified as Linear and Non-Linear Models.
In a linear model the data is fitted with a straight line; otherwise the model is considered non-linear.
The Uni-variate Linear Model is further divided into Simple and Multiple. Usually more than one independent variable influences the dependent variable. A regression that uses one independent variable is called a simple regression. A regression that uses two or more independent variables is called a multiple regression.

4.3 Simple Linear Regression

It's a common technique to determine how one variable of interest is affected by another.
It is used for three main purposes:

For describing the linear dependence of one variable on the other.
For predicting values of one variable from values of the other, for which more data are available.
For correcting for the linear dependence of one variable on the other.

A line is fitted through the group of plotted data.
The vertical distance of each plotted point from the line gives the residual value.
The residual is the discrepancy between the actual and the predicted value.
The procedure to find the best fit, which minimises the sum of squared residuals, is called the least-squares method.

Linear Regression Model

The regression model is the equation that represents how a dependent variable is related to an independent variable and an error term.

y = B0 + B1 x + e

Where, B0 and B1 are called parameters of the model, and e is a random variable called error term.

Estimating B0 and B1, and hence E(Y|X) = B0 + B1 x, means finding the best straight line that can be drawn through the scatter plot of Y vs X. This is done with Least Squares (LS) estimates.
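For the simple model the least-squares estimates have a closed form: the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept follows from the fact that the fitted line passes through the point of means. The sketch below uses made-up numbers (an assumption for illustration only) and compares the hand calculation with lm().

```r
# Hypothetical data, for illustration only.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares slope: cross-deviations over squared x-deviations.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Least-squares intercept: the line passes through (mean(x), mean(y)).
b0 <- mean(y) - b1 * mean(x)

# lm() should return the same two numbers.
fit <- lm(y ~ x)
c(b0 = b0, b1 = b1)
coef(fit)
```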

Simple Linear Regression - Graphical understanding.

Diagram here depicts a graphical plotting of linear regression.

The points in blue are the observed value of Y for the corresponding x values.
The straight line is the linear model defined by linear regression model equation discussed in previous section.
B1 is the slope of the line, i.e. when x changes by one unit, y changes by B1. A positive value means that x and y are positively correlated, and a negative value means that the two variables are negatively correlated.
B0 is the value of Y at X = 0 i.e. the intercept of the equation.
The point on the straight line at X = X0 is the predicted value of Y at X = X0. The difference between the observed value and the predicted value is the residual or error term.

Process to build a regression model

Identify the target variable.
Identify the predictors
Data collection
Decide the relationship (Simple Data Analysis and Scatter plot is done to identify this)
Fit the model (derive a mathematical equation to help predict the response variable)
Evaluate the model (To check the efficiency of the fitted model in predicting the outcomes)

Linear Regression Model Assumptions

The predictor variable x is non-random
The error term e is random
Error term follows normal distribution
Standard Deviation of error is independent of x.
The data being used to estimate the parameters should be independent of each other
If any of the above assumptions are violated, modelling procedure must be modified.
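Some of these assumptions can be checked informally from the residuals of a fitted model. The sketch below is one possible set of checks, not a complete diagnostic procedure; it assumes R's built-in faithful data set as example data.

```r
# Fit a simple linear model on the built-in faithful data set.
fit <- lm(waiting ~ eruptions, data = faithful)
res <- residuals(fit)

# Residuals vs fitted values: curvature suggests non-linearity,
# a funnel shape suggests the error spread depends on x.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points far from the line suggest non-normal errors.
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: a small p-value is evidence against normality.
shapiro.test(res)
```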

4.4 Coefficient of Determination - R^2

A measure of goodness of fit: how well does your model fit the data?

We will now look at how different values of the correlation coefficient R, and of R^2, are interpreted.

In the first figure the line is perfectly horizontal and R is 0 (so R^2 is 0), which implies no linear relationship.
In the second figure R is -1 (so R^2 is 1), which implies a perfect negative linear relationship.
In the last figure R is +1 (so R^2 is 1), which implies a perfect positive linear relationship.
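In simple linear regression R^2 is the square of the correlation coefficient R, so R ranges from -1 to +1 while R^2 ranges from 0 to 1. A small sketch with made-up data (an assumption for illustration) covering the three situations:

```r
# Made-up, symmetric x values so the correlations come out cleanly.
x <- c(-2, -1, 0, 1, 2)

y_none <- x^2        # no linear trend: the least-squares line is horizontal
y_neg  <- 1 - 2 * x  # points lie exactly on a falling line
y_pos  <- 1 + 3 * x  # points lie exactly on a rising line

r <- c(none = cor(x, y_none), negative = cor(x, y_neg), positive = cor(x, y_pos))
r    # correlation coefficient R for each case
r^2  # coefficient of determination R^2 for each case
```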

4.5 How good is the model?

Based on R^2 value, we can explain how well the model explains the data and the percentage of differences that are explained by this model.
The differences between observations that are not explained by the model are the error term or residual.
Suppose we have a case in which R^2 value is 0.74. This means that 74% of variance in the values of the dependent variable is explained by the model and the remaining 26% which is not explained is its residual or error term.
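The "share of variance explained" can be computed directly as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes R's built-in faithful data set as example data and compares the hand calculation with the value reported by summary().

```r
fit <- lm(waiting ~ eruptions, data = faithful)

# Unexplained variation: sum of squared residuals.
sse <- sum(residuals(fit)^2)

# Total variation of the response around its mean.
sst <- sum((faithful$waiting - mean(faithful$waiting))^2)

r2 <- 1 - sse / sst
r2
summary(fit)$r.squared  # same value, as reported by R
```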

4.6 How to find linear regression equation?

SUBJECT	AGE(X)	GLUCOSE LEVEL(Y)	XY	X^2	Y^2
1	43	99	4257	1849	9801
2	21	65	1365	441	4225
3	25	79	1975	625	6241
4	42	75	3150	1764	5625
5	57	87	4959	3249	7569
6	59	81	4779	3481	6561
Sum	247	486	20485	11409	40022

We will now look at an example to find linear regression equation.

In our example we will look at two variable. Age as X and corresponding Glucose Level as Y.

We are going to see how to make a linear regression line for these variable.

The general equation for regression analysis is Y' = A + BX.

In order to calculate this equation manually we need to calculate three more variables.

XY, X^2 and Y^2 values. Then we calculate the sigma values by summing up the values of all these variables.

ΣX = 247

ΣY = 486

Σ(XY) = 20485

ΣX^2 = 11409

ΣY^2 = 40022

To derive the regression equation we need to find the intercept A and the coefficient B of the independent variable.

To calculate the value of intercept A use the formula.

A = [ ΣY * Σ(X^2) - ΣX * Σ(XY) ] / [ N * Σ(X^2) - (ΣX)^2 ]

= ( 486 * 11409 - 247 * 20485) / (6 * 11409 - 247^2)

= 65.14157152

Then we need to calculate B, the coefficient of the independent variable.

To calculate the value of B use the formula.

B = [ N * Σ(XY) - ΣX * ΣY ] / [ N * Σ(X^2) - (ΣX)^2 ]

= ( 6 * 20485 - 247 * 486) / (6 * 11409 - 247^2)

= 0.385224983

From these values of A and B we can obtain the final regression equation:

Y' = A + BX

= 65.14157152 + 0.385224983 * X

From the above equation we can calculate the predicted Y value for a new X value by substituting that X value.

Equivalently, once B is known the intercept can be computed as:

A = (ΣY - B * ΣX) / N
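The hand calculation can be cross-checked in R. The sketch below re-enters the six age and glucose values from the table, applies the summation formulas, and compares the result with lm(). The age of 50 used for the prediction is a hypothetical input.

```r
# Data from the table above.
age     <- c(43, 21, 25, 42, 57, 59)
glucose <- c(99, 65, 79, 75, 87, 81)
n <- length(age)

# Slope and intercept from the summation formulas.
B <- (n * sum(age * glucose) - sum(age) * sum(glucose)) /
     (n * sum(age^2) - sum(age)^2)
A <- (sum(glucose) - B * sum(age)) / n
c(A = A, B = B)

# The same estimates from lm().
fit <- lm(glucose ~ age)
coef(fit)

# Predicted glucose level for a hypothetical 50-year-old subject.
predict(fit, newdata = data.frame(age = 50))
```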

4.7 Commands to perform linear regression in R

R provides comprehensive support for linear regression. To fit a linear regression model in R we use the lm() function. We will use R's built-in help to check the functionality of lm().

Refer: http://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html

lm is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (although aov may provide a more convenient interface for these).

The lm function takes various arguments:

lm(formula, data, subset, weights, na.action,
   method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
   singular.ok = TRUE, contrasts = NULL, offset, ...)

We will look only at the arguments most frequently used in regression analysis.

formula

an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’.

data

an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called.

The formula in the lm model is specified using the tilde (~) symbol. The syntax is response ~ terms. The response is the numeric dependent variable and the terms are the independent (predictor) variables for that response. We can use the plus (+) symbol to include more predictor terms.

The object returned by lm has various components. Two of the most important components are coefficients and residuals.

coefficients - a named vector of coefficients

residuals - the residuals, that is response minus fitted values.

Demo: how to draw the regression line for the given data set in R.

The scatter plot is generally used to plot the quantitative variables and display them as geometric points inside a Cartesian diagram.

To do this let us take the faithful data set, which ships with R.

> head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55

The data set contains two variables, eruptions and waiting. The eruptions column gives the duration of an eruption in minutes and waiting gives the waiting time until the next eruption, also in minutes.

First let us make a scatter plot of eruption duration and waiting interval and then try to find out if there is any relationship between the variables.

> attach (faithful)

> plot(eruptions, waiting, xlab="Eruption Duration", ylab="Time Waited")

The resulting plot shows that there is a positive linear relationship between eruption duration and time waited.

In other words, longer waiting times tend to go with longer eruptions.

We can also fit a linear regression model with the lm function and then draw a trend line using the abline function.

> abline(lm(waiting~eruptions))
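The same plot and trend line can be produced without attach(), by passing the data frame through the data argument and keeping the fitted model in a variable so that it can be reused. This is a sketch of one alternative style; the eruption duration of 3.5 minutes used for the prediction is a hypothetical input.

```r
# Fit once and keep the model object.
fit <- lm(waiting ~ eruptions, data = faithful)

# Scatter plot with the fitted regression line on top.
plot(waiting ~ eruptions, data = faithful,
     xlab = "Eruption Duration", ylab = "Time Waited")
abline(fit)

# Estimated intercept and slope.
coef(fit)

# Predicted waiting time after a hypothetical 3.5-minute eruption.
pred <- predict(fit, newdata = data.frame(eruptions = 3.5))
pred
```

Keeping the model in fit avoids refitting it for every later step, and using data = faithful avoids name clashes that attach() can cause.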

4.8 Linear Regression to Predict Sales : Case Study 1

Chip Chops is a global ice-cream manufacturer specializing in fruit and nut flavored ice-cream, with a customer base spanning almost all parts of the world. The company wants insights into individual and social consumption patterns so that it can make changes to its business that raise the consumption of ice-cream per person. It hired Richard, an R expert, to come up with insights that may help improve profitability and increase the consumption rate. The data was readily available, and the firm wanted Richard to work on a sample first before trusting him with the full project.

The sample holds 4 variables. The first is the consumption of ice-cream per person, a numeric variable. The second is the average family income per week in US dollars. The third is the price of ice-cream per unit. The fourth is the average temperature in the city in degrees Fahrenheit. The company's aim is to attract more customers, increase the consumption rate and ultimately increase its profits. To support this, Richard needs to find the factors that have an important impact on the consumption rate.

To find this relationship he performs a linear regression. The regression analysis helps in finding the relationship between the factors and in predicting the future consumption rate. A linear regression has one dependent variable and one or more independent variables. Here the consumption rate is what we want to explain, so it is the dependent variable; the remaining three variables (average income, price and temperature) are the independent variables. We will now perform the linear regression to see how they affect the consumption rate.

To perform linear regression in R, the first step is to load the data set into the R workspace.
> data <- read.csv("E:/RWD/SimpliLearn/Video2-Icecream.csv")
> View(data)

We now use the lm function to regress the dependent variable on the independent variables and store the result in a new variable. In our example:

> analysis <- lm(cons~income+price+temp,data=data)
> summary(analysis)

Call:
lm(formula = cons ~ income + price + temp, data = data)

Residuals:
      Min        1Q    Median        3Q       Max
-0.065302 -0.011873  0.002737  0.015953  0.078986

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.1973151  0.2702162   0.730  0.47179
income       0.0033078  0.0011714   2.824  0.00899 **
price       -1.0444140  0.8343573  -1.252  0.22180
temp         0.0034584  0.0004455   7.762  3.1e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.03683 on 26 degrees of freedom
Multiple R-squared: 0.719, Adjusted R-squared: 0.6866
F-statistic: 22.17 on 3 and 26 DF, p-value: 2.451e-07

>

The output first shows the call, i.e. the formula that was used for the regression analysis.
Next come the residuals, summarised by their minimum, first quartile, median, third quartile and maximum.
The most important interpretation is made from the coefficients table. It gives the estimated intercept and the estimated coefficient of each independent variable.
From these estimates we construct the regression equation, which describes the relationship between the dependent and independent variables.
We can also predict future values from this equation.
The accuracy of these predictions depends on the size of the residuals: the smaller the residuals, the more accurate the predicted values. Here the residuals are small, ranging from about -0.065 to 0.079.
From the obtained coefficients we can write the regression equation as:

cons = 0.1973 + 0.0033 * income - 1.0444 * price + 0.0035 * temp + residuals

From this regression equation we can predict future consumption by plugging in values of income, price and temperature.
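As a sketch of such a prediction, the snippet below plugs input values into the equation using the coefficients copied from the summary output. The input values (income 80, price 0.27, temperature 60) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(analysis) output above.
b <- c(intercept = 0.1973151, income = 0.0033078,
       price = -1.0444140, temp = 0.0034584)

# Hypothetical inputs: weekly income, unit price, temperature in Fahrenheit.
new_obs <- c(income = 80, price = 0.27, temp = 60)

# Intercept plus each coefficient times its input value.
predicted_cons <- unname(b["intercept"] + sum(b[names(new_obs)] * new_obs))
predicted_cons

# With the fitted object in the workspace the same value comes from:
# predict(analysis, newdata = data.frame(income = 80, price = 0.27, temp = 60))
```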

The results also display a p-value for each coefficient. From the low p-values of the income and temp variables, and from the significance codes, we can see that these two variables are statistically significant. Together with their positive coefficients, this suggests that when family income is higher and the temperature is higher, the consumption of ice-cream per person is also higher. The price coefficient is negative, which points in the expected direction (a lower price goes with higher consumption), but its p-value of about 0.22 is well above 0.05, so in this sample the price effect is not statistically significant.

The residual standard error is about 0.037, in the same units as cons. It is the typical size of the gap between observed and fitted consumption.
The results also provide the Multiple R-Squared and Adjusted R-Squared values.
The Multiple R-Squared is about 0.72, i.e. about 72% of the variance in the dependent variable is explained by the three given factors.

Finally the F-statistic and its p-value are displayed. The p-value is far below 0.05, so we can reject the null hypothesis that all slope coefficients are zero and conclude that there is some linear relationship between the dependent variable and the independent variables.
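The overall p-value and the Adjusted R-Squared can be reproduced from the other numbers in the output. The sketch below uses the rounded values printed by summary(), so the results agree with the output only up to rounding.

```r
# Values copied from the summary output above.
f_stat <- 22.17
df1 <- 3    # number of predictors
df2 <- 26   # residual degrees of freedom
r2 <- 0.719

# p-value of the F-test: upper tail of the F distribution.
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
p_value

# Adjusted R-squared penalises R-squared for the number of predictors.
n <- df1 + df2 + 1   # number of observations
adj_r2 <- 1 - (1 - r2) * (n - 1) / df2
adj_r2
```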

We can summarize the results as follows:
A lower price of ice-cream is associated with a higher consumption rate, although this effect is not statistically significant in this sample.
If the income of people and the temperature of a particular locality are higher, the demand for ice-cream and the consumption rate per individual tend to be higher.
These insights can help the company make decisions on pricing and so improve its profitability.

4.9 Linear Regression: Case Study 2

An analytics startup works on analytics for reverse logistics. Its employees have different amounts of working experience and draw different salaries. The company decided to monitor the working skills of each employee and introduced a score board rating, in which each employee is awarded a score based on his or her working performance; a better performance in the period earns a higher score. The company wants to analyse and confirm whether more experience and a higher score card rating lead to a higher salary.

To analyze this we need data on the salary, experience and score rating of individual employees. Since it is a relatively new startup it has only 20 employees, and we have the details of all of them. The first variable is the years of relevant work experience of the employee. The second variable is the score rating that the employee earned for his or her performance. The third variable is the employee's salary in thousands of dollars. To find the relationship between these variables we fit a linear regression model in R. With it we can establish the relationship between the variables and also predict future values of the dependent variable from the regression equation.

Before performing regression analysis in R, we need to load the data set in R for which we are going to perform analysis:

> mydata <- read.csv("E:/RWD/SimpliLearn/Video3-rating.csv")

> head(mydata)
  experience score salary
1          4    78   24.0
2          7   100   43.0
3          1    86   23.7
4          5    82   34.3
5          8    86   35.8
6         10    84   38.0

In our example we need to check whether the salary depends on the score card rating and work experience. So the dependent variable is salary and the independent variables are score and experience.

> fit <- lm(salary~score+experience,data=mydata)
> summary(fit)

Call:
lm(formula = salary ~ score + experience, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max
-4.3586 -1.4581 -0.0341  1.1862  4.9102

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.17394    6.15607   0.516  0.61279
score        0.25089    0.07735   3.243  0.00478 **
experience   1.40390    0.19857   7.070 1.88e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.419 on 17 degrees of freedom
Multiple R-squared: 0.8342, Adjusted R-squared: 0.8147
F-statistic: 42.76 on 2 and 17 DF, p-value: 2.328e-07

The first line of the result, the call, shows the formula that was specified for the regression analysis.

Next come the residuals, summarised by their minimum, first quartile, median, third quartile and maximum.

The most important interpretation is made from the coefficients table.
It gives the estimate, standard error, t-statistic and p-value for the intercept and for both score and experience (the independent variables).
From these estimates we construct the regression equation, which describes the relationship between the dependent and independent variables:

Salary = 3.174 + 1.404 * experience + 0.251 * score + residuals

Using this regression equation we can predict future values of salary from the experience and score rating of each employee.
The residuals should be small, since large residuals indicate a poorly fitting model.
The smaller the residuals, the more accurate the predicted values.
Since both coefficients are positive we can also conclude from the given data that more experience and a higher score rating both go with a higher salary.
The residual standard error is about 2.4, i.e. about $2,400 since salary is in thousands of dollars, so the typical deviation between predicted and actual salary is small compared with the salaries themselves.

This means that the model predicts salary reasonably well for this data.

The t-values of both experience and score are high, and their p-values are very low and less than 0.05.

This shows that we can reject the null hypothesis and conclude that there is a relationship between the dependent variable and each independent variable. The significance codes also suggest the same.
Next the residual standard error and its degrees of freedom are displayed.

The results also contain the Multiple R-Squared and Adjusted R-Squared values.

The Multiple R-Squared of about 0.83 (Adjusted R-Squared about 0.81) indicates a strong fit. In other words, about 83% of the variation in salary can be explained by the experience and score rating factors. Because both coefficients are positive, salary moves up when the independent variables move up.

In terms of the regression equation we can draw conclusions such as: salary is expected to be about $1,404 higher for each additional year of experience when the score is held at the same level.
The insight we draw from this analysis is that in this analytics company an employee with more working experience and a higher score rating tends to draw a higher salary, and correspondingly an employee with less experience and a lower score rating tends to draw a lower salary.

This confirms our earlier assumption.

From this analysis we can also predict what salary should be offered to a newly joined employee of this company, based on his or her working experience and score rating.
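A sketch of such a prediction, using the coefficients copied from the summary output. The helper predict_salary and the new employee's values (5 years of experience, a score of 80) are hypothetical and used only for illustration.

```r
# Hypothetical helper built from the coefficients in summary(fit) above.
# Returns the predicted salary in thousands of dollars.
predict_salary <- function(experience, score) {
  3.17394 + 0.25089 * score + 1.40390 * experience
}

# Predicted salary for a hypothetical new employee.
new_salary <- predict_salary(experience = 5, score = 80)
new_salary

# Effect of one more year of experience at the same score.
one_more_year <- predict_salary(6, 80) - predict_salary(5, 80)
one_more_year

# With the fitted object in the workspace the same prediction comes from:
# predict(fit, newdata = data.frame(experience = 5, score = 80))
```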

4.10 Case Study - Classification using Linear Regression

By now we know several uses of linear regression and why it is one of the most widely used models around. This case study shows how linear regression can also be used for classification. Surprised? Let's see how. We will first install the "mlbench" package in R.

install.packages("mlbench")
library(mlbench)
data(PimaIndiansDiabetes2)
head(PimaIndiansDiabetes2)

We already know how to install a package. After installing the package, it needs to be loaded with the library() command. We will be using a data set called "PimaIndiansDiabetes2" from the package. The data covers 768 women of Pima Indian heritage who live in the area of Phoenix, Arizona, of whom 392 have complete records. They were tested for diabetes. The goal is to get a classification rule for the diagnosis of diabetes.
The variables are:
pregnant - number of times pregnant
glucose - plasma glucose concentration (glucose tolerance test)
pressure - diastolic blood pressure in mm Hg
triceps - triceps skin fold thickness in mm
insulin - 2-hour serum insulin in mu U/ml
mass - body mass index, i.e. weight in kg / (height in m)^2
pedigree - diabetes pedigree function
age - age in years
diabetes - positive or negative

Here we need to classify each record as positive or negative.
Let us look at the complete data. We see that there are many missing values (NA), which can be removed or imputed.
For the sake of simplicity we will remove the rows that contain them.

For that we use the na.omit() function.

pidna <- na.omit(PimaIndiansDiabetes2)

This reduces our dataset to 392 records. For linear regression we need numerical values. All the columns are numerical except diabetes, which is categorical, so we convert it with the as.numeric() function. R codes the factor levels as 1 for neg and 2 for pos, so we subtract 1 to get 0 for negative and 1 for positive. We then use the head() function to check that the diabetes column has been converted to 0 and 1.

pidlm <- pidna
pidlm$diabetes <- as.numeric(pidna$diabetes)-1
head(pidlm)
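The effect of as.numeric() on a factor, and of subtracting 1, can be seen on a small hand-made factor. The four values below are an assumption for illustration; they mimic the neg/pos levels of the diabetes column.

```r
# A toy factor with the same two levels as the diabetes column.
status <- factor(c("neg", "pos", "pos", "neg"), levels = c("neg", "pos"))

# as.numeric() returns the internal level codes: 1 for neg, 2 for pos.
codes <- as.numeric(status)
codes

# Subtracting 1 gives the 0/1 coding needed for the regression.
zero_one <- codes - 1
zero_one
```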

Now the data set can be used to apply linear regression.
From this data set we create a train set from the first 300 records; the remaining 92 records are kept in a variable called test.

test=pidlm[(301:392),]
train=pidlm[(1:300),]