Saturday, March 26, 2016

2 Statistical Concepts and their application in business

Objectives


  • Statistical Methods
  • Population and Samples
  • Sampling Plan and Sampling Methods
  • Descriptive Statistics and components
  • Probability Theory and Distributions
  • Confidence Interval
  • Concept of Tests of Significance
  • One Sided and Two Sided Hypothesis Testing
  • Various Tests of Significance
  • Non-Parametric Testing

2.1 Statistical Methods

Statistics is a branch of applied (business) mathematics in which we collect, organize, analyze, and interpret numerical facts.

  • Descriptive Statistics
    • Measure of Central Tendency
    • Measure of dispersion
    • Sample
  • Inferential Statistics
    • Estimation
    • Hypothesis Testing
    • Population

Population and Samples

  • Population is any entire collection of objects or observations from which we may collect data. It is the entire group we are interested in, which we wish to describe or draw conclusions about.
  • For each population there are many possible samples
  • It is important that the investigator carefully and completely defines the population before collecting the sample, including a description of the members to be included. 
  • A sample is a group of units selected from a larger group (the population). By studying the sample, we hope to draw valid conclusions about the larger group.
  • A sample is generally selected for study because the population is too large to study in its entirety. The sample should be representative of the general population. This is often best achieved by random sampling.

Developing a sampling plan 

  • Define the target population - in terms of number of elements, sampling unit, extent and time.
  • Select a sampling method - probability or non-probability sampling.
  • Obtain the sampling frame - the list from which the sample is drawn; it must contain all potential units of the target population.
  • Determination of sample size - for desired level of accuracy.
  • Choose data collection method - procedure to obtain the data.
  • Develop operational plan - which technique fits the best.
  • Execute operational plan - verification of specified procedure.

Sampling Techniques

  • Sampling
    • Probability
      • Simple Random - Purest form, where every member has equal probability of participating.
      • Systematic - Selection of elements from an ordered sampling frame.
      • Stratified - Dividing the members of the population into homogeneous sub groups (strata) before sampling.
      • Cluster - When natural but relatively homogeneous groupings are evident in a population.
    • Non-Probability
      • Convenience - Sample that is convenient to collect.
      • Judgemental - Sample is selected with a specific attribute based on the judgement of the researcher.
      • Quota - Two stages of judgemental sampling: first develop control categories or quotas, then collect the sample based on convenience or judgement to fill each quota.
      • Snowball - An initial group of respondents is selected, usually at random or from contacts of existing customers; further respondents are then recruited through referrals from that group.
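
The three most mechanical probability methods above can be sketched in a few lines. This is a minimal illustration, not a full sampling plan: the sampling frame of 100 customer ids, the "low"/"high" strata and the function names are all hypothetical.

```python
import random

def simple_random_sample(frame, k, seed=0):
    """Every member of the frame has the same chance of being selected."""
    return random.Random(seed).sample(frame, k)

def systematic_sample(frame, k, seed=0):
    """Pick a random start, then take every step-th element of the ordered frame."""
    step = len(frame) // k
    start = random.Random(seed).randrange(step)
    return [frame[start + i * step] for i in range(k)]

def stratified_sample(frame, stratum_of, fraction, seed=0):
    """Draw the same fraction at random from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

# Hypothetical sampling frame: customer ids 1..100
frame = list(range(1, 101))
srs = simple_random_sample(frame, 10)
systematic = systematic_sample(frame, 10)
stratified = stratified_sample(
    frame, lambda cid: "low" if cid <= 50 else "high", 0.1
)
```

Fixing the seed only makes the sketch repeatable; a real study would not reuse the same seed for every draw.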

Descriptive Statistics

Descriptive statistics analyze data to extract meaningful information.
They are non-conclusive, as they are limited to the data being analyzed.
Score Range    Number of Students
Below 40       20
40-50          22
50-60          33
60-70          21
70-80          13
>80            5
Total          114

Histograms are used to graphically represent the data.

Measures of Central Tendencies

A measure of central tendency is a method of descriptive statistics which:
Represents a data set with a single value
Is also called a measure of central location
Falls into the category of summary statistics
  • Mean 
  • Median
  • Mode 

Mean

Mean is the average of the numbers
A calculated "central" value of a set of numbers
Mean = (Sum of all numbers)/(Count of all numbers)
Example data: 2 2 6 10
Mean = (2+2+6+10)/4 = 20/4 = 5

Median

Median is the number in the middle
Number of values above and below median are same
3 3 4 5 7 8
3 3 (4 5) 7 8
Median = (4+5)/2 = 4.5

Mode

Calculate the frequency of occurrence of each value
Mode is the value that occurs most often
Let us look at the histogram below (x=data, y=frequency)
    ^
Freq|  20  23 (33) 22  17  06  02
    ______________________________> Data
       00  01  02  03  04  05  06

It is clearly visible that number 02 has highest frequency in the data set.
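
The three worked examples above can be checked with Python's standard `statistics` module. The variable names are illustrative, and the observation list for the mode is rebuilt from the frequencies shown in the histogram.

```python
from collections import Counter
from statistics import mean, median, mode

mean_value = mean([2, 2, 6, 10])            # (2 + 2 + 6 + 10) / 4 = 5
median_value = median([3, 3, 4, 5, 7, 8])   # average of the two middle values = 4.5

# Frequencies read off the histogram: value -> number of occurrences
frequencies = {0: 20, 1: 23, 2: 33, 3: 22, 4: 17, 5: 6, 6: 2}
observations = list(Counter(frequencies).elements())
mode_value = mode(observations)             # most frequent value = 2

print(mean_value, median_value, mode_value)
```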

When to use what?

Mean
  • The average is required for statistical analysis
  • The variable is continuous/ discrete
  • Mean is commonly used in case of quantitative variables

Median
  • The variable is discrete
  • There are abnormal extreme values/Non-normal data
  • The characteristic under study is qualitative

Mode
  • Least commonly used
  • The variable is discrete
  • There are abnormal extreme values
  • The characteristic under study is qualitative

Measure of Dispersion

Measures of dispersion describe the amount of heterogeneity or variation within a distribution of scores,
that is, the spread of a set of scores around some central value.
Measures of dispersion include:

  • Variance
  • Standard Deviation

Variance and Standard Deviation

Variance is the average of the squared deviations about the mean. For a sample of n values, the sum of squared deviations is divided by n - 1:

s² = Σ(x - x̄)² / (n - 1)

Standard deviation is the square root of the variance:

s = √( Σ(x - x̄)² / (n - 1) )

Dividing by n instead of n - 1 gives the population variance σ².

Example data: 2, 5, 5, 4, 6, 8
  • n = 6
  • Mean = ( 2 + 5 + 5 + 4 + 6 + 8 ) / 6 = 30 / 6 = 5
  • Variance = ( (2-5)² + (5-5)² + (5-5)² + (4-5)² + (6-5)² + (8-5)² ) / (6 - 1)
             = ( (-3)² + (0)² + (0)² + (-1)² + (1)² + (3)² ) / 5
             = ( 9 + 0 + 0 + 1 + 1 + 9 ) / 5 = 20 / 5 = 4
  • Std. Deviation (s) = √Variance = √4 = 2
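
The worked example divides the sum of squared deviations by 5, that is by n - 1, which is the sample variance. A short check with the standard `statistics` module shows both conventions side by side (variable names are illustrative):

```python
from statistics import pvariance, stdev, variance

data = [2, 5, 5, 4, 6, 8]

sample_var = variance(data)        # divides by n - 1 = 5
population_var = pvariance(data)   # divides by n = 6
sample_sd = stdev(data)            # square root of the sample variance

print(sample_var, round(population_var, 2), sample_sd)
```

The population version gives 20/6, so which divisor is used matters for small data sets like this one.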

Case Study - Descriptive Statistics

Business Case: A telecommunications company maintains a customer database that includes, among other things, information on how much each customer spent on long distance, toll-free, equipment rental, calling card, and wireless services in the previous month.

The telecom company surveyed 1000 of its customers on all the above services.

Use descriptive analysis to study customer spending and determine which services are most profitable.

Service                     N      Valid N   Min    Max      Mean    Std. Dev.
Long distance last month    1000   1000      0.90   99.95    11.72   10.36
Toll free last month        1000   475       0.00   173.00   13.27   16.90
Equipment last month        1000   386       0.00   77.70    14.21   19.07
Calling card last month     1000   678       0.00   109.25   13.78   14.08
Wireless last month         1000   296       0.00   111.95   11.58   19.72

From the above table we can draw the following insights:
  • On average, customers spend the most on equipment rental, but there is a lot of variation in the amount spent.
  • Customers with calling card service spend only slightly less, on average, than equipment rental customers, and there is much less variation in the values.
  • The real problem here is that most customers don't have every service, so a lot of 0's are being counted. One solution is to treat 0's as missing values, so that the analysis for each service becomes conditional on having that service.

2.2 Probability Theory

Probability is a branch of mathematics that deals with the uncertainty of an event happening in the future.
Probability values always lie between 0 and 1.

Probability of an event, P(E) = (No. of favorable occurrences)/(No. of possible occurrences)

Let us take a simple example of tossing an unbiased coin.
The probability of getting a head is 1/2, and the probability of getting a tail is also 1/2.

Assigning Probabilities

Classic Method - based on equally likely outcomes.
E.g. Rolling a die.
The probability of each number 1, 2, 3, 4, 5, 6 occurring out of total 6 outcomes is 1/6.

Relative frequency method - based on experimentation or historical data.
E.g. A car agency has 5 cars. Its past record, shown in the table, gives the number of cars used per day over the past 60 days.

No. of Cars Used   No. of Days   Probability
0                  3             3/60 = 0.05
1                  10            10/60 = 0.17
2                  16            16/60 = 0.27
3                  15            15/60 = 0.25
4                  9             9/60 = 0.15
5                  7             7/60 = 0.12

Subjective method - based on judgement.
E.g.: 75% chance that England will adopt the Euro currency by 2020.

Probability Distribution 

Probability distribution for a random variable gives information about how the probabilities are distributed over the values of that random variable. 
It's defined by f(x) which gives probability of each value.
E.g. Suppose we have sales data for air conditioner (AC) sales over the last 300 days.

Units Sold   No. of Days   Probability f(x)
0            10            10/300 = 0.03
1            55            55/300 = 0.18
2            150           150/300 = 0.50
3            55            55/300 = 0.18
4            25            25/300 = 0.08
5            5             5/300 = 0.02
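
The relative-frequency table above can be turned into f(x) directly. The sketch below also computes the expected number of units sold per day; that figure is not part of the original table and is included only as an illustration of what the distribution allows.

```python
# Days on which each number of units was sold (from the table)
days_by_units = {0: 10, 1: 55, 2: 150, 3: 55, 4: 25, 5: 5}

total_days = sum(days_by_units.values())
f = {units: days / total_days for units, days in days_by_units.items()}

# Expected value: each outcome weighted by its probability
expected_units = sum(units * p for units, p in f.items())

for units, p in f.items():
    print(units, round(p, 2))
print("expected units per day:", round(expected_units, 2))
```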


Binomial Distribution

A binomial experiment satisfies:

  • A fixed number of trials
  • Each trial has only two possible outcomes (success or failure)
  • Each trial is independent of the others
  • The probability of each outcome remains constant from trial to trial.

Examples of binomial experiments:
  • Tossing a coin 20 times: what is the probability of getting a head 5 times?
  • Drawing a card from a pack of 52 several times, replacing it each time, and counting how often the king of diamonds comes up.

Case Study - Binomial distribution

Example of binomial distribution: Amir buys a chocolate bar every day during a promotion that says one out of six chocolate bars has a gift coupon within.
Answer the following questions:

  1. What is the distribution of the number of chocolates with gift coupons in seven days?
  2. What is the probability that Amir gets no chocolates with gift coupons in seven days?
  3. Amir gets no gift coupons for the first six days of the week. What is the chance that he will get one on the seventh day?
  4. Amir buys a bar everyday for six weeks. What is probability that he gets at least three gift coupons?
  5. How many days of purchase are required so that Amir's chance of getting at least one gift coupon is 0.95 or greater?

(Assume that the conditions of binomial distributions apply: the outcomes for Amir's purchases are independent, and the population of chocolate bars is effectively infinite.)

Steps:
Formula: P(X = r) = nCr p^r q^(n-r)
Where,
n is the no. of trials
r is the number of successful outcomes
p is the probability of success
q is the probability of failure

Other important formulas include
p+q=1
q=1-p

Thus,
p=1/6
q=1-(1/6)=5/6


1. Distribution of the number of chocolates with gift coupons in 7 days:

P(X = r) = 7Cr (1/6)^r (5/6)^(7-r), for r = 0, 1, ..., 7

2. Probability of getting no coupon in 7 days:

P(X = 0) = 7C0 (1/6)^0 (5/6)^(7-0) = (5/6)^7 ≈ 0.279

3. Probability of winning a coupon on the 7th day: 1/6 (the trials are independent, so the first six days do not matter)

4. Probability of winning at least 3 coupons in six weeks (n = 42 days):
   P(X>=3) = 1 - P(X<=2)
           = 1 - (P(X=0)+P(X=1)+P(X=2))
           = 1 - (0.0005 + 0.0040 + 0.0163)
           = 0.979
5. Number of purchase days required so that the probability of at least one coupon is 0.95 or greater:
   P(X>=1) >= 0.95 (as per the binomial distribution)
   >> P(X=1) + P(X=2) + ... + P(X=n) >= 0.95, but since the sum of all P(X=r) is 1,
   >> 1 - P(X=0) >= 0.95
   >> P(X=0) <= 0.05
   >> (5/6)^n <= 0.05; take logs of both sides to solve this exponential equation
   >> n log(5/6) <= log(0.05)
   >> n >= log(0.05) / log(5/6) ≈ 16.43 (the inequality flips because log(5/6) is negative)
   >> that is, n = 17 days minimum.
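
The answers to questions 2, 4 and 5 can be reproduced with the binomial formula and the standard `math` module. The helper name `binom_pmf` is illustrative.

```python
from math import ceil, comb, log

def binom_pmf(r, n, p):
    """P(X = r) = nCr * p^r * q^(n-r), with q = 1 - p."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

p = 1 / 6  # chance that a bar contains a coupon

# Question 2: no coupon in 7 days
p_none_in_7 = binom_pmf(0, 7, p)

# Question 4: at least 3 coupons in six weeks (42 days)
p_at_least_3_in_42 = 1 - sum(binom_pmf(r, 42, p) for r in range(3))

# Question 5: smallest n with (5/6)^n <= 0.05
days_needed = ceil(log(0.05) / log(1 - p))

print(round(p_none_in_7, 3), round(p_at_least_3_in_42, 3), days_needed)
```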


Normal Distribution


  • A normal distribution is a theoretical model of the whole population.
  • It is perfectly symmetrical about its central value, the mean mu.
  • It is also called the bell curve.
  • The standard normal distribution is the special case with mean 0 and std. dev. of 1.
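
A quick way to explore the bell curve is `statistics.NormalDist` from the Python standard library. The sketch below uses the standard normal (mean 0, std. dev. 1) and one made-up example, exam scores with mean 60 and std. dev. 10, which is an assumption for illustration only.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)   # standard normal distribution

# Share of the population within 1 and 2 standard deviations of the mean
within_1_sd = z.cdf(1) - z.cdf(-1)
within_2_sd = z.cdf(2) - z.cdf(-2)

# Hypothetical example: scores with mean 60 and std. dev. 10
scores = NormalDist(mu=60, sigma=10)
p_above_80 = 1 - scores.cdf(80)

print(round(within_1_sd, 4), round(within_2_sd, 4), round(p_above_80, 4))
```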


Poisson distribution

Discrete probability distribution for events that happen randomly in time
Following conditions need to be satisfied:
  • The event results in a success or a failure.
  • The average number of successes, mu, is known.
  • Probability of success is proportional to the size of the region or the length of the time interval.
  • Probability of success in an extremely small region/time is almost zero.
  • Properties: Mean and variance are equal and both denoted by mu.


Examples:

  • Average number of houses sold by a company is 5 per day. What is the probability that exactly 4 houses will be sold tomorrow?
  • Average number of births in a hospital is 2.1 births per hour. What is the probability that there will be exactly 6 births in the next 2 hours?
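
Both example questions can be answered with the Poisson probability P(X = k) = e^(-mu) mu^k / k!. The formula is the standard one for this distribution; the function name is illustrative. For the births example the rate is scaled to the 2-hour window, so mu = 2.1 * 2 = 4.2.

```python
from math import exp, factorial

def poisson_pmf(k, mu):
    """Probability of exactly k events when the average count is mu."""
    return exp(-mu) * mu**k / factorial(k)

# Houses: average 5 per day, exactly 4 tomorrow
p_four_houses = poisson_pmf(4, 5)

# Births: average 2.1 per hour, so 4.2 in 2 hours; exactly 6 births
p_six_births = poisson_pmf(6, 2.1 * 2)

print(round(p_four_houses, 4), round(p_six_births, 4))
```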

Skewness and Kurtosis

Skewness: Measure of deviation from symmetry
  • Shows up as a gap between the mean, median and mode
  • Right or left skewed
  • Skewness negative - longer tail towards the lower values (left skewed)
  • Skewness positive - longer tail towards the higher values (right skewed)

Kurtosis: Measure of peakedness (and tail weight) of the distribution
  • High kurtosis - tall, sharp peak with heavy tails (more extreme values).
  • Low kurtosis - flat peak with light tails.
  • Extreme case - uniform distribution (very flat, low kurtosis).
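
A minimal sketch of how skewness and kurtosis can be computed from the central moments of a data set. These are the simple moment-based formulas; statistical packages often apply small-sample corrections, so their output may differ slightly. The two data sets are made up.

```python
def central_moment(data, k):
    """Average of (x - mean)^k over the data."""
    m = sum(data) / len(data)
    return sum((x - m) ** k for x in data) / len(data)

def skewness(data):
    """Third central moment scaled by the std. dev. cubed."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Fourth central moment scaled by variance squared, minus 3 (normal = 0)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]           # hypothetical, evenly spread
right_skewed = [1, 1, 2, 2, 3, 20]    # hypothetical, one large value

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```

The evenly spread data has zero skewness and negative excess kurtosis (flatter than a normal curve), while the single large value pulls the skewness positive.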

Case Study: Skewness and Kurtosis

                                  Skewness                 Kurtosis
Service                    N      Statistic   Std. Error   Statistic   Std. Error
Long distance last month   1000   2.966       0.077        14.012      0.155
Toll free last month       475    3.465       0.112        26.735      0.224
Equipment last month       386    0.756       0.124        0.641       0.248
Calling card last month    678    2.150       0.094        7.572       0.187
Wireless last month        296    1.359       0.142        3.079       0.282

Equipment last month has the smallest skewness and kurtosis, so its distribution is the closest to normal of the five measures. The other services are strongly right skewed with heavy tails.

Confidence interval

  • A confidence interval is a rule for determining, from sample information, an interval that is likely to include a population parameter.
  • Suppose random samples are taken repeatedly from the population and an interval is calculated from each sample. A certain percentage of those intervals will contain the unknown parameter.
  • If the intervals are calculated at the 95% level, then about 95% of them contain the true value of the unknown parameter.
  • Such an interval is called a 95% confidence interval for the parameter (for example, a population proportion).
  • The upper and lower limits of the 95% confidence interval are the confidence limits.
    The confidence level is the probability value associated with a confidence interval.
    The probability value is (1 - alpha), often expressed as a percentage.
    For example, for alpha = 0.05 the confidence level is 0.95, that is, a 95% confidence level.
Data Requirement:
  • Confidence level
  • Statistic
  • Margin of error
  • Range of the confidence interval = sample statistic +/- margin of error.
  • The uncertainty associated with the confidence interval is specified by the confidence level.

How to construct a Confidence Interval

  • Identify a sample statistic - Choose the statistic that will be used to estimate a population parameter.
  • The statistic is generally the mean or the median or the mode in some cases.
  • Select the confidence level - It describes the uncertainty of sampling method.
  • Find the margin of error = Critical Value * Standard error of statistic.
  • Specify the confidence interval - The range of the confidence interval is defined by the following equation.
  • Confidence interval = sample statistic +/- margin of error.
    e.g. Margin of error = 1.86  and Sample statistic = 150
        Confidence interval = (150 - 1.86) to (150 + 1.86)
        Confidence interval = 148.14  to 151.86
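
The construction steps can be sketched as follows, assuming a normal critical value (for small samples a t critical value would normally be used instead). The standard error of 0.949 is a hypothetical figure chosen so that the margin of error comes out close to the 1.86 used in the example.

```python
from statistics import NormalDist

def confidence_interval(statistic, standard_error, confidence=0.95):
    """Return (lower, upper) = statistic -/+ critical value * standard error."""
    alpha = 1 - confidence
    critical = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for 95%
    margin = critical * standard_error
    return statistic - margin, statistic + margin

# Sample statistic 150 from the example; standard error is assumed
low, high = confidence_interval(150, 0.949)
print(round(low, 2), round(high, 2))
```

Raising the confidence level to 0.99 widens the interval, because the critical value grows.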

2.3 Tests of Significance 

  • Tests used in assessing the evidence in favor of or against a given assumption
  • Begins with a Null Hypothesis, Ho
  • Tests either fail to reject the null hypothesis, or reject it in favor of an Alternate Hypothesis, Ha
  • Two types of tests:
    • One sided tests
    • Two sided tests
  • Results decided by calculating the "p - value"
  • The p-value is the probability that the test statistic takes a value at least as extreme as the observed value, given that the null hypothesis is true.
  • Interpretation:
    • If p-value is less than the significance level alpha, reject the null hypothesis.
    • General values of alpha are 0.05, 0.01.
  • General Assumptions:
    • The distribution is almost normal
    • The samples have almost equal variances.

One sided hypothesis testing

  • Muo = null value
  • Null hypothesis Mu = Muo
  • Alternative hypothesis: Mu < Muo or Mu > Muo
Example: Given a sample of heights of 100 males in New York, decide whether the height has increased in general from a given average height of 5 feet 9 inches.
  • Null Value: Muo = 5 feet 9 inches (5.75 feet)
  • Null Hypothesis: Mu = 5.75 feet
  • Alternative Hypothesis: Mu > 5.75 feet
Using one of various hypothesis tests, calculate "p-value" and reject null hypothesis if p-value is less than 0.05.

Two sided hypothesis testing

  • Muo = null value
  • Null Hypothesis: Mu = Muo
  • Alternative hypothesis: Mu <> Muo
Example: Given a sample of heights of 100 males in New York, decide whether the height has changed (increased or decreased) in general from a given average height of 5 feet 9 inches.
  • Null Value: Muo = 5 feet 9 inches (5.75 feet)
  • Null Hypothesis: Mu = 5.75 feet
  • Alternative Hypothesis: Mu <> 5.75 feet
Using one of various hypothesis tests, calculate p-value and reject null hypothesis if p-value is less than 0.05.
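
To make the difference between the one sided and two sided p-value concrete, here is a sketch using a one sample z test. The sample mean (69.6 inches) and the known standard deviation (3 inches) are hypothetical; only the null value of 69 inches (5 feet 9 inches) and n = 100 come from the example.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(sample_mean, mu0, sigma, n):
    """Return the z statistic with one sided (Mu > Muo) and two sided p-values."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    p_one_sided = 1 - std_normal.cdf(z)               # Ha: Mu > Muo
    p_two_sided = 2 * (1 - std_normal.cdf(abs(z)))    # Ha: Mu <> Muo
    return z, p_one_sided, p_two_sided

# Heights in inches; sample mean and sigma are assumed values
z, p_one, p_two = one_sample_z(69.6, 69.0, 3.0, 100)
print(round(z, 2), round(p_one, 4), round(p_two, 4))
```

The two sided p-value is twice the one sided one, so the same data give weaker evidence when the direction of the change is not specified in advance.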

2.4 Various Tests of Significance

  • One Sample z test - Used to compare the mean of a sample with a given standard.
  • Two Sample z test - Used to compare the means of two groups.
    The population standard deviation must be known to calculate the z statistic.
    The z test is generally used when the sample size is greater than 30.
  • t test - Also used with means, but the standard deviation need not be known; it is estimated from the sample. The t test is preferred when the sample size is less than 30. Like the z test, it can be a one sample, two sample or paired test.
  • One Sample t test - Used to compare the mean of a sample with a given standard.
  • Two Sample t test - When the compared groups are independent, e.g. to compare the marks of students of two different schools.
  • Paired t test - When the compared groups are paired, e.g. to compare the marks of students of the same school before and after a training class.
  • Chi-Squared test - The goodness-of-fit test is used to check whether there is a difference between the observed values and the values expected under a particular hypothesis.
  • F test - Analysis of Variance (ANOVA) - To compare the means of two or more groups by analyzing variances. The most widely used F test is ANOVA.
  • F test - Regression - A less common use of the F test is in regression analysis.
In all these tests the null hypothesis states that there is no difference between the means or variances, and the alternative hypothesis suggests otherwise.

Chi-Squared Tests

Compare the observed results against an expected result based on a hypothesis
Steps:
  • State the null hypothesis
  • Prepare the contingency table for the variable
  • Determine the expected results
  • Calculate the chi-squared values
  • Calculate the degree of freedom
  • Based on the above, calculate the p-value
  • If p-value <0.05, reject the null hypothesis
Test of independence:
  • Verify if two variables are independent
  • Same steps as above

Case Study - Chi Squared Test

A city has a newly opened nuclear plant, and there are families staying dangerously close to the plant. A health safety officer wants to take this case up to provide relocation for the families that live in the surrounding area. To make a strong case, he wants to prove with numbers that exposure to radiation is leading to an increase in the diseased population. He formulates a contingency table of exposure and disease.

Exposure Disease Yes Disease No Total
Yes 37 13 50
No 17 53 70
Total 54 66 120

Does the data suggest an association between the disease and exposure?

Steps:
  • Calculate the number of individuals of exposed and unexposed groups expected in each disease category (yes or no) if the probabilities were the same.
  • If there were no effect of exposure, the probabilities should be same and the chi-squared statistics would have a very low value.
Proportion of population exposed = 50/120 = 0.42
Proportion of population not exposed = 70/120 = 0.58

Thus, expected values (using the unrounded proportions):
Population with disease = 54
Exposure Yes: 54 * (50/120) = 22.5
Exposure No: 54 * (70/120) = 31.5

Population without disease = 66
Exposure Yes: 66 * (50/120) = 27.5
Exposure No: 66 * (70/120) = 38.5

Exposure         Disease Yes   Disease No   Total   Proportion
Yes (actual)     37            13           50      50/120 = 0.42
Yes (expected)   22.5          27.5
No (actual)      17            53           70      70/120 = 0.58
No (expected)    31.5          38.5
Total            54            66           120

  • Calculate the Chi-Squared statistic
X^2 = Summation of  [(Observed Freq. - Expected Freq.)^2/ Expected Freq]
= ((37-22.5)^2 / 22.5) + ((13-27.5)^2 / 27.5) + ((17-31.5)^2 / 31.5) + ((53-38.5)^2 / 38.5) 
= 29.1
  • Calculate the degree of freedom:
df = (Number of rows -1) x (Number of columns -1)
df = (2-1) x (2-1)
df = 1
  • Calculate the p-value from the chi-squared table (found online).
    For Chi-Squared value 29.1 and degree of freedom = 1, from the table, p-value is < 0.001
  • Interpretation: There is less than a 0.001 (0.1%) chance of obtaining such a discrepancy between expected and observed values if there is no association, so the null hypothesis of no association is rejected.
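
The whole case study can be recomputed from the contingency table. This sketch uses only the standard library; the p-value line relies on a shortcut that holds only for 1 degree of freedom, where the chi-squared tail probability equals erfc(sqrt(x / 2)). For other degrees of freedom a table or a statistics library is needed.

```python
from math import erfc, sqrt

# Rows: exposure yes / no; columns: disease yes / no
observed = [[37, 13],
            [17, 53]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count = row total * column total / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Valid for df = 1 only
p_value = erfc(sqrt(chi_sq / 2))

print(expected, round(chi_sq, 1), df, p_value < 0.001)
```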

ANOVA

  • Analysis of Variance - used to compare more than two groups
  • Extension of the independent t-tests
  • Factor variable - variable defining the groups
  • Response variable - variable being compared
  • One way ANOVA
    • Groups of a single variable
    • E.g.: Is there a difference in students' marks based on the row in which they are seated - front / middle / back?
  • Two way ANOVA
    • Two independent variables
    • E.g.: Do race and gender affect a person's yearly income?

Case Study - One way ANOVA


  • Marks obtained in the same subject by three students from each of three different schools are given below.
  • Does the data suggest any association between school and marks?

School    A    B    C
Marks 1   82   83   38
Marks 2   83   78   59
Marks 3   97   68   55

The basic idea in ANOVA: Partition the total variation in the data into the variation between groups and the variation within groups.
Steps:

  • Calculate the means

School A: mean(82, 83, 97) = 87.3
School B: mean(83, 78, 68) = 76.3
School C: mean(38, 59, 55) = 50.7


  • Calculate the grand mean

Grand: mean(82, 83, 97, 83, 78, 68, 38, 59, 55) = 71.4


  • Calculating the variations

Sum of squared deviations of all observed values about the grand mean: SS_total = 2630.2
Sum of squared deviations of the three group means about the grand mean, weighted by group size: SS_between = 2124.2
Sum of squared deviations of observations about their own group mean, added across all groups: SS_within = 506.0
Note that SS_total = SS_between + SS_within.



  • Calculate the degree of freedom for every variance:

df_total = number of observations -1 = 9-1 = 8
df_between = number of groups -1 = 3 -1 = 2
df_within = number of observations - number of groups = 9 - 3 = 6


  • Calculate the Mean Squared Variances

Mean squared variance between groups: MS_between = SS_between / df_between = 2124.2/2 = 1062.1
Mean squared variance within groups: MS_within = SS_within / df_within = 506/6 = 84.3


  • Calculate the f-statistics

F-value = MS_between/MS_within = 1062.1/84.3 = 12.59


  • Calculate the p-value from the F-table

P-value for given f-value 12.59 and degree of freedom 2 and 6 is 0.007


  • Conclusion: since the p-value is less than alpha, we can conclude by rejecting the null hypothesis, that there is a difference in the marks obtained by students belonging to different groups.
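
The steps of the case study can be reproduced in a few lines. The p-value line uses a closed form that is valid only when the between-groups degrees of freedom equal 2, as here: P(F > f) = (1 + 2f / df_within)^(-df_within / 2). In general the p-value comes from an F table or a statistics library.

```python
groups = {"A": [82, 83, 97], "B": [83, 78, 68], "C": [38, 59, 55]}

all_marks = [m for marks in groups.values() for m in marks]
grand_mean = sum(all_marks) / len(all_marks)

def group_mean(marks):
    return sum(marks) / len(marks)

# Variation of group means about the grand mean, weighted by group size
ss_between = sum(
    len(g) * (group_mean(g) - grand_mean) ** 2 for g in groups.values()
)
# Variation of observations about their own group mean
ss_within = sum((m - group_mean(g)) ** 2 for g in groups.values() for m in g)

df_between = len(groups) - 1
df_within = len(all_marks) - len(groups)

f_value = (ss_between / df_between) / (ss_within / df_within)

# Closed form for the F tail probability, valid only for df_between == 2
p_value = (1 + 2 * f_value / df_within) ** (-df_within / 2)

print(round(ss_between, 1), round(ss_within, 1), round(f_value, 2), round(p_value, 3))
```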

2.5 Non Parametric Testing

  • Referred to as "distribution free", as they make no assumptions about the distribution of the data.
  • They have lower power than the parametric tests and hence are generally the second preference after the parametric tests.
  • These tests typically focus on the median rather than the mean.
  • They involve straightforward procedures like counting and ordering.
  • There is at least one non-parametric test for each parametric test, and they are classified into the following categories:
    • Tests of differences between groups (independent samples)
    • Tests of differences between variables (dependent samples)
    • Tests of relationships between variables
To express a relationship between two variables, one usually computes the correlation coefficient.
Non-parametric equivalents of the standard correlation coefficient are:
  • Spearman's R
  • Kendall's Tau
  • Coefficient Gamma
Appropriate non-parametric tests for the relationship between two categorical variables are the chi-squared test, the phi coefficient and the Fisher exact test. In addition, a simultaneous test for relationships between multiple cases is available: the Kendall coefficient of concordance. This test is often used to express inter-rater agreement among independent judges who are rating (ranking) the same stimuli.

Non Parametric Tests

Tests | Parametric | Non Parametric
One quantitative response variable | One sample t-test | Sign test
One quantitative response variable, two values from paired samples | Paired sample t-test | Wilcoxon signed rank test
One quantitative response variable, one qualitative independent variable with two groups | Two independent sample t-test | Wilcoxon rank sum or Mann-Whitney test
One quantitative response variable, one qualitative independent variable with three or more groups | ANOVA | Kruskal-Wallis

Correlation

Measure of association between variables
Correlation can be positive or negative, ranging between +1 and -1.
A value of +1, or any positive correlation, implies that if the value of the independent variable increases, the value of the response variable also increases.
Similarly, a value of -1, or any negative correlation, implies that if the value of the independent variable increases, the value of the response variable decreases.

Positive Correlation Example:
Earning and expenditure - more a person earns more he/she spends.
Negative Correlation Example:
Speed and time - As the speed of the vehicle increases the time taken to cover a given distance decreases.

Parametric - assumes normal distribution and homogeneous variance.
Pearson correlation
Non Parametric - no distributional assumption; works on ranks, so it suits ordinal variables.
Spearman correlation

Correlation Coefficient

r: correlation coefficient
-1: Perfectly Negative
+1: Perfectly Positive
0 - 0.2 : No or very weak association
0.2 - 0.4 : Weak association
0.4 - 0.6 : Moderate association
0.6 - 0.8 : Strong association
0.8 - 1 : Very strong to perfect association
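
Both coefficients can be computed by hand. The sketch below uses made-up data for the speed and time example: the time to cover an assumed 120 km at five different speeds. The rank helper assumes there are no tied values.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / spread

def ranks(values):
    """Rank each value from 1 (smallest); assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

speed = [20, 40, 60, 80, 100]            # km/h (hypothetical)
time_taken = [120 / s for s in speed]    # hours to cover 120 km

r_pearson = pearson(speed, time_taken)
r_spearman = spearman(speed, time_taken)
print(round(r_pearson, 3), round(r_spearman, 3))
```

The relationship is perfectly monotonic but not linear, so the rank-based Spearman coefficient reaches -1 while Pearson stays a little above it.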

Summary

  • Overview of Statistical Methods
  • Population, Samples &  Sampling Plan and Sampling Methods
  • Descriptive Statistics - Measure of Central Tendency and Measure of Dispersion
  • Probability Theory and Distributions
  • Confidence Interval
  • What are Tests of Significance
  • The process flow of hypothesis testing
  • One Sided and Two Sided Hypothesis Testing
  • Various Tests used in calculating p-value
  • What is Non-Parametric Testing and why it is used.
  • Non-parametric alternatives for the usual tests of significance


Friday, March 25, 2016

1 Business Analytics

Data Analytics

Analytics is the science of analysis, in which statistics, data mining, computer technology, etc. are used for analysis.
Analysis is the process of breaking down a complex object or data into simpler forms or more compact or better data for understanding.

Analysis is the science of wisely acquiring meaningful results from given data using various methods and technologies.
Aims at discovering patterns of variation from the given data.
It helps to understand the future from past data and uncertainty related to business.
It's a sophisticated process that uses statistics, mathematics and economics models to predict the future and prescribe strategies.
The processes include:

  • Gather Data
  • Organize Data
  • Analyze Data

Stages of Analytics:

  • Descriptive Analytics (Information) ~ How many students dropped out last year?
  • Diagnostic Analytics (Insight) ~ Why has the drop-out rate increased in the last one year?
  • Predictive Analytics (Insight) ~ Which students are more likely to drop-out?
  • Prescriptive Analytics (Decision) ~ Which student should I target to keep from dropping out?

Popular Tools used in Analytics

  • R
  • Revolution R
  • R Studio
  • Tableau
  • SAP HANA
  • Weka
  • KXEN
  • SAS

Role of a data scientist:



  • Inquisitive, can look at data and spot trends.
  • Bring out the stories hidden in data that help create more useful insights and solve business problems.
  • Work in sync with application developers to obtain relevant data for analysis.
  • Make an analytical plan in such a way that the results satisfy the business needs.
  • Come up with an effective data mining architecture and prepare suitable models
  • Respond to and resolve data mining performance issues
  • Generate reports that are affordable from business perspective

Data Analytics Methodology

  • Discovery 
  • Data Preparing
  • Model Planning
  • Deliver Results
  • Put into use

Problem Definition:

  • What is the problem?
  • What is it not?
  • We have this problem because?
  • We don't have a solution because?

Defining a problem:

  • State the problem in a general way.
  • Understand the nature of the problem
  • Survey the available literature
  • Go for discussions for developing ideas
  • Rephrase the problem into a working proposition.

Types of Data

  • Qualitative Data
    • Data expressed as groups or categories
    • Descriptive Data
    • e.g. Dividing a population into high medium and low height groups.
  • Quantitative Data
    • Data expressed as numbers
    • Definitive Data
    • e.g. The height of a person

Summarizing Data

  • Summarizing is the process of converting huge amounts of raw data into a format that can be easily analyzed.
  • Summaries differ on type of data; and can be descriptive or graphical
  • Numeric Data - Descriptive
    • Mean
    • Median
    • Mode
  • Numeric Data - Graphical
    • Box Plots
  • Categorical Data - Descriptive
    • Frequency distribution tables
  • Categorical Data - Graphical
    • Bar Charts
    • Histogram

Data Collection

  • Collect Relevant Data:
    Process of collecting relevant data that aids in solving the problem statement
  • Categorize the Data:
    Data Collection process needs to be defined and systematic.
  • Organize the Data:
    Observations need to be recorded and organized for optimal usefulness.

Data Collection Methods

Data Collection Methods fall broadly into two categories Primary and Secondary
  • Primary
    • Observation - Measuring the data and various attributes
    • Experiment - Subjects are divided into groups
    • Surveys - Questions and interviews help in collecting feedback and in studying characteristics of the population.
  • Secondary
    • Data which has already been gathered before the study and is available as already published facts and reports.

Data Dictionary

A Data Dictionary is a file that describes the structure of the database itself.
It includes details like:
  • Number of records
  • Name of each field
  • Characteristics of each field
  • Description of each field
  • Relationship between different fields
It helps in analyzing different data variables and their relationships with each other.

Outliers and their treatment

  • An outlier is a point or an observation that deviates significantly from the other observations.
  • Occurs due to experimental errors or "special circumstances"
  • Outlier detection tests to check for outliers
  • Outlier treatment
    • Retention
    • Exclusion
    • Other treatment methods

Sunday, March 13, 2016

4 Predictive Modeling Techniques

4.1 Predictive Modeling Techniques

Objectives

  • Understand regression analysis and types of regression models
  • Know and build a simple linear regression model
  • Understand and develop a logistic regression model
  • Learn cluster analysis, types and methods to form clusters
  • Know time series and its components
  • Decompose Seasonal and non-seasonal time series
  • Understand different exponential smoothing methods
  • Know the advantages and disadvantages of exponential smoothing
  • Understand the concept of white noise and correlogram
  • Apply different time series analysis like Box Jenkins, AR, MA, ARMA etc.
  • Understand all the analysis techniques with case studies

4.2 Regression Analysis

Regression analysis mainly focuses on finding a relationship between a dependent variable and one or more independent variables.
Predict the value of a dependent variable based on one or more independent variables.
Coefficient explains the impact of changes in an independent variable on the dependent variable.
Y = f(X, beta) Regression Models are generally denoted by this equation.
Where, Y is the dependent variable
X is the independent variable
beta is the unknown coefficient
Widely used in prediction and forecasting

Types of regression models

Regression Models
|-------Uni-variate
|       |-------Linear
|       |       |------Simple
|       |       \------Multiple
|       \-------Non Linear
\-------Multivariate
        |-------Linear
        \-------Non Linear
  • In Uni-variate Models there is a single response variable. It is the simplest form of statistical analysis.
  • Correspondingly, Multivariate Models are models with more than one response variable.
  • The Uni-variate and Multivariate Models can be further classified as Linear and Non-Linear Models.
  • A linear model fits a straight line (more generally, it is linear in its coefficients); any other model is considered non-linear.
  • The Uni-variate Linear Model is further divided into Simple and Multiple. Usually more than one independent variable influences the dependent variable. When only one independent variable is used, the regression is called a simple regression. When two or more independent variables are used, it is called a multiple regression.

4.3 Simple Linear Regression

  • It's a common technique to determine how one variable of interest is affected by another.
  • It is used for three main purposes:
    • For describing the linear dependence of one variable on the other.
    • For predicting values of one variable from values of the other, for which more data are available.
    • For correcting for the linear dependence of one variable on the other.
  • A line is fitted through the group of plotted data.
  • The vertical distance of a plotted point from the line gives its residual value.
  • The residual value is the discrepancy between the actual and the predicted value.
  • The procedure to find the best fit is called the least-squares method.

Linear Regression Model

The equation that describes how the dependent variable is related to an independent variable and an error term is called the regression model.
y = B0 + B1 x + e
Where, B0 and B1 are called the parameters of the model, and e is a random variable called the error term.
Estimating B0 and B1, and hence E(Y|X) = B0 + B1 X, means finding the best straight line that can be drawn through the scatter plot of Y vs X. This is done with Least Squares (LS) estimates.

Simple Linear Regression - Graphical understanding.


Diagram here depicts a graphical plotting of linear regression. 
  • The points in blue are the observed value of Y for the corresponding x values. 
  • The straight line is the linear model defined by linear regression model equation discussed in previous section. 
  • B1 is the slope of the equation, i.e. when x changes by one unit, y changes by B1. If the value is positive, x and y are positively correlated, and if it is negative, the two variables are negatively correlated.
  • B0 is the value of Y at X = 0, i.e. the intercept of the equation.
  • The point on the straight line at X = X0 is the predicted value of Y at X = X0. The difference between the observed value and the predicted value is the residual.

Process to build a regression model

  • Identify the target variable.
  • Identify the predictors
  • Data collection
  • Decide the relationship (simple data analysis and a scatter plot are used to identify this)
  • Fit the model (derive a mathematical equation to help predict the response variable)
  • Evaluate the model (check how well the fitted model predicts the outcomes)

Linear Regression Model Assumptions

  • The predictor variable x is non-random
  • The error term e is random
  • Error term follows normal distribution
  • Standard Deviation of error is independent of x.
  • The data being used to estimate the parameters should be independent of each other
  • If any of the above assumptions are violated, modelling procedure must be modified.
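
The assumptions above can be checked informally from the residuals of a fitted model. The sketch below is only one possible set of checks, run on R's built-in faithful data set (an assumption made here so that the example is self-contained).

```r
# Fit a simple linear regression on R's built-in faithful data set.
fit <- lm(waiting ~ eruptions, data = faithful)
res <- residuals(fit)

# Residuals vs fitted values: a patternless band around zero supports
# a linear relationship and a constant error spread.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line support normally
# distributed errors.
qqnorm(res)
qqline(res)

# Shapiro-Wilk test of normality; a small p-value casts doubt on
# the normality assumption.
shapiro.test(res)
```

Independence of the observations usually cannot be read off a plot alone; it follows mainly from how the data were collected.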

4.4 Coefficient of Determination - R^2

A measure of goodness of fit: how well does your model fit the data?


We will now look at how different values of the correlation coefficient r are interpreted. In simple linear regression R^2 is the square of r, so it always lies between 0 and 1.

  • In the first figure the line is perfectly horizontal and r is 0 (R^2 = 0), which implies no linear relationship.
  • In the second figure r is -1 (R^2 = 1), which implies a perfect negative linear relationship.
  • In the last figure r is +1 (R^2 = 1), which implies a perfect positive linear relationship.

4.5 How good is the model?

  • Based on the R^2 value, we can say how well the model explains the data, i.e. the percentage of the variation in the observations that is explained by the model.
  • The variation between observations that is not explained by the model is attributed to the error term or residual.
  • Suppose we have a case in which the R^2 value is 0.74. This means that 74% of the variance in the values of the dependent variable is explained by the model and the remaining 26% is unexplained, i.e. left in the residual or error term.
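
The "explained variance" reading of R^2 can be reproduced by hand. The following sketch, again assuming R's built-in faithful data set for illustration, computes R^2 as one minus the ratio of the residual sum of squares to the total sum of squares and compares it with the value reported by summary().

```r
# Simple linear regression on the built-in faithful data set.
fit <- lm(waiting ~ eruptions, data = faithful)

# Residual sum of squares: variation NOT explained by the model.
ss_res <- sum(residuals(fit)^2)

# Total sum of squares: variation of the response around its mean.
ss_tot <- sum((faithful$waiting - mean(faithful$waiting))^2)

# Share of the variation that the model explains.
r2_manual <- 1 - ss_res / ss_tot

# Both comparisons should return TRUE: the value matches summary(),
# and for one predictor it equals the squared correlation.
all.equal(r2_manual, summary(fit)$r.squared)
all.equal(r2_manual, cor(faithful$eruptions, faithful$waiting)^2)
```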

4.6 How to find linear regression equation?

SUBJECT   AGE(X)   GLUCOSE LEVEL(Y)      XY     X^2     Y^2
1             43                 99    4257    1849    9801
2             21                 65    1365     441    4225
3             25                 79    1975     625    6241
4             42                 75    3150    1764    5625
5             57                 87    4959    3249    7569
6             59                 81    4779    3481    6561
Sum          247                486   20485   11409   40022
We will now look at an example of finding a linear regression equation.
In our example we will look at two variables: Age as X and the corresponding Glucose Level as Y.
We are going to see how to fit a linear regression line for these variables.
The general equation for regression analysis is Y' = A + BX.
In order to calculate this equation manually we need to calculate three more columns:
XY, X^2 and Y^2. Then we calculate the sigma values by summing up each column.
ΣX = 247
ΣY = 486
Σ(XY) = 20485
ΣX^2 = 11409
ΣY^2 = 40022

To derive the regression equation we need to find the intercept A and the coefficient B of the independent variable (the slope).
To calculate the intercept A, use the formula:
A = [ ΣY.Σ(X^2) - ΣX.Σ(XY) ] / [ n.Σ(X^2) - (ΣX)^2 ]
= ( 486 * 11409 - 247 * 20485) / (6 * 11409 - 247^2)
= 65.14157152

Then we need to calculate the coefficient B of the independent variable.
To calculate B, use the formula:

B = [ n.Σ(XY) - ΣX.ΣY ] / [ n.Σ(X^2) - (ΣX)^2 ]
= ( 6 * 20485 - 247 * 486) / (6 * 11409 - 247^2)
= 0.385224983

From these values of A and B we obtain the final regression equation:
Y' = A + BX
= 65.14157152 + 0.385224983 * X
From the above equation we can calculate a future Y value by substituting the future X value.
Equivalently, once B is known the intercept can be computed as:
A = (ΣY - B.ΣX) / n
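
The hand calculation can be cross-checked in R. This sketch types in the six age and glucose values from the table, applies the summation formulas, and compares the result with lm(); the variable names are chosen here for illustration.

```r
# Age (X) and glucose level (Y) for the six subjects in the table.
age     <- c(43, 21, 25, 42, 57, 59)
glucose <- c(99, 65, 79, 75, 87, 81)
n <- length(age)

# Slope B and intercept A from the summation formulas.
B <- (n * sum(age * glucose) - sum(age) * sum(glucose)) /
     (n * sum(age^2) - sum(age)^2)
A <- (sum(glucose) - B * sum(age)) / n

# The same line fitted by least squares with lm().
fit <- lm(glucose ~ age)
coef(fit)

# Should return TRUE: lm() reproduces the manual intercept and slope.
all.equal(unname(coef(fit)), c(A, B))
```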

4.7 Commands to perform linear regression in R

R provides comprehensive support for linear regression. To fit a linear regression model in R we use the lm() function. We will use R's inbuilt help to check on the functionality of lm().
Refer: http://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html


lm is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (although aov may provide a more convenient interface for these).

The lm function takes various arguments.
lm(formula, data, subset, weights, na.action,
   method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
   singular.ok = TRUE, contrasts = NULL, offset, ...)

We are going to look at the arguments most frequently used in regression analysis.
formula
an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’.
data
an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called.

The formula in the lm model is specified using the tilde (~) symbol. The syntax is response ~ terms. The response is the numeric dependent variable and the terms are the independent (predictor) variables for the response. We can also use the plus (+) symbol to include more predictor terms.

The fitted lm object has various components. Two of the most important components are coefficients and residuals.
coefficients - a named vector of coefficients
residuals - the residuals, that is response minus fitted values.
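
To make the formula syntax and the two components concrete, here is a small sketch on R's built-in mtcars data set. The data set and the choice of predictors are assumptions for illustration only.

```r
# response ~ terms, with + adding a second predictor.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# coefficients: a named vector with the intercept and one slope
# per predictor.
coefs <- coef(fit)
names(coefs)

# residuals: response minus fitted values.
res <- residuals(fit)
manual_res <- mtcars$mpg - fitted(fit)

# Should return TRUE: the stored residuals match the definition.
all.equal(unname(res), unname(manual_res))
```

The accessor functions coef() and residuals() return the same values as fit$coefficients and fit$residuals.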

Demo: how to draw the regression line for the given data set in R.

The scatter plot is generally used to plot the quantitative variables and display them as geometric points inside a Cartesian diagram.

To do this let us take the faithful data set, which ships with R.
> head (faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
The data set contains two variables, eruptions and waiting. The eruptions column gives the duration of an eruption in minutes and waiting gives the waiting time to the next eruption in minutes.
First let us make a scatter plot of eruption duration and waiting intervals and then try to find out if there is any relationship between the variables.
> attach (faithful)
> plot (eruptions, waiting, xlab="Eruption Duration", ylab="Time Waited")

The resulting plot shows that there is a positive linear relationship between eruption duration and time waited.
This reflects the fact that longer waiting times tend to go with longer eruptions.
We can also fit a linear regression model with the lm function and then draw a trend line using the abline function.

> abline(lm(waiting~eruptions))
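
Beyond drawing the trend line, the fitted model can be used for prediction. The sketch below stores the model and predicts the waiting time for a hypothetical eruption lasting 3.5 minutes; that duration is an arbitrary value chosen for illustration.

```r
# Store the fitted model instead of passing it straight to abline().
# Using data = faithful avoids the need for attach().
fit <- lm(waiting ~ eruptions, data = faithful)

# New observation: a hypothetical eruption lasting 3.5 minutes.
new_obs <- data.frame(eruptions = 3.5)

# Predicted waiting time from the fitted line.
pred <- predict(fit, newdata = new_obs)

# The same prediction written out as intercept + slope * x.
manual <- coef(fit)[["(Intercept)"]] + coef(fit)[["eruptions"]] * 3.5
```

The column name in newdata must match the predictor name used in the formula, otherwise predict() cannot find the variable.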

4.8 Linear Regression to Predict Sales : Case Study 1

Chip Chops is a global ice-cream manufacturing company specializing in fruit and nut flavored ice-cream. They have a very wide customer base spanning almost all parts of the world. They want to find useful insights into individual and social consumption patterns so that they can make changes to their business that may yield a higher consumption rate of ice-cream per individual. They hired an R expert, Richard, to work on this and asked him to come up with insights that may help build their profitability and increase the consumption rate.

The data was readily available with the company, and the firm wanted Richard to work on a sample of the data and produce insights before trusting him with the project. So Richard now has a sample data set from the company that holds 4 variables. The first variable specifies the consumption of ice-cream per person and is numeric. The second variable specifies the average family income per week in US dollars. The third variable is the price of ice-cream per unit. The fourth variable specifies the average temperature experienced in the city in degrees Fahrenheit.

The company's aim is to attract more customers, increase the consumption rate and ultimately increase its sales profits. To help with this, Richard needs to find the important factors that affect the consumption rate.


In order to find this relationship he wants to perform a linear regression. This regression analysis will help in finding the relationship between the factors and in predicting the future consumption rate. In linear regression analysis, as we know, there is a dependent variable and one or more independent variables. In our case we want to explain the consumption rate, so it is declared as the dependent variable, and the factors it relies on are the independent variables. These are the remaining three variables: average income, price and temperature. Now we will perform the linear regression to see how they affect the consumption rate.

In order to perform linear regression in R, the first step is to load your data set into the R workspace.
> data <- read.csv("E:/RWD/SimpliLearn/Video2-Icecream.csv")
>   View(data)

We will now use the lm function to regress the dependent variable on the independent variables and store the result in a new variable. In our example:

> analysis <- lm(cons~income+price+temp,data=data)
> summary(analysis)

Call:
lm(formula = cons ~ income + price + temp, data = data)

Residuals:
      Min        1Q    Median        3Q       Max
-0.065302 -0.011873  0.002737  0.015953  0.078986

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.1973151  0.2702162   0.730  0.47179  
income       0.0033078  0.0011714   2.824  0.00899 **
price       -1.0444140  0.8343573  -1.252  0.22180  
temp         0.0034584  0.0004455   7.762  3.1e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.03683 on 26 degrees of freedom
Multiple R-squared:  0.719, Adjusted R-squared:  0.6866
F-statistic: 22.17 on 3 and 26 DF,  p-value: 2.451e-07

>

The results first show the call, i.e. the formula that was used for the regression analysis.
Next comes the residuals section, which provides the minimum, first quartile, median, third quartile and maximum of the residuals.
The most important interpretation is made from the coefficients section. This provides the intercept and the estimated coefficient of each independent variable.
From these results we construct a regression equation, and from it we can read off the relationship between the dependent and independent variables.
We can also predict future values from this equation.
The accuracy of these predictions depends on the size of the residuals.
Here the residuals are small. The smaller the residuals, the more accurate the predicted values will be.
From the obtained coefficients we can write the regression equation as:

cons = 0.197 + 0.0033 * income - 1.0444 * price + 0.0035 * temp + residuals

From this regression equation we can predict future consumption by plugging in the income, price and temperature.
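
As a rough sketch of such a prediction, the function below plugs values into the equation using the coefficients printed by summary(analysis). The function name predict_cons and the input values (income 85, price 0.27, temperature 60) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(analysis) output above.
b0       <-  0.1973151
b_income <-  0.0033078
b_price  <- -1.0444140
b_temp   <-  0.0034584

# Hypothetical helper: predicted consumption per person for given
# weekly income, unit price and temperature in Fahrenheit.
predict_cons <- function(income, price, temp) {
  b0 + b_income * income + b_price * price + b_temp * temp
}

# Prediction for made-up input values.
cons_hat <- predict_cons(income = 85, price = 0.27, temp = 60)
cons_hat
```

When the fitted object is still in the workspace, predict(analysis, newdata = data.frame(income = 85, price = 0.27, temp = 60)) performs the same calculation at full precision.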

The results also display the p-values. From the low p-values for the income and temp variables, and from the significance codes, we can see that these two variables are statistically significant. From this and their positive coefficients we can conclude that when people's income is higher and the temperature is higher, ice-cream consumption per person is higher. The coefficient of price is negative, which suggests that if the price of ice-cream goes down the demand goes up and people tend to buy more. However, its p-value (0.22) is above 0.05, so this sample does not establish the price effect as statistically significant.

The residual standard error is about 0.037, so the typical deviation of observed consumption from the fitted value is small.
The results also provide the Multiple R-Squared and Adjusted R-Squared values.
The Multiple R-Squared value is about 0.72, i.e. about 72% of the variance in the dependent variable is explained by the three given factors.

Finally the F-statistic and its p-value are displayed. The p-value obtained (2.451e-07) is less than 0.05, so we can reject the null hypothesis that all slope coefficients are zero and conclude that there is some linear relationship between the dependent and independent variables.

We can summarize the results as follows:
If the price of ice-cream is lowered, consumption per individual is expected to increase, although this effect is not statistically significant in this sample.
If the income of people and the temperature of a particular locality are higher, the demand for ice-cream and the consumption of ice-cream per individual are higher.
These insights will help the company's management make decisions on pricing and thereby improve profitability.
   

4.9 Linear Regression: Case Study 2

An analytics startup works on analytics for reverse logistics. The people employed by this company have different amounts of working experience and draw different salaries. The company decided to monitor the working skills of each employee and so introduced a format called score board rating. In this format each employee is given a score based on his or her working performance. The employee is given a higher score if the working performance for that period has been good. The company wants to analyse and confirm whether more experience and a higher score rating lead to a higher salary.

In order to analyze this we need data on the salary, experience and score rating of individual employees. Since it is a relatively new startup they have only 20 employees and we have the details of all of them. The first variable specifies the years of relevant work experience of the employee. The second variable specifies the score rating that the employee earned through performance. The third variable specifies the salary that the employee draws, in thousands of dollars. In order to find the relationship between these variables we need to fit a linear regression model in R. With this we can establish the relationship between the variables and also predict future values of the dependent variable from the regression equation.

Before performing regression analysis in R, we need to load the data set in R for which we are going to perform analysis:
> mydata <- read.csv("E:/RWD/SimpliLearn/Video3-rating.csv")
> head(mydata)
  experience score salary
1          4    78   24.0
2          7   100   43.0
3          1    86   23.7
4          5    82   34.3
5          8    86   35.8
6         10    84   38.0

Here in our example we need to check whether the salary depends on the score card rating and work experience. So the dependent variable is salary and the independent variables are score and experience.

> fit <- lm(salary~score+experience,data=mydata)
> summary(fit)

Call:
lm(formula = salary ~ score + experience, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.3586 -1.4581 -0.0341  1.1862  4.9102 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.17394    6.15607   0.516  0.61279    
score        0.25089    0.07735   3.243  0.00478 ** 
experience   1.40390    0.19857   7.070 1.88e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.419 on 17 degrees of freedom
Multiple R-squared:  0.8342, Adjusted R-squared:  0.8147 
F-statistic: 42.76 on 2 and 17 DF,  p-value: 2.328e-07

>

The result first shows the call, which gives the formula that was specified for the regression analysis.

Next comes the residuals section, which provides the minimum, first quartile, median, third quartile and maximum of the residuals.

The most important interpretation is made from the coefficients section.
This provides the estimate, standard error, t-statistic and p-value for the intercept and for both score and experience (the independent variables).
From these results we construct a regression equation that describes the relationship between the dependent and independent variables:

Salary = 3.174 + 1.404 * experience + 0.251 * score + residuals

Using this regression equation we can predict future values of salary from the experience and score rating of each employee.
The residuals should be small, as large residuals indicate a poorly fitting model.
The smaller the residuals, the more accurate the predicted values will be.
Since both coefficients are positive we can also conclude from the given data that if experience and score rating are higher then the salary drawn by the employee is also higher.
The residual standard error is 2.419 (in thousands of dollars), which is small relative to the salaries in the data, so the deviation between predicted and actual values is low.

This means that the model predicts salary reasonably accurately.

The t-values for experience and score are high and their p-values are very low, less than 0.05.

This shows that we can reject the null hypothesis and conclude that there is a relationship between the dependent and independent variables. The significance codes suggest the same.
Next the residual standard error and its degrees of freedom are displayed.

The results also contain the Multiple R-Squared and Adjusted R-Squared values.

The Multiple R-Squared value of about 0.83 (Adjusted R-Squared about 0.81) shows that the model fits the data well. In other words, about 83% of the variance in salary is explained by the experience and rating factors. Since both coefficients are positive, when the independent variables increase the predicted salary also increases.

In terms of the regression equation we can draw conclusions such as: salary is expected to increase by about $1,404 for each additional year of experience when score is held at the same level.
The insight we draw from this analysis is that in this analytics company an employee with more working experience and a higher score rating tends to draw a higher salary. Correspondingly, an employee with less experience and a lower score rating tends to draw a lower salary.

This confirms our earlier assumptions.

From this analysis we can also predict what salary should be offered to a newly joined employee of this company based on his or her working experience and score rating.
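
A minimal sketch of such a prediction follows, using the coefficients printed by summary(fit). The two new hires and their values are hypothetical; they differ only in experience so that the effect of one extra year is visible.

```r
# Coefficients copied from the summary(fit) output above
# (salary is measured in thousands of dollars).
b0    <- 3.17394
b_scr <- 0.25089
b_exp <- 1.40390

# Two hypothetical new hires with the same score and one year's
# difference in experience.
new_hires <- data.frame(experience = c(5, 6), score = c(85, 85))

# Predicted salary for each new hire from the regression equation.
new_hires$salary_hat <- b0 + b_scr * new_hires$score +
  b_exp * new_hires$experience

# Difference between the two predictions: the experience coefficient.
extra_year <- diff(new_hires$salary_hat)
```

With the fitted object available, predict(fit, newdata = new_hires) would give the same predictions without copying coefficients by hand.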

4.10 Case Study - Classification using Linear Regression

By now we know several uses of linear regression and why it is one of the most widely used models around. This case study looks at how linear regression can also be used for classification. Surprised? Let's see how. We will first install the "mlbench" package in R.

install.packages("mlbench")
library(mlbench)
data(PimaIndiansDiabetes2)
head(PimaIndiansDiabetes2)


We already know how to install a package. After installing the package, it needs to be loaded with the library() command. We will be using a data set called "PimaIndiansDiabetes2" from the package. The data describes 768 women of Pima Indian heritage who live in the area of Phoenix, Arizona, of whom 392 have complete records. They were tested for diabetes. The goal is to get a classification rule for the diagnosis of diabetes.
The variables are:
pregnant - number of times pregnant
glucose - plasma glucose concentration (glucose tolerance test)
pressure - diastolic blood pressure in mm Hg
triceps - triceps skin fold thickness in mm
insulin - 2-hour serum insulin in mu U/ml
mass - body mass index, i.e. weight in kg / (height in m)^2
pedigree - diabetes pedigree function
age - age in years
diabetes - positive or negative

Here we need to classify the data into positive or negative.
Let us look at the complete data. We see that there are many missing values (NA), which can be either removed or imputed.
For the sake of simplicity we will remove the rows that contain them.

For that we use the na.omit() function.

pidna <- na.omit(PimaIndiansDiabetes2)

This reduces our data set to 392 records. Now for linear regression we need numerical values. In our case all the columns have numerical values except the diabetes column, which is categorical. To convert it we use the as.numeric() function. as.numeric() returns 2 for pos and 1 for neg, so we subtract 1 to get 0 for negative and 1 for positive. We then use the head() function to check that the diabetes column has been converted to 0 or 1.

pidlm <- pidna
pidlm$diabetes <- as.numeric(pidna$diabetes)-1
head(pidlm)
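
The conversion can be seen in isolation on a toy factor. The vector below is made up for illustration and assumes the same level order (neg, then pos) as the diabetes column.

```r
# Toy factor with the level order neg, pos.
d <- factor(c("neg", "pos", "pos", "neg"), levels = c("neg", "pos"))

# as.numeric() on a factor returns the internal level codes:
# 1 for the first level (neg) and 2 for the second (pos).
codes <- as.numeric(d)

# Subtracting 1 gives the 0/1 coding used for the regression.
y <- codes - 1
y
```

The result depends on the level order, so it is worth checking levels() on the real column before relying on the subtraction.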

Now the data set can be used to apply linear regression.
From this data set we create a train set and a test set. The first 300 records form the train set and the remaining 92 records are kept in a variable called test.

test=pidlm[(301:392),]
train=pidlm[(1:300),]