Business Analytics using R: 3 Basic Analytic Techniques using R

Objectives

Get a basic introduction to R
Understand exploration of data
Explore data using R
Visualize data using R
Understand diagnostic analytics
Implement diagnostic analytics using R
Understand these concepts with the help of case studies

Introduction to R

Programming language for graphics and statistical computations
Available freely under the GNU public license
Used in data mining and statistical analysis
Included time series analysis, linear and non-linear modeling among others
Very active community and pacakage contributions
Very little programming knowledge necessary
Can be downloaded from http://www.r-project.org/
R-Studio is and IDE for programming in R that is freely available (optional)

Basic R

Packages

install.packages('package_name')
library(package_name)

Loading Data

data(dataset_name)
read and write functions
getwd() and setwd()
read and write functions use full path name
Example: read.csv("C:/Rtutorials/Sampledata.csv")

Assignment operator: "<-"
Help ?function_name

Data exploration using R

Basic functions for data exploration in R
Data stored as "data frames"
Data frames - tabular representation of data with rows and columns
There are other types of data as well, including tables, matrices, vectors and single values

Sample data frame (Sir Donald Fisher's Iris data set)

Sapal length	Sepal Width	Petal Length	Petal Width	Species
7.9	3.8	6.4	2	I.virginica
7.1	3	6.1	2.3	I.virginica
5.6	2.5	3.9	1.1	I.versicolor
5.6	2.8	4.9	2	I.virginica
5.5	4.2	1.4	0.2	I.setosa
5.5	3.5	1.4	0.2	I.setosa
7.1	3	5.9	2.1	I.virginica

The three species have 50 entries each forming total of 150 entries. Each row denotes features of a particular flower.

Viewing Data Frame

iris
head (iris)
tail (iris)
head (iris,2)
tail (iris,2)

Dimensions of data frame

View the dimensions of the data set

dim(iris)
[1] 150 5

View the number of columns

ncol(iris)
[1] 5

View the number of rows

nrow(iris)
[1] 150

Attribute of Data Frame

View column names/headers

names(iris)

View all attributes

attributes(iris)
class (iris)
[1] data.frame

class (iris$sepal.length)

Viewing Column data

View Column data
iris@Petal.Length
iris[,"Petal.Length"]

Viewing row data

Viewing particular row

iris[10:15,]

View particular rows of a single column

iris[10:15,"Petal.Length"]

Demo in R

install("forecast")

library("forecast")

data(iris) ... to load the data set iris

attach(iris) ... to save the data set iris in workspace

detach(iris) ... to detach the dataset iris from workspace

mydata<-iris ... we load the dataset into a user defined variable mydata

mydata=iris ... we can also use the equal sign instead

head(mydata) ... view the first 6 rows to check the structure of data set

CTRL + l ... clears the screen of current workspace

getwd() ... returns the current working directory we are in

setwd("c:/") ... to set the current working directory

data<-read.csv("c:/europe.csv",header=TRUE)

... to load a csv file into a user variable

head(data) ... check the structure of the data

data.frame(data)

... covert data into data frame

tail(data) ... returns last 6 rows

head (data,12)

... returns first 12 rows of data

dim(data) ... shows the dimension of data i.e. number of rows and columns

[1] 28 7

ncol(data) ... shows the number of columns

[1] 7

nrow(data) ... shows the number of rows

[1] 28

names(data) ... shows the column names

[1] "Country" "Area" "GDP" "Inflation" "Life.expect"

[6] "Military" "Pop.growth"

data$Country ... shows the data in column name 'Country'

data[,"GDP"] ... shows the data in column name 'GDP'

data[,3] ... shows the data in column number 3

data[7,] ... shows the data in row number 7

data[7:16,] ... shows the data from row number 7 to 16

Summarizing data in R

summary(data_frame)

summary(iris)

Summary function displays the following:
Min, 1st Qtr, Median, Mean, 3rd Qtr, Max Values for Numerical Data
For catagorical data like species it displays the table of different values and its frequencies
The same table can be displayed separately by using the following command:
table(dataframe$columnname)

table(iris$Species)

Same table can be displayed using the following command for a specific column.

summary(iris$Sepal.Lenght)

Individual Summaries

min(column_name)
max(column_name)
range(column_name) i.e. maximum-minimum
mean(column_name)
median(column_name)
IQR(column_name) i.e. Inter quartile range
sd(column_name) i.e. Standard Deviation
var(column_name) i.e. Variance

Subset Summary

aggregate() - function for column wise aggregation
Numerical summaries for a subset of data

aggregate(formula, data, function)

aggregate(.~Species, data = iris, FUN= mean)
Where,
Species is the column name for which aggregation is done
. indicates all columns
iris is the data set
mean is the function
aggregate(Sepal.Length~Species, data = iris, FUN= mean)
Species is the column name for which aggregation is done.
Sepal Length is the column which only will be aggregated.
iris is the data set
mean is the function
Apart from mean there are many other aggregate functions you can use like sum.

Data Visualization in R

plot() - generic function for plotting in R
The function can be used to plot a variety of graphs on a variety of data including data frames, time series and even vectors.
The plot function plots a scatter plot by default. Other plots can be created by using a type attribute. You can check the help section for more details on the plot function.
plot(iris)
Shown here is a plot of iris data. Plotting the data frame it creates a pair wise data plot for all the attributes in the data frame.

plot() function can be used to plot one variable against another

For example let us plot Sepal Length against Species

We will use a few optional attributes of the plot function.

main to specify the title of the plot

xlab the x axis label

ylab the y axis label

plot(iris$Sepal.Length,iris$Species,

mail = "IRIS Data",

xlab = "Sepal Length",

ylab = "Species")

Pie Charts

Pie charts to visualize the numerical proportions of different classes through sectors of the circle.

pie(table(iris$Species), main = "Iris Data by Species")

The table function is used to create a frequency table and then the pie function is called to create the chart of the table.
The main attribute is used to specify the title for the chart.

Here is an example chart showing the different species of the IRIS data. The circle is divided in the three equal sectors i.e. for the three species.

Bar Plots

Bar Plots are used to dipict values in vertical manner with the height being equivalent to the value that is been shown.

USPersonalExpenditure

barplot(USPersonalExpenditure, main = "US Personal Expenditure by Year", xlab = "Year", ylab = "Expenditure")

	1940	1945	1950	1955	1960
Food and Tobacco	22.2	44.5	59.6	73.2	86.8
Household Operation	10.5	15.5	29	36.5	46.2
Medical and Health	3.53	5.76	9.71	14	21.1
Personal Care	1.04	1.98	2.45	3.4	5.4
Private Education	0.341	0.974	1.8	2.6	3.64

Box Plot

Box plot are used to show the numerical data with their quartile ranges also called a box whisker plot. Boxes show the inter quartile region with the middle line equal to the median. The whiskers show the lower and upper quartiles. And the points show the outliers.

boxplot(Sepal.Length~Species, data = iris, main = "IRIS Data Set", xlab = "Species Type", ylab = "Sepal Lenght")

Histograms

Histograms are used to depict the frequency distribution data

islands data set
hist(iris$Sepal.Length, main = "IRIS Data Set", xlab = "Sepal Length", ylab = "Frequency")

In addition to these attributes plots can have other attributes such as:

x and y axis limits

color to points or bar or lines, etc.

They can be found on the help section for each plot.

Demo

plot(iris)
plot(x,y)
plot(iris$Sepal.Length, iris$Species)
pie(table(iris$Species))
barplot(table(iris$Species))
boxplot(Sepal.Length~Species,data=iris)
hist(iris$Sepal.Length)

Case Study - Data Exploration

Let now take a data set from Bank as an example for doing data exploration

Officer David from the sales department has been asked to do a study on the overall performance of the bank loans. As a part of investigation, David has asked his colleague to categorize the loans as seasoned or bad loans with respect to Age category.

The following data has been captured:

Age Group
Total number of loans
Number of good loans
Number of bad loans
Percentage of good loans
Percentage of bad loans

A look at the data:
The given table is of the bank loan data
The number of good loans and bad loans has been tabulated as per the age category.
There are 15 catagories of age in the table.
Given the total number of loans and number good and bad loans the percentage of good and bad loans has been tabulated.
Now we will look at different statistics and graphical visualization for the data.

Since it is a preloaded data in R we will not be using any read command however if you wish to save it to an excel file use the write.csv command:

The write.csv command takes two arguments the data frame name and the name of the file you wish to write it to.

> write.csv(mtcare,"MotorTrendsCarData.csv")

The file MotorTrendsCarData.csv will be saved in your working directory.

Since it is a preloaded data in R we will not be using any read command however if you wish to save it to an excel file use the write.csv command:

The write.csv command takes two arguments the data frame name and the name of the file you wish to write it to.

> write.csv(mtcare,"MotorTrendsCarData.csv")

The file MotorTrendsCarData.csv will be saved in your working directory

First let us view the data set.

>mtcars

mpg cyl disp hp drat wt qsec vs am gear carb

Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4

Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4

Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1

Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1

Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2

Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1

Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4

Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2

Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2

Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4

Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4

Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3

Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3

Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3

Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4

Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4

Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4

Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1

Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2

Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1

Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1

Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2

AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2

Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4

Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2

Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1

Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2

Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2

Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4

Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6

Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8

Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Let us find the class of the data set in R

> class(mtcars)

[1] "data.frame"

From the output we can infer that the data is stored as a data frame.

Let us confirm the number of rows and columns in the data set.

> dim (mtcars)

[1] 32 11

The output shows 32 and 11 with a space in between indicating 32 rows and 11 columns

Let us explore the data in detail

Use head(mtcars) to explore the first 6 rows

> head (mtcars)

mpg cyl disp hp drat wt qsec vs am gear carb

Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4

Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4

Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1

Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1

Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2

Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

Though we already established the types of the variables let us confirm it by looking at this data.

We can see that vs or am has either 0 or 1 in the data and thus has been rightly classified as catagorical data.

The number of cyl gears and carb are classified as discrete numerical values and rest of the data are continuous numeric format.

To look at all the types if cars

> attributes(mtcars)

$names

[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"

$row.names

[1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive"

[5] "Hornet Sportabout" "Valiant" "Duster 360" "Merc 240D"

[9] "Merc 230" "Merc 280" "Merc 280C" "Merc 450SE"

[13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood" "Lincoln Continental"

[17] "Chrysler Imperial" "Fiat 128" "Honda Civic" "Toyota Corolla"

[21] "Toyota Corona" "Dodge Challenger" "AMC Javelin" "Camaro Z28"

[25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2" "Lotus Europa"

[29] "Ford Pantera L" "Ferrari Dino" "Maserati Bora" "Volvo 142E"

$class

[1] "data.frame"

The first section displays the column names.
The second section has the row names, which consists of the model names of all 32 cars
Finally the class of the variables displayed. i.e. data frame.

Let us find what values the data field gear contains
> mtcars$gear
[1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
>
All the values of the gear column are displayed.
You can see the values 3 4 and 5 are repeated in the data set.
Similarly you can check the number of cyl and carb too.

Suppose you would like to know the attributes of a particular car. For example the "Ferrari Dino". We will use the format to view the row data here.

> mtcars["Ferrari Dino",]
mpg cyl disp hp drat wt qsec vs am gear carb
Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
>

You can now see the attributes of the "Ferrari Dino" displayed in a single row.

Now that we have explored the data we will now look at summarizing the data to get some meaningful insights.
Before doing this let us first convert the data type for "vs" and "am" to factor.
They are numeric by default.
mtcars$am<-as.factor(mtcars$am)
mtcars$vs<-as.factor(mtcars$vs)
summary(mtcars)

To display all the summary statistics.
From the first section we can infer that the maximum miles per gallon(mpg) is 33.90 and the average is 20.09. The minimum mpg is 10.40.
Similarly, we can note the max, min and other statistics for the rest of the numeric data.
For the categorical data the number of entries in each group is displayed.
For example for the am column it can be inferred that out of 32 cars 19 are automatic and 13 are manual.

What if we need to know which car among the 32 gives the best milage?
The which.max() function is used to return the index of the maximum value in the data set.
> mtcars[which.max(mtcars$mpg),]
which.max() returns the index of the maximum value for mpg variable and attaching that with the mtcars variable displays the row with the maximum value for mileage.
You can see that "Toyota Collolla" has the maximum milage in the given set of cars.

Similarly the row with the minimum value can be found using the which.min() function.
Let us find the lightest car in the given set of cars.
> mtcars[which.min(mtcars$wt),]
We now know that the "Lotus Europa" is the lightest of the cars.

The commands mentioned now can thus be used to find the maximum or minimum of any variable and the corresponding car data.

Suppose we want to find the milage of the cars with three gears.
We will use the aggregate() function to aggregate the mpg variable by the variable gear.
and the aggregation function will be average.
> agrregate(mpg~gear, data=mtcars, FUN="mean")

The output displays the two columns the number of gears and the means of the mileages.
We can infer that the cars with 3 gears have the mileage of 16.1 mpg.The cars with 4 gears are the best in terms of milage, giving the highest average mileage.
The aggegate() function can be similarly used to find the average speed of an automatic transmission vehicles and so on.

Following mathematical exploration techniques we will now visualize data and try to support the inferences made through the earlier commands.

1. Let us start with a scatter plot that includes all the factors in the data set being plotted against each other.
> plot(mtcars)
A new window appears with a plot with all the variables plotted against every other variable in the data set.

2. This creates a very generic plot of the variables. And it is hard to make any useful inferences from this plot. Suppose we want to know how the horse power (hp) is related to the displacement (disp). Let us use the plot function with formula hp~disp to get the corresponding plot.
>attach(mtcars)
>plot(hp~disp)
We can infer from the scatter plot that the horse power even though gradually as the displacement increases. You can see one data point that might occur as outlier. The horse power is quite high for large displacement vehicle. This shows that there are other factors that have an effect on horse power.

3. Let us perform a similar plot with cyl verses the milage.
> plot(cyl~mpg)
You can see that as the number of cylinders increses the milage decreases and the effect is quite visible.
This plotting can be done across every combination of variables and such inferences can be obtained. Though it is statistically possible to create plots for any two variables during a case study it is advised to understand the data before plotting them. For example a plot of the employee ID versues age will work in R that will not help researcher getting any meaningful insights since they are logically unrelated.

4. Lets work on the categorical variables. We have the am variable which has two values 0 for automatic and 1 for manual transmissions. The summary statistics will show us the number of automatic and manual cars. But if we have thousands of data it will be difficult to obtain a ratio of two categories. You can graphically view the ratio using the pie charts as seen in the previous lessons. We will use the pie function. We will create a table of the am variable using table(mtcars$am) and pass it as an argument to the pie function.
>pie(table(mtcars$am)
From the chart it is seen that almost 60% of the cars are roughly an automatic cars in this case.

5. Now let us do a very simple pie chart plotting. Let us consider plot for am data. am is for automatic or manual property of cars. So if it is needed to find if number of automatic given by 1 is more or less than number of manual this simple plot can be done.
Type some name of the data frame say auto to extract the data of am in the data set of mtcars.
>auto=mtcars$am
Then it is needed to make a table. So make another data frame to assign table auto.
>matic=table(auto)
Then for the pie plot type:
> pie(matic)
Now from the pie plot it is evident that the number of manual is more than number of automatic.

6. Similarly create a pie chart for number of gears.
> auto=mtcars$gears
> matic=table(auto)
> pie(matic)
We can infer that almost half the cars have 3 gears. The other half is distributed between 4 and 5 geared automobiles, with cars having 5 gears having the least number.

7. The contineous variables can be viewed using boxplot as we have seen earlier. Let us create the box plot of the horse power (hp) grouped by the number of cylinders.
> boxplot(hp~cyl, data=mtcars)
You can see that the horse power increases as the number of cylinders increase.
For cars with 8 cylinders there is a lot of variation as can be seen the wider box and whiskers in the plot. Out of the three cars with 6 cylinders have very less variation though there seems to be an outlier as seen by the dot in the graph. It can also be seen that the median of the horsepower of the cars with 4 and 6 cylinders do not differ much.

8. Let us make few bar plots and try to make sense of the graphs. Suppose you want to know the number of cars grouped by the number of gears. To know this use table(mtcars$gear) to create a table of the counts and pass this as an argument to the bar plot function.

> table(mtcars$gear)
> barplot(table(mtcars$gear))
From the plot we can find that maximum number of cars i.e. 15 cars have 3 gears, 12 cars have 4 gears and 5 cars have 5 gears.
Similarly this can be done for number of cylinders.
> barplot(table(mtcars$cyl))
Here we can see that most cars in a data set have 8 cylinders and the least cars have 6 cylinders. This inference from the bar plot can be done for all the street grouping factors.

9.As the final discriptive statistic let us make the histogram plot for the horse power of the cars.
> hist (mtcars$hp)
The histogram divides the horse power into the bucket of 50 units starting at 50 and ending at 350. Thus having 6 bars. We can see that most cars fall into the 100 to 150 horse power range and only 2 cars have horse power of above 250

This concludes the case study on descriptive statistics using R.

Correlation:

Let us look at the function to test the correlation between two variables.
- Class of statistical relationship between variables that forms any form of dependence.
For example is there a correlation between the height of parents and their offspring.
In the example given we try to find if there is a correlation between the Sepal Length and Width of a flower.
- cor.test(col1,col2)
In R correlation can be calculated using the cor.test() function.
By default the function calculates the Pearson's correlation coefficient.
Let's look at how to interpret the results.

>attach (iris)
>cor.test(Sepal.Length, Sepal.Width)
Pearson's product-moment correlation.
data: Sepal.Length and Sepal.Width
t = -1.4403, df = 148, p-value=0.1519
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.27269325 0.04351158
sample estimates:
cor
-0.1175698

The important value in this calculation is the p-value which is calculated using the t statistics and the degrees of freedom. As seen in earlier chapters if the p-value is less than 0.05 we can conclude that the null hypothesis is rejected. i.e. There is no correlation between the two variables. The correlation coefficient is given as 0.1175698. This means that there might be a negative correlation between the two variables. But since the p-value is quite high. We conclude that the result is not significant. i.e. the correlation is almost 0.

Example of correlation in R

In the given data set we are going to perform a correlation test two variables on a data set to acertain the significance of the relation in values of the variables. We will use the anorexia data set from the mass package for the correlation.
Load the package by typing:
> library("MASS")

> anorexia
Treat Prewt Postwt
1 Cont 80.7 80.2
2 Cont 89.4 80.1
3 Cont 91.8 86.4
4 Cont 74.0 86.3
5 Cont 78.1 76.1
6 Cont 88.3 78.1
7 Cont 87.3 75.1
8 Cont 75.1 86.7
9 Cont 80.6 73.5
10 Cont 78.4 84.6
11 Cont 77.6 77.4
12 Cont 88.7 79.5
13 Cont 81.3 89.6
14 Cont 78.1 81.4
15 Cont 70.5 81.8
16 Cont 77.3 77.3
17 Cont 85.2 84.2
18 Cont 86.0 75.4
19 Cont 84.1 79.5
20 Cont 79.7 73.0
21 Cont 85.5 88.3
22 Cont 84.4 84.7
23 Cont 79.6 81.4
24 Cont 77.5 81.2
25 Cont 72.3 88.2
26 Cont 89.0 78.8
27 CBT 80.5 82.2
28 CBT 84.9 85.6
29 CBT 81.5 81.4
30 CBT 82.6 81.9
31 CBT 79.9 76.4
32 CBT 88.7 103.6
33 CBT 94.9 98.4
34 CBT 76.3 93.4
35 CBT 81.0 73.4
36 CBT 80.5 82.1
37 CBT 85.0 96.7
38 CBT 89.2 95.3
39 CBT 81.3 82.4
40 CBT 76.5 72.5
41 CBT 70.0 90.9
42 CBT 80.4 71.3
43 CBT 83.3 85.4
44 CBT 83.0 81.6
45 CBT 87.7 89.1
46 CBT 84.2 83.9
47 CBT 86.4 82.7
48 CBT 76.5 75.7
49 CBT 80.2 82.6
50 CBT 87.8 100.4
51 CBT 83.3 85.2
52 CBT 79.7 83.6
53 CBT 84.5 84.6
54 CBT 80.8 96.2
55 CBT 87.4 86.7
56 FT 83.8 95.2
57 FT 83.3 94.3
58 FT 86.0 91.5
59 FT 82.5 91.9
60 FT 86.7 100.3
61 FT 79.6 76.7
62 FT 76.9 76.8
63 FT 94.2 101.6
64 FT 73.4 94.9
65 FT 80.5 75.2
66 FT 81.6 77.8
67 FT 82.1 95.5
68 FT 77.6 90.7
69 FT 83.5 92.5
70 FT 89.9 93.8
71 FT 86.0 91.7
72 FT 87.3 98.0

We will use the two variables of the data set i.e. (Prewt, Postwt)
Run the command:
> attach(anorexia)
> cor.test(Prewt, Postwt)

Pearson's product-moment correlation

data: Prewt and Postwt
t = 2.9488, df = 70, p-value = 0.004334
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1091425 0.5237424
sample estimates:
cor
0.3324062

The output shows the name of the test i.e. "Pearson's product-moment correlation" in this case followed by a t value, degrees of freedom and the p-value. Then it tells you about the alternate hypothesis which in our case is "true correlation is not equal to 0". Next it shows the 95% confidence values of the correlation and finally the mean of the correlation values.

Case Study - Correlation

An Icecream company maintains a database of total sales per day and the corresponding temperature every day. Given the data for a particular week, the management wants to find whether temperature is a factor that could potentially cause a difference in their sales.

Sales Temperature

$412 13.5

$615 16.8

$212 11.4

$700 17

$500 14

$456 15.3

$300 13.2

Looking at the data set we can infer that there are two variables sales and temperature. Here sales represents the value of total sales that happens in the icecream company each day and temperature represents the temperature this is coded for the corresponding day measures in celcius. We would be using correlation to check whether there is any relationship between sales and temperature for this company.

Firstly save the data into a csv file.

1. Import the file in R

> mydata<-read.csv("icecream.csv")

You can check if the data has been stored properly

> mydata

Sales Temperature

1 412 13.5

2 615 16.8

3 212 11.4

4 700 17.0

5 500 14.0

6 456 15.3

7 300 13.2

The data will have 7 entries across two columns.

Before working on mydata attached the mydata dataset to the session of mydata.

> attach(mydata)

The correlation coefficient is now determined by using the command:

> cor.test(Sales,Temperature)

Pearson's product-moment correlation

data: Sales and Temperature

t = 6.6333, df = 5, p-value = 0.001173

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

0.6792990 0.9924498

sample estimates:

cor

0.9476071

The correlation result is displayed. It can be inferred that the "Pearson's product-moment" test has been performed. The test can be changed by using the method attribute. The t-value for these variables are calculated as 6.6333. The degrees of freedom is 5 and the p-value is 0.001173. The basic alternative hypothesis is displayed and the 95% confidence interval is calculated and displayed. "Pearson's correlation coefficient" is displayed in the last line and equals 0.9476071. From the earlier lessons we know that the positive sign in the correlation coefficient implies that they both are positively correlated i.e. an increase in temperature relates to increase in sales. The value is close to 1 which suggest that association between the two variables is very strong.

ANNOVA (Analysis of Variance)

Analysis of Variance (ANNOVA)
ANNOVA is used to compare the means of different groups.
aov() - generic method to implement Analysis of Variance

Here we show a simple example of one way ANNOVA. To illustrate this we use the insect sprays data set. This data contains the insect count after using 6 different sprays. Here the null hypothesis would be there is no difference between insects by using different sprays.

> InsectSprays
count spray
1 10 A
2 7 A
3 20 A
4 14 A
5 14 A
6 12 A
7 10 A
8 23 A
9 17 A
10 20 A
11 14 A
12 13 A
13 11 B
14 17 B
15 21 B
16 11 B
17 16 B
18 14 B
19 17 B
20 17 B
21 19 B
22 21 B
23 7 B
24 13 B
25 0 C
26 1 C
27 7 C
28 2 C
29 3 C
30 1 C
31 2 C
32 1 C
33 3 C
34 0 C
35 1 C
36 4 C
37 3 D
38 5 D
39 12 D
40 6 D
41 4 D
42 3 D
43 5 D
44 5 D
45 5 D
46 5 D
47 2 D
48 4 D
49 3 E
50 5 E
51 3 E
52 5 E
53 3 E
54 6 E
55 1 E
56 1 E
57 3 E
58 2 E
59 6 E
60 4 E
61 11 F
62 9 F
63 15 F
64 22 F
65 15 F
66 16 F
67 13 F
68 10 F
69 26 F
70 26 F
71 24 F
72 13 F
> fit <- aov(count~spray, data=InsectSprays)
> summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
spray 5 2669 533.8 34.7 <2e-16 ***
Residuals 66 1015 15.4
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>

aov() takes the fist attribute as dependent variable and independent variable separated by tilda(~) here it is count~sprays, the data attribute specifies where the data has to be taken from:
> aov(count~spray, data = InsectSprays)
Call:
aov(formula = count~spray, data = InsectSprays)
Terms:
Spray Residuals
Sum of Squares 2668.833 1015.167
Deg. of Freedom 5 66

Residual standard error: 3.921902
Estimated effects may be unbalanced

> fit < aov(count~spray, data = InsectSprays)

> summary(fit)
Df Sum Sq Mean Sq F Value Pr(>F)
spary 5 2669 533.8 34.7 <2e-16 ***
Residuals 66 1015 15.4
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After fitting the aov model the summary function is used to display the result. It can be seen that the p-value that is the last column is lesser than 2e-16 which is the least positive number represented by R. Since p-value is almost 0. We can reject the null hypothesis. And thus the inference is that there are difference in results on using different sprays.

Now let us perform ANNOVA in R using the InsectSpary data set.
For ANNOVA a model needs to be fitted where we have to decide on a target variable and one or more independent variables.
1. In our case let us take count as the target variable and spray as the independent variable.
2. For fitting the model use aov() function and save it in a user defined model name say fit.
3. To check the model details run the command: summary(fit)
4. The result shows a table containing 5 major columns that are in order:
Df Degrees of Freedom
Sum Sq Sum of Squares
Mean Sq Mean Squared Error
F value F value
Pr(>f) P value

For each of the independent variables used in the model. Finally the significance of the variables can be checked from the P-value in the table and the important variables can then be used for further advanced modelling.

Case Study - ANNOVA

A pharmaceutical company has tested three formulations of a body pain for athletes. There are 27 volunteers selected and 9 were assigned randomly to one of the medications. The athletes were instructed to take the medication during body pain and report the pain in a scale of 1 to 10. (10 being too much of pain).
Med_1 Med_2 Med_3
4 6 6
5 8 7
4 4 6
3 5 6
2 4 7
4 6 5
Here the table shows the means rating across each athlete for each medication.
The question the company wants to answer is, is there a significant difference in the effect of different medication.
Since there are more than two different groups we will use ANNOVA to test the significance differences between means of all three medications.

1. Save the file in csv format
Import the data using the following command and save it in mydata variable.
> mydata<-read.csv("medication.csv")
> mydata
Med_1 Med_2 Med_3
1 4 6 6
2 5 8 7
3 4 4 6
4 3 5 6
5 2 4 7
6 4 6 5
7 3 5 6
8 4 8 5
9 4 6 5

2. Alternatively you can use the head(mydata) to display the first 6 rows.

3. To apply R's ANNOVA function the data has to be in a single vector format. To do so let us convert first 3 columns of mydata into a single vector. Use the c function to do the conversion after attaching the data to this session. Save it in a variable painrating.
> attach(mydata)
> painrating<-c(Med_1,Med_2,Med_3)
Now type painrating to see if the vector has been properly stored.
> painrating
[1] 4 5 4 3 2 4 3 4 4 6 8 4 5 4 6 5 8 6 6 7 6 6 7 5 6 5 5

4. Now the medication type for each entry data needs to be specified. Create a vector that consists of the value of "med_1" for the first 9 entries, "med_2" for the next 9 entries and "med_3" for the last 9 entries. Let us save it in a variable called types. The rep("string",n) function can be used to repeat a given string for a given number of times. To create the types vector we will use the c() function.

> types<-c(rep("med_1",9),rep("med_2",9),rep("med_3",9))
> types
[1] "med_1" "med_1" "med_1" "med_1" "med_1" "med_1" "med_1" "med_1" "med_1"
[10] "med_2" "med_2" "med_2" "med_2" "med_2" "med_2" "med_2" "med_2" "med_2"
[19] "med_3" "med_3" "med_3" "med_3" "med_3" "med_3" "med_3" "med_3" "med_3"
>

5. Finally a data frame needs to be created that contains the medication type and its corresponding pain rating.

Let us save the data frame in a variable called bodypainrating.

> bodypainrating<-data.frame(painrating,types)

You can see the dataframe using bodypainrating by typing:

> bodypainrating

painrating types

1 4 med_1

2 5 med_1

3 4 med_1

4 3 med_1

5 2 med_1

6 4 med_1

7 3 med_1

8 4 med_1

9 4 med_1

10 6 med_2

11 8 med_2

12 4 med_2

13 5 med_2

14 4 med_2

15 6 med_2

16 5 med_2

17 8 med_2

18 6 med_2

19 6 med_3

20 7 med_3

21 6 med_3

22 6 med_3

23 7 med_3

24 5 med_3

25 6 med_3

26 5 med_3

27 5 med_3

Now we have the data frame ready.

6. Before using ANNOVA let us visualize the data in a graphical format. We will be using the plot function to plot the rating as a function of the medication type.

> boxplot(painrating~types, data=bodypainrating)

The boxplot of the ratings is displayed in the new window. Note that there is an outlier under medication 2 rating entries. It can also be noted that the median for medications 2 and 3 are the same and the median for medication 1 is significantly lower than the other two.

7. Use the aov() function to do the ANNOVA analysis.

> annovaresult<-aov(painrating~types, data=bodypainrating)

> summary(annovaresult)

Df Sum Sq Mean Sq F value Pr(>F)

types 2 28.22 14.111 11.91 0.000256 ***

Residuals 24 28.44 1.185

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

6. In the previous result you can see that the value of F statistics is 11.91 with a P-value i.e. 0.000256. Since the P-value is significantly lower than the standard value of 0.05 we can infer that the null hypothesis needs to be rejected. In the context of this study we can conclude that different medication formulations have different impacts on pain ratings for athletes.

Chi-Squared test

Implement and print the results of chi-squared test for goodness of fit

chisq.test(variable) or chisq.test(variable,probabilities)

It compared the observed values against expected values obtained from a null hypothesis.

Here we try to test the Sepal.Length for goodness of fit. The low P-value suggest we can reject the null hypothesis.

We will now look at how to perform a chi-squared test in R.

For this test we need a catagorical data set and R has that data pre-installed in it.

The data set is called HairEyeColor

> HairEyeColor

, , Sex = Male

Eye

Hair Brown Blue Hazel Green

Black 32 11 10 3

Brown 53 50 25 15

Red 10 10 7 7

Blond 3 30 5 8

, , Sex = Female

Eye

Hair Brown Blue Hazel Green

Black 36 9 5 2

Brown 66 34 29 14

Red 16 7 7 7

Blond 4 64 5 8

1. Call that data using the data() function. Run the data(HairEyeColor) and it becomes active in the workspace.

> data(HairEyeColor)

2. To handle the data more efficiently save it in a user defined variable say mydata and put it in a data frame.

> mydata<-data.frame(HairEyeColor)

3. Then do a head(mydata) to have a look at the data set.

> head(mydata)

Hair Eye Sex Freq

1 Black Brown Male 32

2 Brown Brown Male 53

3 Red Brown Male 10

4 Blond Brown Male 3

5 Black Blue Male 11

6 Brown Blue Male 50

> mydata

Hair Eye Sex Freq

1 Black Brown Male 32

2 Brown Brown Male 53

3 Red Brown Male 10

4 Blond Brown Male 3

5 Black Blue Male 11

6 Brown Blue Male 50

7 Red Blue Male 10

8 Blond Blue Male 30

9 Black Hazel Male 10

10 Brown Hazel Male 25

11 Red Hazel Male 7

12 Blond Hazel Male 5

13 Black Green Male 3

14 Brown Green Male 15

15 Red Green Male 7

16 Blond Green Male 8

17 Black Brown Female 36

18 Brown Brown Female 66

19 Red Brown Female 16

20 Blond Brown Female 4

21 Black Blue Female 9

22 Brown Blue Female 34

23 Red Blue Female 7

24 Blond Blue Female 64

25 Black Hazel Female 5

26 Brown Hazel Female 29

27 Red Hazel Female 7

28 Blond Hazel Female 5

29 Black Green Female 2

30 Brown Green Female 14

31 Red Green Female 7

32 Blond Green Female 8

4.Just to make things a little more flexible attach the newly made mydata to the workspace.

> attach(mydata)

5. After clarifying the structure of the data set do the chi-squared test.

6. Next for doing the chi-squared test we select two variables from the data set say Hair and Eye and run the command:

> chisq.test(Hair,Eye)

Pearson's Chi-squared test

data: Hair and Eye

X-squared = 0, df = 9, p-value = 1

Warning message:

In chisq.test(Hair, Eye) : Chi-squared approximation may be incorrect

After running the test we find that these two variables are not at all related and have no significance between each other, suggested by the P-value which is equal to 1.

7. Hence we move on to choose another set of variables and take Hair and Freq for the test.

> chisq.test(Hair,Freq)

Pearson's Chi-squared test

data: Hair and Freq

X-squared = 81.333, df = 63, p-value = 0.05993

Warning message:

In chisq.test(Hair, Freq) : Chi-squared approximation may be incorrect

This time we find that there is some amount of significance between these two variables, suggested by the P-value which is equal to 0.05993, very close to the significance level 0.05 which is generally used as benchmark.

T-test

Pairwise t-tests is used to check if there is any difference in paired values. For example marks obtained by a student before and after the training.

To illustrate this the anorexia data set from the MASS package is used.

The data contains pre-treatment and post-treatment weight of paitients.

To implement in R we use t.test() function.

> library("MASS")

> attach(anorexia)

> t.test(Prewt,Postwt,paired=TRUE)

Paired t-test

data: Prewt and Postwt

t = -2.9376, df = 71, p-value = 0.004458

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-4.6399424 -0.8878354

sample estimates:

mean of the differences

-2.763889

The first attributes are the features to be compared and the attribute paired = TRUE specifies that it is pairwise t-test. The low P-value suggests that the null hypothesis can be rejected i.e. There is a difference between the weights before and after treatment.

Independent T tests

Now we will look at independent t tests.

The independent t tests are used in comparing two values where the value of one variable is not directly related to the other variable. For example marks of students in two different schools.

Here we implement the t test on the Sepal.Length and Sepal.Width of the iris dataset.

> attach(iris)

> t.test(Sepal.Length,Sepal.Width)

Welch Two Sample t-test

data: Sepal.Length and Sepal.Width

t = 36.463, df = 225.68, p-value < 2.2e-16

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

2.63544 2.93656

sample estimates:

mean of x mean of y

5.843333 3.057333

From the results it can be seen that P-value is almost 0 and hence, the null hypothesis that there is no relation can be rejected.

Performing T tests using R

One sample t test

Use the anorexia data set for this purpose.

1. Choose any variable from the data set to perform the test say Prewt.

> t.test(Prewt,mu=3)

One Sample t-test

data: Prewt

t = 130.02, df = 71, p-value < 2.2e-16

alternative hypothesis: true mean is not equal to 3

95 percent confidence interval:

81.19051 83.62615

sample estimates:

mean of x

82.40833

The mu parameter corresponds to the null hypothesis.

3. The output shows the name of the test which is one sample t test. followed by the t-value, degrees of freedom and the P-value.

4. Then it tells you about the alternate hypothesis which in our case is true mean is not equal to 3. Since we passed 3 as the value for the mu parameter in our test.

5. Next it shows the 95% confidence values of the variable and finally the mean of the values of the variables.

Two Sample T tests.

1. Choose any two variables of the data set to perform the test, say Prewt and Postwt.

2. Run the command

> t.test(Prewt,Postwt)

Welch Two Sample t-test

data: Prewt and Postwt

t = -2.4528, df = 121.36, p-value = 0.0156

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-4.9946831 -0.5330947

sample estimates:

mean of x mean of y

82.40833 85.17222

3. The output shows the name of the test i.e. "Welch Two Sample t-test" followed by the t-value, degrees of freedom and the P-value.

4. Then it tells about the alternate hypothesis which in out case is "true difference in means is not equal to 0".

5. Next it shows the 95% confidence values of the combinations of the variables and finally the means of the values of the variables.

Pairwise T test

1. Choose any two variables from the data set to perform the test, say Prewt and Postwt.

2. Run the command

> t.test(Prewt,Postwt,paired=TRUE)

Paired t-test

data: Prewt and Postwt

t = -2.9376, df = 71, p-value = 0.004458

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-4.6399424 -0.8878354

sample estimates:

mean of the differences

-2.763889

3. The output shows the name of the test i.e. "Paired t-test" followed by the t-value, degrees of freedom and the P-value.

4. Then it tells about the alternate hypothesis which in out case is "true difference in means is not equal to 0".

5. Next it shows the 95% confidence values of the combinations of the variables and finally the mean of the difference of the corresponding values of the variables.

Case Study

A Construction company took a project of building a luxary residential apartments in phoenix city. The project head Andrew decided to make an analysis on other residential apartment constructions near by phoenix in order to avoid management difficulties and to easily compete with their competitors. So he asked his collegues to collect all relevant information that helps him in finding useful insights. He wants to find all the business insights possible so that he can compete with all other competitors and also avoid taking much risk in terms of decision making and investments. His collegues started collecting relevant data. For this he did a survey of various construction sites in and around the locality. The final data that he collected contained information on 12 variables.

ID Local_Price Bathroom Area_SqFt Liv_Sp_SqFt Garrage_Cnt

Rooms_Cnt Bedrooms_Cnt Age_yr Material Level Selling_Price

Let us assume that you are the analytics expert in the company and have been assigned the task of working with the data to get insights. Now we have the data gathered and ready for analysis.

Now we have the data gathered and ready for analysis:

Import data into R.
> mydata <- read.csv("c:/house.csv", header=TRUE)
We will see the first 6 data entries and check if we have chosen the correct data set.
> head(mydata)
Let us check the min, mean, max, 1st and 3rd quartile
Material and Level are categorial showing the count
> summary (mydata)
Though R has individual statistic functions like "sd", "var" functions to obtain these values we will use a library package to obtain al these statistics.
> install.package("pastecs")
> library(pastecs)
We use the following function to obtain the comprehensive statitics.
> stat.desc(mydata)
Following statistics are shown:
nbr.val
nbr.null
nbr.na
min
max
range
sum
median
SE.mean
CI.mean.0.95
var
std.dev
coef.var

> attach(mydata)
> hist(Selling_Price)
Most of the houses are sold in 20 to 40 million

Subsetting the data
mydata[mydata$Selling_Price>80,]

aggregate(Selling_Price~Bedrooms_Cnt, data=mydata, FUN="mean")
aggregate(Selling_Price~Garrage_Cnt, data=mydata, FUN="mean")
cor(mydata[unlist(lapply(mydata,is.numeric))])
cor.test[Rooms_cnt,Selling_Price]
fit <-aov(Selling_Price~Level,data=mydata)
fit
plot(Selling_Price~Level)
fit < aov(Selling_Price~Material,data=mydata)
fit
summary(fit)

Case Study - Independent T tests

A researcher studying the effect of a drug on the growth of tumors whishes to find if the drug reduces the growth of tumors. The researcher obtains mice with tumors and divides them into two groups. One group of mice were injected with the drug, and other group was used as the control group. After one week, the tumors were weighed again and the results were noted as follows: The research question is, does the drug have an effect of reducing the tumor size?

Treated with Drug Not Treated with Drug

0.72 0.71

0.68 0.83

0.69 0.89

0.66 0.57

0.57 0.68

0.66 0.74

0.7 0.75

0.63 0.67

0.71 0.8

0.73 0.78

Since these two belong to two different groups of mice and have no dependencies with each other. We will use independent t tests.

tumordata <- read.csv("c:/tumor.csv", header=TRUE)

head(tumordata)

boxplot(tumordata)

attach(tumordata)

ttest(treatedwithdrug,nottreatedwithdrug, alternative="1")

Business Analytics using R

Tuesday, February 23, 2016

3 Basic Analytic Techniques using R

Objectives

Introduction to R

Basic R

Data exploration using R

Viewing Data Frame

Dimensions of data frame

Attribute of Data Frame

Viewing Column data

Viewing row data

Demo in R

Summarizing data in R

Individual Summaries

Subset Summary

Data Visualization in R

Pie Charts

Bar Plots

Box Plot

Histograms

Demo

Case Study - Data Exploration

Correlation:

Example of correlation in R

Case Study - Correlation

ANNOVA (Analysis of Variance)

Case Study - ANNOVA

Chi-Squared test

T-test

Independent T tests

Performing T tests using R

One sample t test

Two Sample T tests.

Pairwise T test

Case Study

Case Study - Independent T tests

No comments:

Post a Comment