Tuesday, February 23, 2016

3 Basic Analytic Techniques using R

Objectives


  • Get a basic introduction to R
  • Understand exploration of data
  • Explore data using R
  • Visualize data using R
  • Understand diagnostic analytics
  • Implement diagnostic analytics using R
  • Understand these concepts with the help of case studies

Introduction to R

  • Programming language for graphics and statistical computations
  • Available freely under the GNU public license
  • Used in data mining and statistical analysis
  • Included time series analysis, linear and non-linear modeling among others
  • Very active community and pacakage contributions
  • Very little programming knowledge necessary
  • Can be downloaded from http://www.r-project.org/
  • R-Studio is and IDE for programming in R that is freely available (optional)

Basic R

  • Packages
    • install.packages('package_name')
    • library(package_name)

  • Loading Data
    • data(dataset_name)
    • read and write functions
    • getwd() and setwd()
    • read and write functions use full path name
    • Example: read.csv("C:/Rtutorials/Sampledata.csv")

  • Assignment operator: "<-"
  • Help ?function_name

Data exploration using R

  • Basic functions for data exploration in R
  • Data stored as "data frames"
  • Data frames - tabular representation of data with rows and columns
  • There are other types of data as well, including tables, matrices, vectors and single values
Sample data frame (Sir Donald Fisher's Iris data set)
Sapal length Sepal Width Petal Length Petal Width Species
7.9 3.8 6.4 2 I.virginica
7.1 3 6.1 2.3 I.virginica
5.6 2.5 3.9 1.1 I.versicolor
5.6 2.8 4.9 2 I.virginica
5.5 4.2 1.4 0.2 I.setosa
5.5 3.5 1.4 0.2 I.setosa
7.1 3 5.9 2.1 I.virginica

The three species have 50 entries each forming total of 150 entries. Each row denotes features of a particular flower.

Viewing Data Frame

iris
head (iris)
tail (iris)
head (iris,2)
tail (iris,2)

Dimensions of data frame


  • View the dimensions of the data set
    • dim(iris)
      [1] 150 5
  • View the number of columns
    • ncol(iris)
      [1] 5
  • View the number of rows
    • nrow(iris)
      [1] 150

Attribute of Data Frame


  • View column names/headers
    • names(iris)
  • View all attributes
    • attributes(iris)
    • class (iris)
      [1] data.frame
  • class (iris$sepal.length)

Viewing Column data

View Column data
iris@Petal.Length
iris[,"Petal.Length"]

Viewing row data 


  • Viewing particular row
    • iris[10:15,]
  • View particular rows of a single column
    • iris[10:15,"Petal.Length"]

Demo in R

install("forecast")
library("forecast")
data(iris) ... to load the data set iris
attach(iris) ... to save the data set iris in workspace
detach(iris) ... to detach the dataset iris from workspace
mydata<-iris ... we load the dataset into a user defined variable mydata
mydata=iris  ... we can also use the equal sign instead
head(mydata) ... view the first 6 rows to check the structure of data set
CTRL + l     ... clears the screen of current workspace
getwd()      ... returns the current working directory we are in
setwd("c:/") ... to set the current working directory
data<-read.csv("c:/europe.csv",header=TRUE)
             ... to load a csv file into a user variable
head(data)   ... check the structure of the data 
data.frame(data)
             ... covert data into data frame
tail(data)   ... returns last 6 rows
head (data,12) 
             ... returns first 12 rows of data
dim(data)    ... shows the dimension of data i.e. number of rows and columns
[1] 28 7
ncol(data)   ... shows the number of columns
[1] 7
nrow(data)   ... shows the number of rows
[1] 28
names(data)  ... shows the column names
[1] "Country" "Area" "GDP" "Inflation" "Life.expect"
[6] "Military" "Pop.growth"
data$Country ... shows the data in column name 'Country'
data[,"GDP"] ... shows the data in column name 'GDP'
data[,3]     ... shows the data in column number 3
data[7,]     ... shows the data in row number 7
data[7:16,]  ... shows the data from row number 7 to 16

Summarizing data in R

  • summary(data_frame)
    • summary(iris)
  • Summary function displays the following:
    Min, 1st Qtr, Median, Mean, 3rd Qtr, Max Values for Numerical Data
    For catagorical data like species it displays the table of different values and its frequencies
  • The same table can be displayed separately by using the following command:
  • table(dataframe$columnname)
    • table(iris$Species)
  • Same table can be displayed using the following command for a specific column.
    • summary(iris$Sepal.Lenght)

Individual Summaries

  • min(column_name)
  • max(column_name)
  • range(column_name) i.e. maximum-minimum
  • mean(column_name)
  • median(column_name)
  • IQR(column_name) i.e. Inter quartile range
  • sd(column_name) i.e. Standard Deviation
  • var(column_name) i.e. Variance

Subset Summary

  • aggregate() - function for column wise aggregation
    Numerical summaries for a subset of data
  • aggregate(formula, data, function)
  • aggregate(.~Species, data =  iris, FUN= mean)
    Where,
    Species is the column name for which aggregation is done
    . indicates all columns
    iris is the data set
    mean is the function
  • aggregate(Sepal.Length~Species, data =  iris, FUN= mean)
    Species is the column name for which aggregation is done.
    Sepal Length is the column which only will be aggregated.
    iris is the data set
    mean is the function
  • Apart from mean there are many other aggregate functions you can use like sum.

Data Visualization in R

  • plot() - generic function for plotting in R
  • The function can be used to plot a variety of graphs on a variety of data including data frames, time series and even vectors.
  • The plot function plots a scatter plot by default. Other plots can be created by using a type attribute. You can check the help section for more details on the plot function.
  • plot(iris)
    Shown here is a plot of iris data. Plotting the data frame it creates a pair wise data plot for all the attributes in the data frame.
 plot() function can be used to plot one variable against another
For example let us plot Sepal Length against Species
We will use a few optional attributes of the plot function.
main to specify the title of the plot
xlab the x axis label
ylab the y axis label

plot(iris$Sepal.Length,iris$Species, 
     mail = "IRIS Data", 
     xlab = "Sepal Length",
     ylab = "Species")

Pie Charts

  • Pie charts to visualize the numerical proportions of different classes through sectors of the circle.
    • pie(table(iris$Species), main = "Iris Data by Species")
      • The table function is used to create a frequency table and then the pie function is called to create the chart of the table. 
      • The main attribute is used to specify the title for the chart.
  • Here is an example chart showing the different species of the IRIS data. The circle is divided in the three equal sectors i.e. for the three species.

Bar Plots

Bar Plots are used to dipict values in vertical manner with the height being equivalent to the value that is been shown.

  • USPersonalExpenditure
    • barplot(USPersonalExpenditure, main = "US Personal Expenditure by Year", xlab = "Year", ylab = "Expenditure")

1940 1945 1950 1955 1960
Food and Tobacco 22.2 44.5 59.6 73.2 86.8
Household Operation 10.5 15.5 29 36.5 46.2
Medical and Health 3.53 5.76 9.71 14 21.1
Personal Care 1.04 1.98 2.45 3.4 5.4
Private Education 0.341 0.974 1.8 2.6 3.64

Box Plot

Box plot are used to show the numerical data with their quartile ranges also called a box whisker plot. Boxes show the inter quartile region with the middle line equal to the median. The whiskers show the lower and upper quartiles. And the points show the outliers.


  • boxplot(Sepal.Length~Species, data = iris, main = "IRIS Data Set", xlab = "Species Type", ylab = "Sepal Lenght")


Histograms 

Histograms are used to depict the frequency distribution data

  • islands data set
  • hist(iris$Sepal.Length, main = "IRIS Data Set", xlab = "Sepal Length", ylab = "Frequency")

In addition to these attributes plots can have other attributes such as:
x and y axis limits
color to points or bar or lines, etc.
They can be found on the help section for each plot.

Demo

  • plot(iris)
  • plot(x,y)
  • plot(iris$Sepal.Length, iris$Species)
  • pie(table(iris$Species))
  • barplot(table(iris$Species))
  • boxplot(Sepal.Length~Species,data=iris)
  • hist(iris$Sepal.Length)

Case Study - Data Exploration

Let now take a data set from Bank as an example for doing data exploration

Officer David from the sales department has been asked to do a study on the overall performance of the bank loans. As a part of investigation, David has asked his colleague to categorize the loans as seasoned or bad loans with respect to Age category.
  • The following data has been captured:
    • Age Group 
    • Total number of loans
    • Number of good loans
    • Number of bad loans
    • Percentage of good loans
    • Percentage of bad loans
A look at the data:
The given table is of the bank loan data
The number of good loans and bad loans has been tabulated as per the age category.
There are 15 catagories of age in the table.
Given the total number of loans and number good and bad loans the percentage of good and bad loans has been tabulated.
Now we will look at different statistics and graphical visualization for the data.

Since it is a preloaded data in R we will not be using any read command however if you wish to save it to an excel file use the write.csv command:
The write.csv command takes two arguments the data frame name and the name of the file you wish to write it to.
> write.csv(mtcare,"MotorTrendsCarData.csv")
The file MotorTrendsCarData.csv will be saved in your working directory.

Since it is a preloaded data in R we will not be using any read command however if you wish to save it to an excel file use the write.csv command:
The write.csv command takes two arguments the data frame name and the name of the file you wish to write it to.
> write.csv(mtcare,"MotorTrendsCarData.csv")
The file MotorTrendsCarData.csv will be saved in your working directory
First let us view the data set.
>mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
>
Let us find the class of the data set in R
> class(mtcars)
[1] "data.frame"
From the output we can infer that the data is stored as a data frame.  

Let us confirm the number of rows and columns in the data set.
> dim (mtcars)
[1] 32 11

The output shows 32 and 11 with a space in between indicating 32 rows and 11 columns
Let us explore the data in detail
Use head(mtcars) to explore the first 6 rows
> head (mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Though we already established the types of the variables let us confirm it by looking at this data.
We can see that vs or am has either 0 or 1 in the data and thus has been rightly classified as catagorical data.
The number of cyl gears and carb are classified as discrete numerical values and rest of the data are continuous numeric format.

To look at all the types if cars
> attributes(mtcars)
$names
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

$row.names
 [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"          "Hornet 4 Drive"     
 [5] "Hornet Sportabout"   "Valiant"             "Duster 360"          "Merc 240D"          
 [9] "Merc 230"            "Merc 280"            "Merc 280C"           "Merc 450SE"         
[13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood"  "Lincoln Continental"
[17] "Chrysler Imperial"   "Fiat 128"            "Honda Civic"         "Toyota Corolla"     
[21] "Toyota Corona"       "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
[25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"       "Lotus Europa"       
[29] "Ford Pantera L"      "Ferrari Dino"        "Maserati Bora"       "Volvo 142E"         

$class

[1] "data.frame"

The first section displays the column names.
The second section has the row names, which consists of the model names of all 32 cars
Finally the class of the variables displayed. i.e. data frame.

Let us find what values the data field gear contains
> mtcars$gear
 [1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
>
All the values of the gear column are displayed.
You can see the values 3 4 and 5 are repeated in the data set.
Similarly you can check the number of cyl and carb too.

Suppose you would like to know the attributes of a particular car. For example the "Ferrari Dino". We will use the format to view the row data here.

> mtcars["Ferrari Dino",]
              mpg cyl disp  hp drat   wt qsec vs am gear carb
Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6

You can now see the attributes of the "Ferrari Dino" displayed in a single row.

Now that we have explored the data we will now look at summarizing the data to get some meaningful insights.
Before doing this let us first convert the data type for "vs" and "am" to factor.
They are numeric by default.
mtcars$am<-as.factor(mtcars$am)
mtcars$vs<-as.factor(mtcars$vs)
summary(mtcars)

To display all the summary statistics.
From the first section we can infer that the maximum miles per gallon(mpg) is 33.90 and the average is 20.09. The minimum mpg is 10.40.
Similarly, we can note the max, min and other statistics for the rest of the numeric data.
For the categorical data the number of entries in each group is displayed.
For example for the am column it can be inferred that out of 32 cars 19 are automatic and 13 are manual.

What if we need to know which car among the 32 gives the best milage?
The which.max() function is used to return the index of the maximum value in the data set.
> mtcars[which.max(mtcars$mpg),]
which.max() returns the index of the maximum value for mpg variable and attaching that with the mtcars variable displays the row with the maximum value for mileage.
You can see that "Toyota Collolla" has the maximum milage in the given set of cars.

Similarly the row with the minimum value can be found using the which.min() function.
Let us find the lightest car in the given set of cars.
> mtcars[which.min(mtcars$wt),]
We now know that the "Lotus Europa" is the lightest of the cars.

The commands mentioned now can thus be used to find the maximum or minimum of any variable and the corresponding car data.

Suppose we want to find the milage of the cars with three gears.
We will use the aggregate() function to aggregate the mpg variable by the variable gear.
and the aggregation function will be average.
> agrregate(mpg~gear, data=mtcars, FUN="mean")

The output displays the two columns the number of gears and the means of the mileages.
We can infer that the cars with 3 gears have the mileage of 16.1 mpg.The cars with 4 gears are the best in terms of milage, giving the highest average mileage.
The aggegate() function can be similarly used to find the average speed of an automatic transmission vehicles and so on.

Following mathematical exploration techniques we will now visualize data and try to support the inferences made through the earlier commands.

1. Let us start with a scatter plot that includes all the factors in the data set being plotted against each other.
> plot(mtcars)
A new window appears with a plot with all the variables plotted against every other variable in the data set.

2. This creates a very generic plot of the variables. And it is hard to make any useful inferences from this plot. Suppose we want to know how the horse power (hp) is related to the displacement (disp). Let us use the plot function with formula hp~disp to get the corresponding plot.
>attach(mtcars)
>plot(hp~disp)
We can infer from the scatter plot that the horse power even though gradually as the displacement increases. You can see one data point that might occur as outlier. The horse power is quite high for large displacement vehicle. This shows that there are other factors that have an effect on horse power.

3. Let us perform a similar plot with cyl verses the milage.
> plot(cyl~mpg)
You can see that as the number of cylinders increses the milage decreases and the effect is quite visible.
This plotting can be done across every combination of variables and such inferences can be obtained. Though it is statistically possible to create plots for any two variables during a case study it is advised to understand the data before plotting them. For example a plot of the employee ID versues  age will work in R that will not help researcher getting any meaningful insights since they are logically unrelated.

4. Lets work on the categorical variables. We have the am variable which has two values 0 for automatic and 1 for manual transmissions. The summary statistics will show us the number of automatic and manual cars. But if we have thousands of data it will be difficult to obtain a ratio of two categories. You can graphically view the ratio using the pie charts as seen in the previous lessons. We will use the pie function. We will create a table of the am variable using table(mtcars$am) and pass it as an argument to the pie function.
>pie(table(mtcars$am)
From the chart it is seen that almost 60% of the cars are roughly an automatic cars in this case.

5. Now let us do a very simple pie chart plotting. Let us consider plot for am data. am is for automatic or manual property of cars. So if it is needed to find if number of automatic given by 1 is more or less than number of manual this simple plot can be done.
Type some name of the data frame say auto to extract the data of am in the data set of mtcars.
>auto=mtcars$am
Then it is needed to make a table. So make another data frame to assign table auto.
>matic=table(auto)
Then for the pie plot type:
> pie(matic)
Now from the pie plot it is evident that the number of manual is more than number of automatic.

6. Similarly create a pie chart for number of gears.
> auto=mtcars$gears
> matic=table(auto)
> pie(matic)
We can infer that almost half the cars have 3 gears. The other half is distributed between 4 and 5 geared automobiles, with cars having 5 gears having the least number.

7. The contineous variables can be viewed using boxplot as we have seen earlier. Let us create the box plot of the horse power (hp) grouped by the number of cylinders.
> boxplot(hp~cyl, data=mtcars)
You can see that the horse power increases as the number of cylinders increase.
For cars with 8 cylinders there is a lot of variation as can be seen the wider box and whiskers in the plot. Out of the three cars with 6 cylinders have very less variation though there seems to be an outlier as seen by the dot in the graph. It can also be seen that the median of the horsepower of the cars with 4 and 6 cylinders do not differ much.

8. Let us make few bar plots and try to make sense of the graphs. Suppose you want to know the number of cars grouped by the number of gears. To know this use table(mtcars$gear) to create a table of the counts and pass this as an argument to the bar plot function.

> table(mtcars$gear)
> barplot(table(mtcars$gear))
From the plot we can find that maximum number of cars i.e. 15 cars have 3 gears, 12 cars have 4 gears and 5 cars have 5 gears.
Similarly this can be done for number of cylinders.
> barplot(table(mtcars$cyl))
Here we can see that most cars in a data set have 8 cylinders and the least cars have 6 cylinders. This inference from the bar plot can be done for all the street grouping factors.

9.As the final discriptive statistic let us make the histogram plot for the horse power of the cars.
> hist (mtcars$hp)
The histogram divides the horse power into the bucket of 50 units starting at 50 and ending at 350. Thus having 6 bars. We can see that most cars fall into the 100 to 150 horse power range and only 2 cars have horse power of above 250

This concludes the case study on descriptive statistics using R.

Correlation:

Let us look at the function to test the correlation between two variables.
- Class of statistical relationship between variables that forms any form of dependence.
For example is there a correlation between the height of parents and their offspring.
In the example given we try to find if there is a correlation between the Sepal Length and Width of a flower.
- cor.test(col1,col2)
In R correlation can be calculated using the cor.test() function.
By default the function calculates the Pearson's correlation coefficient.
Let's look at how to interpret the results.

>attach (iris)
>cor.test(Sepal.Length, Sepal.Width)
Pearson's product-moment correlation.
data: Sepal.Length and Sepal.Width
t = -1.4403, df =  148, p-value=0.1519
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
  -0.27269325   0.04351158
sample estimates:
          cor
-0.1175698

The important value in this calculation is the p-value which is calculated using the t statistics and the degrees of freedom. As seen in earlier chapters if the p-value is less than 0.05 we can conclude that the null hypothesis is rejected. i.e. There is no correlation between the two variables. The correlation coefficient is given as 0.1175698. This means that there might be a negative correlation between the two variables. But since the p-value is quite high. We conclude that the result is not significant. i.e. the correlation is almost 0.


Example of correlation in R

In the given data set we are going to perform a correlation test two variables on a data set to acertain the significance of the relation in values of the variables. We will use the anorexia data set from the mass package for the correlation.
Load the package by typing:
> library("MASS")

> anorexia
   Treat Prewt Postwt
1   Cont  80.7   80.2
2   Cont  89.4   80.1
3   Cont  91.8   86.4
4   Cont  74.0   86.3
5   Cont  78.1   76.1
6   Cont  88.3   78.1
7   Cont  87.3   75.1
8   Cont  75.1   86.7
9   Cont  80.6   73.5
10  Cont  78.4   84.6
11  Cont  77.6   77.4
12  Cont  88.7   79.5
13  Cont  81.3   89.6
14  Cont  78.1   81.4
15  Cont  70.5   81.8
16  Cont  77.3   77.3
17  Cont  85.2   84.2
18  Cont  86.0   75.4
19  Cont  84.1   79.5
20  Cont  79.7   73.0
21  Cont  85.5   88.3
22  Cont  84.4   84.7
23  Cont  79.6   81.4
24  Cont  77.5   81.2
25  Cont  72.3   88.2
26  Cont  89.0   78.8
27   CBT  80.5   82.2
28   CBT  84.9   85.6
29   CBT  81.5   81.4
30   CBT  82.6   81.9
31   CBT  79.9   76.4
32   CBT  88.7  103.6
33   CBT  94.9   98.4
34   CBT  76.3   93.4
35   CBT  81.0   73.4
36   CBT  80.5   82.1
37   CBT  85.0   96.7
38   CBT  89.2   95.3
39   CBT  81.3   82.4
40   CBT  76.5   72.5
41   CBT  70.0   90.9
42   CBT  80.4   71.3
43   CBT  83.3   85.4
44   CBT  83.0   81.6
45   CBT  87.7   89.1
46   CBT  84.2   83.9
47   CBT  86.4   82.7
48   CBT  76.5   75.7
49   CBT  80.2   82.6
50   CBT  87.8  100.4
51   CBT  83.3   85.2
52   CBT  79.7   83.6
53   CBT  84.5   84.6
54   CBT  80.8   96.2
55   CBT  87.4   86.7
56    FT  83.8   95.2
57    FT  83.3   94.3
58    FT  86.0   91.5
59    FT  82.5   91.9
60    FT  86.7  100.3
61    FT  79.6   76.7
62    FT  76.9   76.8
63    FT  94.2  101.6
64    FT  73.4   94.9
65    FT  80.5   75.2
66    FT  81.6   77.8
67    FT  82.1   95.5
68    FT  77.6   90.7
69    FT  83.5   92.5
70    FT  89.9   93.8
71    FT  86.0   91.7
72    FT  87.3   98.0

We will use the two variables of the data set i.e. (Prewt, Postwt)
Run the command:
> attach(anorexia)
> cor.test(Prewt, Postwt)

Pearson's product-moment correlation

data:  Prewt and Postwt
t = 2.9488, df = 70, p-value = 0.004334
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1091425 0.5237424
sample estimates:
      cor 
0.3324062 

The output shows the name of the test i.e. "Pearson's product-moment correlation" in this case followed by a t value, degrees of freedom and the p-value. Then it tells you about the alternate hypothesis which in our case is "true correlation is not equal to 0". Next it shows the 95% confidence values of the correlation and finally the mean of the correlation values. 

Case Study - Correlation

An Icecream company maintains a database of total sales per day and the corresponding temperature every day. Given the data for a particular week, the management wants to find whether temperature is a factor that could potentially cause a difference in their sales.

Sales Temperature
$412 13.5
$615 16.8
$212 11.4
$700 17
$500 14
$456 15.3
$300 13.2

Looking at the data set we can infer that there are two variables sales and temperature. Here sales represents the value of total sales that happens in the icecream company each day and temperature represents the temperature this is coded for the corresponding day measures in celcius. We would be using correlation to check whether there is any relationship between sales and temperature for this company. 

Firstly save the data into a csv file. 
1. Import the file in R
> mydata<-read.csv("icecream.csv")
You can check if the data has been stored properly
> mydata
  Sales Temperature
1   412        13.5
2   615        16.8
3   212        11.4
4   700        17.0
5   500        14.0
6   456        15.3
7   300        13.2
The data will have 7 entries across two columns.
Before working on mydata attached the mydata dataset to the session of mydata.
> attach(mydata)
>
The correlation coefficient is now determined by using the command:

> cor.test(Sales,Temperature)

Pearson's product-moment correlation

data:  Sales and Temperature
t = 6.6333, df = 5, p-value = 0.001173
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6792990 0.9924498
sample estimates:
      cor 
0.9476071 

The correlation result is displayed. It can be inferred that the "Pearson's product-moment" test has been performed. The test can be changed by using the method attribute. The t-value for these variables are calculated as 6.6333. The degrees of freedom is 5 and the p-value is 0.001173. The basic alternative hypothesis is displayed and the 95% confidence interval is calculated  and displayed. "Pearson's correlation coefficient" is displayed in the last line and equals 0.9476071. From the earlier lessons we know that the positive sign in the correlation coefficient implies that they both are positively correlated i.e. an increase in temperature relates to increase in sales. The value is close to 1 which suggest that association between the two variables is very strong. 

ANNOVA (Analysis of Variance)

Analysis of Variance (ANNOVA)
ANNOVA is used to compare the means of different groups.
aov() - generic method to implement Analysis of Variance

Here we show a simple example of one way ANNOVA. To illustrate this we use the insect sprays data set. This data contains the insect count after using 6 different sprays. Here the null hypothesis would be there is no difference between insects by using different sprays.

> InsectSprays
   count spray
1     10     A
2      7     A
3     20     A
4     14     A
5     14     A
6     12     A
7     10     A
8     23     A
9     17     A
10    20     A
11    14     A
12    13     A
13    11     B
14    17     B
15    21     B
16    11     B
17    16     B
18    14     B
19    17     B
20    17     B
21    19     B
22    21     B
23     7     B
24    13     B
25     0     C
26     1     C
27     7     C
28     2     C
29     3     C
30     1     C
31     2     C
32     1     C
33     3     C
34     0     C
35     1     C
36     4     C
37     3     D
38     5     D
39    12     D
40     6     D
41     4     D
42     3     D
43     5     D
44     5     D
45     5     D
46     5     D
47     2     D
48     4     D
49     3     E
50     5     E
51     3     E
52     5     E
53     3     E
54     6     E
55     1     E
56     1     E
57     3     E
58     2     E
59     6     E
60     4     E
61    11     F
62     9     F
63    15     F
64    22     F
65    15     F
66    16     F
67    13     F
68    10     F
69    26     F
70    26     F
71    24     F
72    13     F
> fit <- aov(count~spray, data=InsectSprays)
> summary(fit)
            Df Sum Sq Mean Sq F value Pr(>F)  
spray        5   2669   533.8    34.7 <2e-16 ***
Residuals   66   1015    15.4                
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>

aov() takes the fist attribute as dependent variable and independent variable separated by tilda(~) here it is count~sprays, the data attribute specifies where the data has to be taken from:
> aov(count~spray, data = InsectSprays)
Call:
  aov(formula = count~spray, data = InsectSprays)
Terms:
                     Spray  Residuals
Sum of Squares    2668.833   1015.167
Deg. of Freedom          5         66

Residual standard error: 3.921902
Estimated effects may be unbalanced

> fit < aov(count~spray, data = InsectSprays)

> summary(fit)
           Df  Sum Sq  Mean Sq  F Value   Pr(>F)
spary       5    2669    533.8     34.7   <2e-16   ***
Residuals  66    1015     15.4
---
Signif. codes: 0 '***' 0.001 '**' 0.01  '*' 0.05  '.'  0.1  ' ' 1

After fitting the aov model the summary function is used to display the result. It can be seen that the p-value that is the last column is lesser than 2e-16 which is the least positive number represented by R. Since p-value is almost 0. We can reject the null hypothesis. And thus the inference is that there are difference in results on using different sprays.

Now let us perform ANNOVA in R using the InsectSpary data set.
For ANNOVA a model needs to be fitted where we have to decide on a target variable and one or more independent variables.
1. In our case let us take count as the target variable and spray as the independent variable.
2. For fitting the model use aov() function and save it in a user defined model name say fit.
3. To check the model details run the command: summary(fit)
4. The result shows a table containing 5 major columns that are in order:
Df Degrees of Freedom
Sum Sq Sum of Squares
Mean Sq Mean Squared Error
F value F value
Pr(>f) P value

For each of the independent variables used in the model. Finally the significance of the variables can be checked from the P-value in the table and the important variables can then be used for further advanced modelling.


Case Study - ANNOVA

A pharmaceutical company has tested three formulations of a body pain for athletes. There are 27 volunteers selected and 9 were assigned randomly to one of the medications. The athletes were instructed to take the medication during body pain and report the pain in a scale of 1 to 10. (10 being too much of pain).
Med_1 Med_2 Med_3
4 6 6
5 8 7
4 4 6
3 5 6
2 4 7
4 6 5
Here the table shows the means rating across each athlete for each medication.
The question the company wants to answer is, is there a significant difference in the effect of different medication.
Since there are more than two different groups we will use ANNOVA to test the significance differences between means of all three medications.

1. Save the file in csv format
Import the data using the following command and save it in mydata variable.
> mydata<-read.csv("medication.csv")
> mydata
  Med_1 Med_2 Med_3
1     4     6     6
2     5     8     7
3     4     4     6
4     3     5     6
5     2     4     7
6     4     6     5
7     3     5     6
8     4     8     5
9     4     6     5

2. Alternatively you can use the head(mydata) to display the first 6 rows.

3. To apply R's ANNOVA function the data has to be in a single vector format. To do so let us convert first 3 columns of mydata into a single vector. Use the c function to do the conversion after attaching the data to this session. Save it in a variable painrating.
> attach(mydata)
> painrating<-c(Med_1,Med_2,Med_3)
Now type painrating to see if the vector has been properly stored.
> painrating
 [1] 4 5 4 3 2 4 3 4 4 6 8 4 5 4 6 5 8 6 6 7 6 6 7 5 6 5 5

4. Now the medication type for each entry data needs to be specified. Create a vector that consists of the value of "med_1" for the first 9 entries, "med_2" for the next 9 entries and "med_3" for the last 9 entries. Let us save it in a variable called types. The rep("string",n) function can be used to repeat a given string for a given number of times. To create the types vector we will use the c() function.

> types<-c(rep("med_1",9),rep("med_2",9),rep("med_3",9))
> types
 [1] "med_1" "med_1" "med_1" "med_1" "med_1" "med_1" "med_1" "med_1" "med_1"
[10] "med_2" "med_2" "med_2" "med_2" "med_2" "med_2" "med_2" "med_2" "med_2"
[19] "med_3" "med_3" "med_3" "med_3" "med_3" "med_3" "med_3" "med_3" "med_3"

5. Finally a data frame needs to be created that contains the medication type and its corresponding pain rating.
Let us save the data frame in a variable called bodypainrating.

> bodypainrating<-data.frame(painrating,types)

You can see the dataframe using bodypainrating by typing:
> bodypainrating
   painrating types
1           4 med_1
2           5 med_1
3           4 med_1
4           3 med_1
5           2 med_1
6           4 med_1
7           3 med_1
8           4 med_1
9           4 med_1
10          6 med_2
11          8 med_2
12          4 med_2
13          5 med_2
14          4 med_2
15          6 med_2
16          5 med_2
17          8 med_2
18          6 med_2
19          6 med_3
20          7 med_3
21          6 med_3
22          6 med_3
23          7 med_3
24          5 med_3
25          6 med_3
26          5 med_3
27          5 med_3

Now we have the data frame ready.
6. Before using ANNOVA let us visualize the data in a graphical format. We will be using the plot function to plot the rating as a function of the medication type.
> boxplot(painrating~types, data=bodypainrating)

The boxplot of the ratings is displayed in the new window. Note that there is an outlier under medication 2 rating entries. It can also be noted that the median for medications 2 and 3 are the same and the median for medication 1 is significantly lower than the other two.

7. Use the aov() function to do the ANNOVA analysis.

> annovaresult<-aov(painrating~types, data=bodypainrating)
> summary(annovaresult)
            Df Sum Sq Mean Sq F value   Pr(>F)    
types        2  28.22  14.111   11.91 0.000256 ***
Residuals   24  28.44   1.185                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

6. In the previous result you can see that the value of F statistics is 11.91 with a P-value i.e. 0.000256. Since the P-value is significantly lower than the standard value of 0.05 we can infer that the null hypothesis needs to be rejected. In the context of this study we can conclude that different medication formulations have different impacts on pain ratings for athletes.

Chi-Squared test

Implement and print the results of chi-squared test for goodness of fit
chisq.test(variable) or chisq.test(variable,probabilities)
It compared the observed values against expected values obtained from a null hypothesis.
Here we try to test the Sepal.Length for goodness of fit. The low P-value suggest we can reject the null hypothesis. 

We will now look at how to perform a chi-squared test in R.

For this test we need a catagorical data set and R has that data pre-installed in it.
The data set is called HairEyeColor

> HairEyeColor
, , Sex = Male

       Eye
Hair    Brown Blue Hazel Green
  Black    32   11    10     3
  Brown    53   50    25    15
  Red      10   10     7     7
  Blond     3   30     5     8

, , Sex = Female

       Eye
Hair    Brown Blue Hazel Green
  Black    36    9     5     2
  Brown    66   34    29    14
  Red      16    7     7     7
  Blond     4   64     5     8


1. Call that data using the data() function. Run the data(HairEyeColor) and it becomes active in the workspace.
> data(HairEyeColor)

2. To handle the data more efficiently save it in a user defined variable say mydata and put it in a data frame.
> mydata<-data.frame(HairEyeColor)

3. Then do a head(mydata) to have a look at the data set.
> head(mydata)
   Hair   Eye  Sex Freq
1 Black Brown Male   32
2 Brown Brown Male   53
3   Red Brown Male   10
4 Blond Brown Male    3
5 Black  Blue Male   11
6 Brown  Blue Male   50
> mydata
    Hair   Eye    Sex Freq
1  Black Brown   Male   32
2  Brown Brown   Male   53
3    Red Brown   Male   10
4  Blond Brown   Male    3
5  Black  Blue   Male   11
6  Brown  Blue   Male   50
7    Red  Blue   Male   10
8  Blond  Blue   Male   30
9  Black Hazel   Male   10
10 Brown Hazel   Male   25
11   Red Hazel   Male    7
12 Blond Hazel   Male    5
13 Black Green   Male    3
14 Brown Green   Male   15
15   Red Green   Male    7
16 Blond Green   Male    8
17 Black Brown Female   36
18 Brown Brown Female   66
19   Red Brown Female   16
20 Blond Brown Female    4
21 Black  Blue Female    9
22 Brown  Blue Female   34
23   Red  Blue Female    7
24 Blond  Blue Female   64
25 Black Hazel Female    5
26 Brown Hazel Female   29
27   Red Hazel Female    7
28 Blond Hazel Female    5
29 Black Green Female    2
30 Brown Green Female   14
31   Red Green Female    7
32 Blond Green Female    8


4.Just to make things a little more flexible attach the newly made mydata to the workspace.
> attach(mydata)

5. After clarifying the structure of the data set do the chi-squared test. 

6. Next for doing the chi-squared test we select two variables from the data set say Hair and Eye and run the command:
> chisq.test(Hair,Eye)

        Pearson's Chi-squared test

data:  Hair and Eye
X-squared = 0, df = 9, p-value = 1

Warning message:
In chisq.test(Hair, Eye) : Chi-squared approximation may be incorrect


After running the test we find that these two variables are not at all related and have no significance between each other, suggested by the P-value which is equal to 1.

7. Hence we move on to choose another set of variables and take Hair and Freq for the test. 
> chisq.test(Hair,Freq)

        Pearson's Chi-squared test

data:  Hair and Freq
X-squared = 81.333, df = 63, p-value = 0.05993

Warning message:
In chisq.test(Hair, Freq) : Chi-squared approximation may be incorrect

This time we find that there is some amount of significance between these two variables, suggested by the P-value which is equal to 0.05993, very close to the significance level 0.05 which is generally used as benchmark.

T-test

Pairwise t-tests is used to check if there is any difference in paired values. For example marks obtained by a student before and after the training.
To illustrate this the anorexia data set from the MASS package is used.
The data contains pre-treatment and post-treatment weight of paitients.
To implement in R we use t.test() function.

> library("MASS")
> attach(anorexia)
> t.test(Prewt,Postwt,paired=TRUE)

        Paired t-test

data:  Prewt and Postwt
t = -2.9376, df = 71, p-value = 0.004458
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.6399424 -0.8878354
sample estimates:
mean of the differences 
              -2.763889 


The first attributes are the features to be compared and the attribute paired = TRUE specifies that it is pairwise t-test. The low P-value suggests that the null hypothesis can be rejected i.e. There is a difference between the weights before and after treatment. 

Independent T tests

Now we will look at independent t tests.
The independent t tests are used in comparing two values where the value of one variable is not directly related to the other variable. For example marks of students in two different schools. 

Here we implement the t test on the Sepal.Length and Sepal.Width of the iris dataset. 


> attach(iris)
> t.test(Sepal.Length,Sepal.Width)

        Welch Two Sample t-test

data:  Sepal.Length and Sepal.Width
t = 36.463, df = 225.68, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 2.63544 2.93656
sample estimates:
mean of x mean of y 
 5.843333  3.057333 


From the results it can be seen that P-value is almost 0 and hence, the null hypothesis that there is no relation can be rejected. 

Performing T tests using R

One sample t test

Use the anorexia data set for this purpose.

1. Choose any variable from the data set to perform the test say Prewt.

> t.test(Prewt,mu=3)

        One Sample t-test

data:  Prewt
t = 130.02, df = 71, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 3
95 percent confidence interval:
 81.19051 83.62615
sample estimates:
mean of x 
 82.40833 


The mu parameter corresponds to the null hypothesis.

3. The output shows the name of the test which is one sample t test. followed by the t-value, degrees of freedom and the P-value. 

4. Then it tells you about the alternate hypothesis which in our case is true mean is not equal to 3. Since we passed 3 as the value for the mu parameter in our test.

5. Next it shows the 95% confidence values of the variable and finally the mean of the values of the variables.

Two Sample T tests.


1. Choose any two variables of the data set to perform the test, say Prewt and Postwt.

2. Run the command

> t.test(Prewt,Postwt)

        Welch Two Sample t-test

data:  Prewt and Postwt
t = -2.4528, df = 121.36, p-value = 0.0156
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.9946831 -0.5330947
sample estimates:
mean of x mean of y 
 82.40833  85.17222 


3. The output shows the name of the test i.e. "Welch Two Sample t-test" followed by the t-value, degrees of freedom and the P-value. 

4. Then it tells about the alternate hypothesis which in out case is "true difference in means is not equal to 0".


5. Next it shows the 95% confidence values of the combinations of the variables and finally the means of the values of the variables.

Pairwise T test


1. Choose any two variables from the data set to perform the test, say Prewt and Postwt.

2. Run the command
> t.test(Prewt,Postwt,paired=TRUE)

        Paired t-test

data:  Prewt and Postwt
t = -2.9376, df = 71, p-value = 0.004458
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.6399424 -0.8878354
sample estimates:
mean of the differences 
              -2.763889 

3. The output shows the name of the test i.e. "Paired t-test" followed by the t-value, degrees of freedom and the P-value. 

4. Then it tells about the alternate hypothesis which in out case is "true difference in means is not equal to 0".

5. Next it shows the 95% confidence values of the combinations of the variables and finally the mean of the difference of the corresponding values of the variables.

Case Study

A Construction company took a project of building a luxary residential apartments in phoenix city. The project head Andrew decided to make an analysis on other residential apartment constructions near by phoenix in order to avoid management difficulties and to easily compete with their competitors. So he asked his collegues to collect all relevant information that helps him in finding useful insights. He wants to find all the business insights possible so that he can compete with all other competitors and also avoid taking much risk in terms of decision making and investments. His collegues started collecting relevant data. For this he did a survey of various construction sites in and around the locality. The final data that he collected contained information on 12 variables.
ID Local_Price Bathroom Area_SqFt Liv_Sp_SqFt Garrage_Cnt
Rooms_Cnt Bedrooms_Cnt Age_yr Material Level Selling_Price

Let us assume that you are the analytics expert in the company and have been assigned the task of working with the data to get insights. Now we have the data gathered and ready for analysis.

Now we have the data gathered and ready for analysis:

Import data into R.
> mydata <- read.csv("c:/house.csv", header=TRUE)
We will see the first 6 data entries and check if we have chosen the correct data set.
> head(mydata)
Let us check the min, mean, max, 1st and 3rd quartile
Material and Level are categorial showing the count
> summary (mydata)
Though R has individual statistic functions like "sd", "var" functions to obtain these values we will use a library package to obtain al these statistics.
> install.package("pastecs")
> library(pastecs)
We use the following function to obtain the comprehensive statitics.
> stat.desc(mydata)
Following statistics are shown:
nbr.val
nbr.null
nbr.na
min
max
range
sum
median
SE.mean
CI.mean.0.95
var
std.dev
coef.var

> attach(mydata)
> hist(Selling_Price)
Most of the houses are sold in 20 to 40 million

Subsetting the data
mydata[mydata$Selling_Price>80,]

aggregate(Selling_Price~Bedrooms_Cnt, data=mydata, FUN="mean")
aggregate(Selling_Price~Garrage_Cnt, data=mydata, FUN="mean")
cor(mydata[unlist(lapply(mydata,is.numeric))])
cor.test[Rooms_cnt,Selling_Price]
fit <-aov(Selling_Price~Level,data=mydata)
fit
plot(Selling_Price~Level)
fit < aov(Selling_Price~Material,data=mydata)
fit
summary(fit)


Case Study - Independent T tests

A researcher studying the effect of a drug on the growth of tumors whishes to find if the drug reduces the growth of tumors. The researcher obtains mice with tumors and divides them into two groups. One group of mice were injected with the drug, and other group was used as the control group. After one week, the tumors were weighed again and the results were noted as follows: The research question is, does the drug have an effect of reducing the tumor size?

Treated with Drug Not Treated with Drug
0.72 0.71
0.68 0.83
0.69 0.89
0.66 0.57
0.57 0.68
0.66 0.74
0.7 0.75
0.63 0.67
0.71 0.8
0.73 0.78

Since these two belong to two different groups of mice and have no dependencies with each other. We will use independent t tests.

tumordata <- read.csv("c:/tumor.csv", header=TRUE)
head(tumordata)
boxplot(tumordata)
attach(tumordata)
ttest(treatedwithdrug,nottreatedwithdrug, alternative="1")

      No comments:

      Post a Comment