dataminer: November 2013

Saturday, November 16, 2013

One Way ANOVA

We perform t-test to test if two samples have the the same mean. More specifically, we test the null hypothesis that there is difference between the means.

Similarly, ANOVA tells us if the means of three or more samples are same or not. ANOVA is an omnibus test which means that it will tell us that means are same (or not same) but, it will not give you specific information (in the case when means are not same) about which means are not equal.

In this post we will perform ANOVA test on a dataset and find out where the difference lies.

Download the data here
setwd("e:/r") #I have kept the data file in this location
d <- read.csv("labs.csv")
d
boxplot(d)
s=stack(d) #This step is needed to prepare the data.
s #See how the arrangement of the data has changed
names(s) = c("measure","lab")
s
diff <- aov( measure~lab, data=s) #response variable (measure) comes first
summary(diff)

#We reject the assumption of no difference because the p-value suggests that
#there is a significant difference across the 3 labs.
#If there were no difference, the investigation would have ended here.

#As there is significant difference in this case, we need to find out where
#the difference lies. We need to perform pairwise comparison test. We have many
#options for this. One of them is Tukey's HSD test that gives us the intervals.
#We must remember that this test only works the design is balanced, in other
#words, data points for each lab must be same. In our case we have a blanced design.
#Run the test
tk <- TukeyHSD(diff)

#Output
# Tukey multiple comparisons of means
# 95% family-wise confidence level

#Fit: aov(formula = measure ~ lab, data = s)

#$lab
# diff lwr upr p adj
#lab2-lab1 1.75 -1.8842297 5.38423 0.4584177
#lab3-lab1 6.25 2.6157703 9.88423 0.0008182
#lab3-lab2 4.50 0.8657703 8.13423 0.0137160

#Each of the last 3 rows contain pairwise comparison results.
#Look at Row 1: diff column shows the mean of difference between lab2 and lab1 is 1.75
#lwr column provides the lower limit of difference at 95% confidence level
#upr column provides the upper limit of difference at 95% confidence level
#Last column: p adj gives the p-value 0.4584177; we cannot reject the assumption of no-difference between lab2 and lab1
#Row 2 and row3 indicates (p-value less than 0.05) that there are significant differences between lab3 & lab1; and lab3 & lab2
#Let's take a look at visual plot

plot(tk)

#Take a look at the plot. The dotted line reprents zero. Zero is within the limits of 95% confidence interval of the difference
#between lab2 and lab1 indicating that there is no significant difference between lab2 and lab1.
#But there are significant differences between lab3 & lab1, and lab3 & lab2.

Friday, November 15, 2013

More on Data Selection and Manipulation

###############More on Data Selection and Manipulation########
#Let's take a look at 'iris' dataset
iris
#R displays the dataset
#Lets see the name of variables
names(iris)

#Let's look at Petal.Length. You need to prepend the name of the dataset and a $

iris$Petal.Length

#How about Sepal.Length
iris$Sepal.Length

#if you don't want to prepend the name of the dataset and a $ evertime, do this
attach(iris)

#Now you can type only the name of the variable
Sepal.Length

#Let's look at other ways of retrieving the variables
iris[,1]

#R displays Sepal.Length which is the first variable of iris
iris[1,]
#R displays the first row of the dataset

#Let's see the length of the dataset
length(iris)
#Output is 5, which means there are 5 variables or columns

length(iris[,3])
#Output is 150, which means there are 150 data points in Petal.Length variable/column

length(iris[1,])
#Output is 5, which means there are data points in first row

#pairs(iris)
#Let's replicate the iris dataset
x=iris

#Display it
x

#x dataset is displayed. It is the same as iris
#calculate the mean of Sepal.Length
mean(x[,1])
#Output : 5.843333
sd(x[,1])
#Output :0.8280661
fivenum(x)
#####Output
#Error in x[floor(d)] + x[ceiling(d)] :
# non-numeric argument to binary operator
#We need to remove the 'Species' column that contains string
x[,-5] #This is how we need to remove a column
#R spits out all columns except 5th column

y=x[,-5] #We create another dataset y from x after removing 5th column(Species)

#Missing data is indicated by NA in R. Till now there is no missing data in y dataset.
x[150,3]=NA #Lets create one
fivenum(y) #fivenum() Returns Tukey's five number summary (minimum, lower-hinge, median, upper-hinge, maximum) for the input data.
#Output: [1] 0.1 1.7 3.2 5.1 7.9
summary(y)
#Output
#Sepal.Length Sepal.Width Petal.Length Petal.Width
#Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
#1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
#Median :5.800 Median :3.000 Median :4.300 Median :1.300
#Mean :5.843 Mean :3.057 Mean :3.749 Mean :1.199
#3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
#Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
# NA's :1

#Pay attention to 3rd column. R points out in the last row that there is a missing value
#Mean of Petal.Length is calculated as 3.749. lets check it.

mean(x[,3]) #The output is: [1] NA. This means the NA value has to be removed while calculating mean
mean(x[,3],na.rm=T) #The output is 3.748993. This same as was shown in summary function's output rounded off

#Let's do some filtering (conditional selection)

x$Sepal.Length[x$Sepal.Length>5] #R displays those values that are greater than 5
#Try
a<-x$Sepal.Length[x$Sepal.Length>7.6]
a #Output: [1] 7.7 7.7 7.7 7.9 7.7
length(a) #Display 5. This means there are 5 cases of x$Sepal.Length>7.6
cumsum(a) #R calculates and diplays the cumulative sum of 'a' ##Output[1] 7.7 15.4 23.1 31.0 38.7

pdf("e:/r/cor.pdf")
plot(x, main="Scatter Plot") #Graphical Output will be written to specified pdf file
pairs(x,main="Scatterplot by pairs function")
dev.off() #Now graphical output will be displayed as before

Test of normality

#######Test of normality

#generate a set of data which we know is normally distributed
x=rnorm(1000,17,3)

#Split the graphics window into 2
#par(mfrow=c(2,1),pty="s")
#Get a visual feel of normality
qqnorm(x)
qqline(x, col="#f00000")

#Following graphs are given for reference while interpreting qqplot

enter image description here

#Above are not tests, just visual feel. Let's do a test
shapiro.test(x)

#############Output####################

# Shapiro-Wilk normality test
#data: x
#W = 0.9983, p-value = 0.4133
#P-value is more than 0.05, so we cannot reject the assumption of normality

#Another test is Anderson-Darling Normality Test.
#Need to load 'nortest' library
library(nortest)

ad.test(x)
############output############
# Anderson-Darling normality test

#data: x
#A = 0.3321, p-value = 0.5109

#P-value is more than 0.05, so we cannot reject the assumption of normality

#Let's generate a dataset that we know is not normal
y=rf(1000,3,17)
#video file: "Assessing Normality in R_x264.mp4"
#Let's check it visually.
qqnorm(y)
qqline(y)

#The plot doesn't lie along a 45 degree line.
#So, the dataset does not appear to be normal. But it is not a test

shapiro.test(y)

#The p-value is much smaller than .05, or even .0001
#So, the assumption of normality has to be rejected. The dataset is not normal
plot(y)