Friday, November 15, 2013

More on Data Selection and Manipulation

###############More on Data Selection and Manipulation########
#Let's take a look at the 'iris' dataset
iris
#R displays the dataset
#Let's see the names of the variables
names(iris)


#Let's look at Petal.Length. You need to prepend the name of the dataset and a $

iris$Petal.Length

#How about Sepal.Length
iris$Sepal.Length

#If you don't want to prepend the name of the dataset and a $ every time, do this
attach(iris)

#Now you can type only the name of the variable
Sepal.Length
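
#A sketch of an alternative, not part of the original walkthrough: with() evaluates
#an expression inside the dataset without attaching it, so no variable names
#are left on the search path afterwards.
with(iris, mean(Sepal.Length))   #Same value as mean(iris$Sepal.Length)
#If you did call attach(iris), detach(iris) undoes it.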

#Let's look at other ways of retrieving the variables
iris[,1]

#R displays Sepal.Length which is the first variable of iris
iris[1,]
#R displays the first row of the dataset
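
#Columns can also be picked by name, and rows by a range. A small sketch
#(the rows and columns chosen here are only an illustration):
iris[1:3, c("Sepal.Length", "Species")]   #First 3 rows, two named columns
iris[["Petal.Length"]]                    #Same vector as iris$Petal.Length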

#Let's see the length of the dataset
length(iris)
#Output is 5, which means there are 5 variables or columns

length(iris[,3])
#Output is 150, which means there are 150 data points in Petal.Length variable/column

length(iris[1,])
#Output is 5, which means there are 5 data points in the first row
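
#For a data frame, dim(), nrow() and ncol() report the same sizes more directly.
#A short sketch:
dim(iris)    #150 5 (rows, then columns)
nrow(iris)   #150
ncol(iris)   #5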

#pairs(iris)
#Let's make a copy of the iris dataset
x=iris

#Display it
x

#x dataset is displayed. It is the same as iris
#calculate the mean of Sepal.Length
mean(x[,1])
#Output : 5.843333
sd(x[,1])
#Output :0.8280661
fivenum(x)
#####Output
#Error in x[floor(d)] + x[ceiling(d)] :
#  non-numeric argument to binary operator
#We need to leave out the 'Species' column, which is not numeric
x[,-5]  #A negative index leaves out that column (x itself is not changed)
#R prints all columns except the 5th column
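
#A sketch of an alternative: leave out the column by name instead of by position,
#which keeps working if the column order changes. 'keep' is an illustrative name.
keep <- names(iris) != "Species"
iris[, keep]                       #All columns except Species
iris[, sapply(iris, is.numeric)]   #Or keep only the numeric columns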

#Missing data is indicated by NA in R. So far there is no missing data in x.
x[150,3]=NA      #Let's create one missing value (row 150, Petal.Length)

y=x[,-5]         #We create another dataset y from x without the 5th column (Species), so y contains the NA too
fivenum(y)       #fivenum() returns Tukey's five-number summary (minimum, lower hinge, median, upper hinge, maximum). Here the values of all four columns are pooled into one summary, and NAs are dropped by default.
#Output: [1] 0.1 1.7 3.2 5.1 7.9
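
#To get one five-number summary per variable, fivenum() can be applied column by column.
#A sketch using sapply() on the numeric columns of iris:
sapply(iris[, -5], fivenum)   #A 5 x 4 matrix: rows are min, lower hinge, median, upper hinge, max
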
summary(y)  
#Output
 #Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 #Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 #1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 #Median :5.800   Median :3.000   Median :4.300   Median :1.300
 #Mean   :5.843   Mean   :3.057   Mean   :3.749   Mean   :1.199
 #3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 #Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
 #                                NA's   :1                  

#Pay attention to the 3rd column. R points out in the last row that there is one missing value
#Mean of Petal.Length is reported as 3.749. Let's check it.

mean(x[,3])              #The output is: [1] NA. The NA value has to be removed before the mean can be calculated
mean(x[,3],na.rm=TRUE)   #The output is 3.748993. This is the value shown in the summary() output, before rounding
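
#To find where values are missing, is.na() can be combined with which() or colSums().
#A sketch on a fresh copy ('z' is an illustrative name) with the same NA as above:
z <- iris
z[150, 3] <- NA
which(is.na(z$Petal.Length))   #150, the row holding the NA
colSums(is.na(z))              #Number of NAs per column: 1 for Petal.Length, 0 elsewhere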

#Let's do some filtering (conditional selection)

x$Sepal.Length[x$Sepal.Length>5]  #R displays those values that are greater than 5
#Try
a<-x$Sepal.Length[x$Sepal.Length>7.6]
a                     #Output: [1] 7.7 7.7 7.7 7.9 7.7
length(a)             #Displays 5. This means there are 5 cases of x$Sepal.Length>7.6
cumsum(a)             #R calculates and displays the cumulative sum of 'a' ##Output: [1]  7.7 15.4 23.1 31.0 38.7
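
#The same kind of condition can select whole rows rather than values of one column.
#A sketch ('big' is an illustrative name):
big <- iris[iris$Sepal.Length > 7.6, ]              #Rows where Sepal.Length exceeds 7.6
big
subset(iris, Sepal.Length > 7.6 & Petal.Width > 2)  #Conditions can be combined with &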

pdf("e:/r/cor.pdf")
plot(x, main="Scatter Plot")  #Graphical Output will be written to specified pdf file
pairs(x,main="Scatterplot by pairs function")
dev.off()             #Closes the pdf file. After this, graphical output is displayed on screen as before
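
#The path above assumes a drive e: with a folder r. A portable sketch that writes
#into R's temporary directory instead ('out' is an illustrative name):
out <- file.path(tempdir(), "cor.pdf")
pdf(out)
pairs(iris[, -5], main="Scatterplot by pairs function")
dev.off()
file.exists(out)      #TRUE if the file was written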
