dataminer: More on Data Selection and Manipulation

###############More on Data Selection and Manipulation########
#Let's take a look at the 'iris' dataset
iris
#R displays the dataset
#Let's see the names of the variables
names(iris)

#Let's look at Petal.Length. You need to prepend the name of the dataset and a $

iris$Petal.Length

#How about Sepal.Length
iris$Sepal.Length

#If you don't want to prepend the name of the dataset and a $ every time, do this
attach(iris)

#Now you can type only the name of the variable
Sepal.Length
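
#attach() keeps the dataset on the search path until it is detached, so attached
#names can clash with other objects. As a minimal sketch of an alternative, with()
#gives the same short names for a single expression only. The name mean_sepal is
#just an illustrative choice.

```r
# with() evaluates an expression using the columns of a data frame,
# without attaching the data frame to the search path
mean_sepal <- with(iris, mean(Sepal.Length))
mean_sepal

# If a dataset was attached earlier, detach(iris) removes it again
```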

#Let's look at other ways of retrieving the variables
iris[,1]
#R displays Sepal.Length, which is the first variable of iris

iris[1,]
#R displays the first row of the dataset
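
#The [row, column] notation also accepts ranges and column names. The sketch below
#shows a few such selections; the object names first_rows and petal are
#illustrative only.

```r
# Rows 1 to 3, and two columns chosen by name
first_rows <- iris[1:3, c("Sepal.Length", "Species")]
first_rows

# Double brackets with a name return one column as a vector,
# the same result as iris$Petal.Length
petal <- iris[["Petal.Length"]]
head(petal)
```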

#Let's see the length of the dataset
length(iris)
#Output is 5, which means there are 5 variables or columns

length(iris[,3])
#Output is 150, which means there are 150 data points in Petal.Length variable/column

length(iris[1,])
#Output is 5, which means there are 5 data points in the first row
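
#length() on a data frame counts columns, which is easy to misread. As a small
#sketch, the base R functions nrow(), ncol() and dim() state the intent more
#directly. The names n_rows, n_cols and dims are illustrative.

```r
n_rows <- nrow(iris)  # number of observations (rows)
n_cols <- ncol(iris)  # number of variables (columns)
dims <- dim(iris)     # both at once: rows first, then columns
dims
```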

#pairs(iris)
#Let's make a copy of the iris dataset
x=iris

#Display it
x

#The x dataset is displayed. It is the same as iris
#Calculate the mean and the standard deviation of Sepal.Length
mean(x[,1])
#Output : 5.843333
sd(x[,1])
#Output :0.8280661
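
#The same statistics can be computed for every numeric column in one step. This
#is a sketch that assumes the first four columns of iris are the numeric ones;
#num, col_means and col_sds are illustrative names.

```r
num <- iris[, 1:4]           # the four numeric measurement columns
col_means <- colMeans(num)   # mean of each column
col_sds <- sapply(num, sd)   # standard deviation of each column
col_means
col_sds
```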
fivenum(x)
#####Output
#Error in x[floor(d)] + x[ceiling(d)] :
# non-numeric argument to binary operator
#The error occurs because the 'Species' column is a factor, not a number. We need to remove it
x[,-5] #A negative index drops that column
#R displays all columns except the 5th column

y=x[,-5] #We create another dataset y from x after removing the 5th column (Species)
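
#Dropping a column by position works only while Species stays in the 5th place.
#As a hedged sketch, the numeric columns can also be picked by type, so the
#position does not matter. The name is_num is illustrative.

```r
x <- iris
is_num <- sapply(x, is.numeric)  # TRUE for numeric columns, FALSE for Species
y <- x[, is_num]                 # keep only the numeric columns
names(y)
```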

#Missing data is indicated by NA in R. So far there is no missing data in x or y.
x[150,3]=NA #Let's create one missing value in x
y=x[,-5] #Rebuild y so that it contains the missing value too
fivenum(y) #fivenum() returns Tukey's five-number summary (minimum, lower hinge, median, upper hinge, maximum). For a data frame it pools the values of all columns and drops NA
#Output: [1] 0.1 1.7 3.2 5.1 7.9
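
#In the fivenum(y) output above, the minimum 0.1 and the maximum 7.9 come from
#different columns, because the values of all columns are summarised together.
#A per-column summary can be sketched with sapply(), shown here on a fresh copy
#of the numeric iris columns. The name per_col is illustrative.

```r
y <- iris[, -5]                # numeric columns only
per_col <- sapply(y, fivenum)  # one five-number summary per column
per_col                        # 5 rows (min, lower hinge, median, upper hinge, max) by 4 columns
```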
summary(y)
#Output
#Sepal.Length Sepal.Width Petal.Length Petal.Width
#Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
#1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
#Median :5.800 Median :3.000 Median :4.300 Median :1.300
#Mean :5.843 Mean :3.057 Mean :3.749 Mean :1.199
#3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
#Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
# NA's :1

#Pay attention to the 3rd column. R points out in the last row that there is a missing value
#The mean of Petal.Length is calculated as 3.749. Let's check it.

mean(x[,3]) #The output is: [1] NA. The NA value has to be removed before the mean can be calculated
mean(x[,3],na.rm=TRUE) #The output is 3.748993. This is the same value shown in the summary() output, rounded to 3.749
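
#Before removing NA values it helps to know where they are. The sketch below
#recreates the missing value on a fresh copy of iris, counts it, and keeps only
#the complete rows. The names n_missing and x_complete are illustrative.

```r
x <- iris
x[150, 3] <- NA                           # same missing value as in the walkthrough

n_missing <- sum(is.na(x$Petal.Length))   # number of NA values in one column
colSums(is.na(x))                         # number of NA values in every column

x_complete <- x[complete.cases(x), ]      # rows that contain no NA at all
nrow(x_complete)
```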

#Let's do some filtering (conditional selection)

x$Sepal.Length[x$Sepal.Length>5] #R displays the values of Sepal.Length that are greater than 5
#Try a stricter condition and store the result
a<-x$Sepal.Length[x$Sepal.Length>7.6]
a #Output: [1] 7.7 7.7 7.7 7.9 7.7
length(a) #Displays 5. This means there are 5 cases of x$Sepal.Length>7.6
cumsum(a) #R calculates and displays the cumulative sum of 'a' ##Output: [1] 7.7 15.4 23.1 31.0 38.7
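
#The same kind of condition can select whole rows instead of single values.
#As a sketch, the condition goes in the row position of [ , ], or into subset().
#The names big and big_virginica are illustrative.

```r
# Keep every column for the rows where Sepal.Length exceeds 7.6
big <- iris[iris$Sepal.Length > 7.6, ]
nrow(big)

# subset() combines conditions with & and can pick columns with select
big_virginica <- subset(iris,
                        Sepal.Length > 7.6 & Species == "virginica",
                        select = c(Sepal.Length, Species))
big_virginica
```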

pdf("e:/r/cor.pdf") #Opens a PDF graphics device. The folder e:/r must already exist
plot(x, main="Scatter Plot") #Graphical output is written to the specified PDF file
pairs(x,main="Scatterplot by pairs function")
dev.off() #Closes the PDF file. Subsequent graphical output is displayed on screen as before
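
#A fixed path such as e:/r works only on a machine that has that folder. As a
#hedged sketch, the file can be written to R's temporary directory instead.
#The name out_file and the choice of tempdir() are assumptions for illustration.

```r
out_file <- file.path(tempdir(), "cor.pdf")  # illustrative output location
pdf(out_file)                                # open the PDF device
pairs(iris[, -5], main = "Scatterplot by pairs function")
dev.off()                                    # close the device so the file is complete
file.exists(out_file)
```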

Friday, November 15, 2013
