dataminer: 2013

Saturday, November 16, 2013

One Way ANOVA

We perform a t-test to check whether two samples have the same mean. More specifically, we test the null hypothesis that there is no difference between the means.

Similarly, ANOVA tells us whether the means of three or more samples are the same. ANOVA is an omnibus test: it tells us whether the means are all the same or not, but when they are not, it does not tell us which means differ.

In this post we will perform an ANOVA test on a dataset and find out where the difference lies.

Download the data here
setwd("e:/r") #I have kept the data file in this location
d <- read.csv("labs.csv")
d
boxplot(d)
s=stack(d) #This step is needed to prepare the data.
s #See how the arrangement of the data has changed
names(s) = c("measure","lab")
s
diff <- aov( measure~lab, data=s) #response variable (measure) comes first
summary(diff)

#We reject the assumption of no difference because the p-value suggests that
#there is a significant difference across the 3 labs.
#If there were no difference, the investigation would have ended here.

#As there is a significant difference in this case, we need to find out where
#the difference lies, so we perform a pairwise comparison test. We have many
#options for this. One of them is Tukey's HSD test, which gives us intervals.
#The test is designed for a balanced design, in other words, the number of data
#points for each lab must be the same (R's TukeyHSD() also adjusts for mildly
#unbalanced designs). In our case we have a balanced design.
#Run the test
tk <- TukeyHSD(diff)

#Output
# Tukey multiple comparisons of means
# 95% family-wise confidence level

#Fit: aov(formula = measure ~ lab, data = s)

#$lab
# diff lwr upr p adj
#lab2-lab1 1.75 -1.8842297 5.38423 0.4584177
#lab3-lab1 6.25 2.6157703 9.88423 0.0008182
#lab3-lab2 4.50 0.8657703 8.13423 0.0137160

#Each of the last 3 rows contains pairwise comparison results.
#Look at row 1: the diff column shows that the difference between the means of lab2 and lab1 is 1.75
#lwr column provides the lower limit of the difference at 95% confidence level
#upr column provides the upper limit of the difference at 95% confidence level
#Last column: p adj gives the p-value 0.4584177; we cannot reject the assumption of no difference between lab2 and lab1
#Rows 2 and 3 indicate (p-values less than 0.05) that there are significant differences between lab3 & lab1, and lab3 & lab2
#Let's take a look at the plot

plot(tk)

#Take a look at the plot. The dotted line represents zero. Zero is within the limits of the 95% confidence interval of the difference
#between lab2 and lab1, indicating that there is no significant difference between lab2 and lab1.
#But there are significant differences between lab3 & lab1, and lab3 & lab2.
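Since labs.csv is not reproduced here, the whole workflow can be sketched on made-up numbers. The values below are hypothetical (they are not the data from this post), and the object names wide, long and fit are my own; the calls themselves (stack, aov, TukeyHSD) are the ones used above.

```r
# Hypothetical measurements from three labs, four readings each (balanced design)
wide <- data.frame(lab1 = c(10.1, 11.3, 9.8, 10.6),
                   lab2 = c(12.0, 11.5, 12.8, 11.9),
                   lab3 = c(16.2, 17.1, 15.8, 16.9))

# stack() turns the wide table into one value column and one group column
long <- stack(wide)
names(long) <- c("measure", "lab")

# One-way ANOVA: response on the left of ~, grouping factor on the right
fit <- aov(measure ~ lab, data = long)
print(summary(fit))

# Tukey's HSD returns one row per pair of labs
tk <- TukeyHSD(fit)
print(tk$lab)

# Names of the pairs whose adjusted p-value is below 0.05
sig <- rownames(tk$lab)[tk$lab[, "p adj"] < 0.05]
print(sig)
```

Pulling the pairs out of tk$lab this way avoids reading the printed table by eye when there are many groups.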

Friday, November 15, 2013

More on Data Selection and Manipulation

###############More on Data Selection and Manipulation########
#Let's take a look at 'iris' dataset
iris
#R displays the dataset
#Lets see the name of variables
names(iris)

#Let's look at Petal.Length. You need to prepend the name of the dataset and a $

iris$Petal.Length

#How about Sepal.Length
iris$Sepal.Length

#If you don't want to prepend the name of the dataset and a $ every time, do this
attach(iris)

#Now you can type only the name of the variable
Sepal.Length

#Let's look at other ways of retrieving the variables
iris[,1]

#R displays Sepal.Length which is the first variable of iris
iris[1,]
#R displays the first row of the dataset

#Let's see the length of the dataset
length(iris)
#Output is 5, which means there are 5 variables or columns

length(iris[,3])
#Output is 150, which means there are 150 data points in Petal.Length variable/column

length(iris[1,])
#Output is 5, which means there are 5 data points in the first row

#pairs(iris)
#Let's replicate the iris dataset
x=iris

#Display it
x

#x dataset is displayed. It is the same as iris
#calculate the mean of Sepal.Length
mean(x[,1])
#Output : 5.843333
sd(x[,1])
#Output :0.8280661
fivenum(x)
#####Output
#Error in x[floor(d)] + x[ceiling(d)] :
# non-numeric argument to binary operator
#We need to remove the 'Species' column that contains string
x[,-5] #This is how we need to remove a column
#R spits out all columns except 5th column

#Missing data is indicated by NA in R. So far there is no missing data in x.
x[150,3]=NA #Let's create one
y=x[,-5] #We create another dataset y from x after removing the 5th column (Species); y inherits the NA
fivenum(y) #fivenum() returns Tukey's five number summary (minimum, lower-hinge, median, upper-hinge, maximum) for the input data.
#Output: [1] 0.1 1.7 3.2 5.1 7.9
summary(y)
#Output
#Sepal.Length Sepal.Width Petal.Length Petal.Width
#Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
#1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
#Median :5.800 Median :3.000 Median :4.300 Median :1.300
#Mean :5.843 Mean :3.057 Mean :3.749 Mean :1.199
#3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
#Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
# NA's :1

#Pay attention to 3rd column. R points out in the last row that there is a missing value
#Mean of Petal.Length is calculated as 3.749. lets check it.

mean(x[,3]) #The output is: [1] NA. This means the NA value has to be removed while calculating mean
mean(x[,3],na.rm=T) #The output is 3.748993. This is the same value shown in summary()'s output, rounded off

#Let's do some filtering (conditional selection)

x$Sepal.Length[x$Sepal.Length>5] #R displays those values that are greater than 5
#Try
a<-x$Sepal.Length[x$Sepal.Length>7.6]
a #Output: [1] 7.7 7.7 7.7 7.9 7.7
length(a) #Display 5. This means there are 5 cases of x$Sepal.Length>7.6
cumsum(a) #R calculates and displays the cumulative sum of 'a' ##Output: [1] 7.7 15.4 23.1 31.0 38.7

pdf("e:/r/cor.pdf")
plot(x, main="Scatter Plot") #Graphical Output will be written to specified pdf file
pairs(x,main="Scatterplot by pairs function")
dev.off() #Now graphical output will be displayed as before
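fivenum() is meant for a single numeric vector, which is why fivenum(x) failed above while x still contained the Species column. A small sketch, assuming you want one five-number summary per numeric column; the use of sapply() and the names num and five are my additions, not part of the original post:

```r
# Keep only the numeric columns of iris (this drops Species)
num <- iris[, sapply(iris, is.numeric)]

# Introduce one missing value, as in the post
num[150, "Petal.Length"] <- NA

# Apply fivenum() column by column; it drops NAs by default (na.rm = TRUE)
five <- sapply(num, fivenum)
rownames(five) <- c("min", "lower-hinge", "median", "upper-hinge", "max")
print(five)

# Column means, ignoring the missing value
col_means <- colMeans(num, na.rm = TRUE)
print(col_means)
```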

Test of normality

#######Test of normality

#generate a set of data which we know is normally distributed
x=rnorm(1000,17,3)

#Split the graphics window into 2
#par(mfrow=c(2,1),pty="s")
#Get a visual feel of normality
qqnorm(x)
qqline(x, col="#f00000")

#Following graphs are given for reference while interpreting qqplot


#Above are not tests, just visual feel. Let's do a test
shapiro.test(x)

#############Output####################

# Shapiro-Wilk normality test
#data: x
#W = 0.9983, p-value = 0.4133
#P-value is more than 0.05, so we cannot reject the assumption of normality

#Another test is Anderson-Darling Normality Test.
#Need to load 'nortest' library
library(nortest)

ad.test(x)
############output############
# Anderson-Darling normality test

#data: x
#A = 0.3321, p-value = 0.5109

#P-value is more than 0.05, so we cannot reject the assumption of normality

#Let's generate a dataset that we know is not normal
y=rf(1000,3,17)
#video file: "Assessing Normality in R_x264.mp4"
#Let's check it visually.
qqnorm(y)
qqline(y)

#The points don't fall along the reference line drawn by qqline().
#So, the dataset does not appear to be normal. But it is not a test

shapiro.test(y)

#The p-value is much smaller than .05, or even .0001
#So, the assumption of normality has to be rejected. The dataset is not normal
plot(y)
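The two checks above can be put side by side. This is only a sketch: the seed value and the names normal_sample, skewed_sample, res_normal and res_skewed are arbitrary choices of mine, and the exact p-values will differ from the ones printed in the post.

```r
set.seed(42)  # arbitrary seed so that the run is repeatable

normal_sample <- rnorm(1000, mean = 17, sd = 3)   # normal by construction
skewed_sample <- rf(1000, df1 = 3, df2 = 17)      # F distribution, right-skewed

res_normal <- shapiro.test(normal_sample)
res_skewed <- shapiro.test(skewed_sample)

# shapiro.test() returns a list; the p-value can be pulled out directly
print(res_normal$p.value)
print(res_skewed$p.value)

# Decision at the 5% level for the skewed sample
print(res_skewed$p.value < 0.05)
```

Working with the p.value element is handy when the test has to be run on many variables in a loop.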

Sunday, October 27, 2013

Nice tutorials to discover R

Nice tutorials to discover R http://t.co/ckBJskmpvK via @rbloggers
— Dilir Akhtar Khan (@dilirkhan) October 27, 2013

Normalize Data in R (Calculate Z scores)

scale() function is used to create Z scores (normalize) in R.

To calculate Z score of a variable, we subtract the mean of all data points from each individual data point and divide the result by standard deviation of the variable. scale() does this in one simple call.

In R console, type

> x = c(2,4,6,8)

This creates a variable x.

To center the variable (subtract the mean from each data point) and scale it (divide by the standard deviation) in one call:

> x = scale(x, center = TRUE, scale = TRUE) # scale = FALSE would only center; it would not divide each data point by the standard deviation

> x

[,1]
[1,] -1.1618950
[2,] -0.3872983
[3,] 0.3872983
[4,] 1.1618950
attr(,"scaled:center")
[1] 5
attr(,"scaled:scale")
[1] 2.581989
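To see that scale() really does what the description says, the Z scores can be computed by hand and compared. A minimal sketch; the names z_manual and z_scale are mine:

```r
x <- c(2, 4, 6, 8)

# Z score by hand: subtract the mean, then divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix with attributes; as.vector() strips them
z_scale <- as.vector(scale(x))

print(z_manual)
print(all.equal(z_manual, z_scale))
```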

Tuesday, October 22, 2013

T-Test in R

The file 98.6 t-test.xlsx needs to be converted to .csv before it can be read into R.

http://ww2.coastal.edu/kingw/statistics/R-tutorials/singlesample.html

normtmp=read.csv("e:/r/98.6 t-test.csv",header=TRUE)
qqnorm(normtmp$tmp)
qqline(normtmp$tmp)
plot(density(normtmp$tmp))
shapiro.test(normtmp$tmp)
t.test(normtmp$tmp, mu=98.6, conf.level=.99, alternative="two.sided")
# output not shown

#Note: setting the alternative to "two.sided" was unnecessary, since that is the default.
#We can now reject the null at any reasonable alpha level we might have chosen!
#From the sample, we might estimate the mean human body temperature to be 98.25 degrees (sample mean on the last line of output).
#The 99% confidence interval for the population mean runs from 98.08111 to 98.41735 degrees.
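If you do not have the temperature file at hand, the same call can be tried on simulated data. Everything about the sample below is an assumption made for illustration (the seed, the sample size of 130, the mean and the standard deviation); only the t.test() call mirrors the post.

```r
set.seed(1)  # arbitrary seed for repeatability

# Hypothetical body temperatures; NOT the data from the csv file
tmp <- rnorm(130, mean = 98.25, sd = 0.73)

res <- t.test(tmp, mu = 98.6, conf.level = 0.99)

# The pieces of the printed output are available as list elements
print(res$estimate)   # sample mean
print(res$conf.int)   # lower and upper limit of the 99% interval
print(res$p.value)
```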

Friday, October 11, 2013

Different Types of Plots in R

To get the data set, click this link: Friends Data from Carnegie Mellon University. The data will be downloaded to your computer. Double-click the downloaded file. A new session of R will start and the data will be loaded in a variable named friends.

To take a look at the data, type:
> friends

Create a table:
> t <- table(friends)

see the table:

> t

friends
No difference Opposite sex Same sex
602 434 164

> barplot(t)

Output:

> barplot(t, horiz=T)

Try
> barplot(t, horiz=T, main="Friends Distribution", ylab="Make Friends With", col="darkblue")

For more examples, check: http://www.statmethods.net/graphs/bar.html
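If the downloaded file is not available, the table printed above can be rebuilt from its counts. A sketch under that assumption; the names counts, tab and props are mine, and the plot is only drawn in an interactive session:

```r
# Counts copied from the table shown above
counts <- c("No difference" = 602, "Opposite sex" = 434, "Same sex" = 164)
tab <- as.table(counts)

# prop.table() converts counts to proportions that sum to 1
props <- prop.table(tab)
print(round(100 * props, 1))   # percentages

# Bar plot of the proportions instead of the raw counts
if (interactive()) {
  barplot(props, main = "Friends Distribution", ylab = "Proportion", col = "darkblue")
}
```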

Pie Chart
------------
> pie(t)

To create 3D pie chart:

> install.packages("plotrix")

>library(plotrix)

>pie3D(t, explode=.1)

Saturday, October 5, 2013

Chi Square Test

Copy the following data in a text editor, add a blank line at the end and save as chisq.csv.

Heart Rate Increased, No Heart Rate Increase
Treated, 36,14
Not Treated, 30, 25

For details on the data,visit http://math.hws.edu/javamath/ryan/ChiSquare.html

What we are trying to do here is to test the effect of a drug.
Ho: The proportion of animals whose heart rate increased is independent of drug treatment.
Ha: The proportion of animals whose heart rate increased is associated with drug treatment.

Read the data into R:
> x <- read.csv("e:/r/chisq.csv")

If you didn't enter a line at the end of the file, you are likely to get the following warning:

Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'Chi_Square.csv'

However, lets run the test:

> chisq.test(x, correct=F)

Output:

Pearson's Chi-squared test

data: x

X-squared = 3.4177, df = 1, p-value = 0.0645

Look at the p-value.

The p-value of 0.0645 is greater than the conventional significance level of 0.05, so we fail to reject the null hypothesis. In other words, there is no statistically significant association between drug treatment and an increase in heart rate.
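The csv step can be skipped by typing the 2 x 2 table straight into R. A minimal sketch with the same four counts; the object names x and res are mine:

```r
# Same counts as in chisq.csv, entered row by row
x <- matrix(c(36, 14,
              30, 25),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("Treated", "Not Treated"),
                            c("Heart Rate Increased", "No Heart Rate Increase")))

# correct = FALSE switches off Yates' continuity correction, as in the post
res <- chisq.test(x, correct = FALSE)

print(res$statistic)  # X-squared
print(res$p.value)
print(res$expected)   # counts expected if treatment and response were independent
```

Looking at res$expected is a quick way to check that no expected count is very small, which is the usual rule of thumb for this test.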

Friday, October 4, 2013

Notes

Discrete data arise from a counting process, while continuous data arise from a measuring process.

Chi square tests can only be used on actual numbers and not on percentages, proportions, means, etc.

Wednesday, October 2, 2013

R Video Link

http://www.twotorials.com/

Saturday, September 21, 2013

Topic 3: Objects, Data Types etc.

There are two aspects of R language which we need to understand: objects and functions.

Object

An Object can be thought of as a storage space for an associated name, for example:

> x <- 916
Here we have created an object which has stored the value 916. "<-" is the assignment operator in R. It is good to remember that everything is stored as an object in R.

Type x at R prompt:

> x

You get the following output,
[1] 916

The 1 within square brackets tells us that this is the first element in the x object (in this case the only element). As we shall see, an object can contain several elements, and then the numbers within square brackets become helpful.

Function
A function is a special type of R object designed to carry out some operation. A function usually takes some arguments and produces a result by executing a set of operations. R comes with a set of functions for our use, but we can also create our own functions.

You can take a look at what objects are available in the current R session by typing:
> ls()

Since we have created one object, x, we get the following output from R:
[1] "x"

Objects you create stay in memory until you delete them. You can delete an object to free up memory with:
> rm(x)

Now type ls() to see the list of objects again. R outputs the following:

character(0)

Object names may consist of any upper- and lower-case letters, the digits 0 to 9 (except at the beginning of the name), and also the period, ".", which behaves like a letter. Note that names in R are case sensitive, meaning that Color and color are two distinct objects. This is a frequent cause of frustration for beginners who keep getting "object not found" errors. If you face this type of error, start by checking the spelling and case of the name of the object causing the error.

The most basic data object in R is a vector. When we create the object x, we created a vector with the value 916. Every object has a length and a mode.

The mode tells you the kind of data stored in the object. Vectors are used to store a set of elements of the same atomic data type. The main atomic types are character, logical, numeric, and complex. Hence, you may have vectors of characters, logical values (TRUE or FALSE, abbreviated T or F), numbers, and complex numbers.

Let's create another vector (object):
> y <- 1:10
> y
Output: [1] 1 2 3 4 5 6 7 8 9 10
> length(y)
Output: [1] 10
> mode(y)
Output: [1] "numeric"

All elements of a vector must be of same mode. Meaning, all elements must be of same type. Try the following:
> y <- c(1:5,"Hello")
> y
[1] "1" "2" "3" "4" "5" "Hello"

> mode(y)
[1] "character"

> length(y)
[1] 6

First of all, we created the vector y using the c() function, which combines its arguments into a vector. Within the c() function we used 1:5, which is just an alternative to typing 1,2,3,4,5.
Then we added another argument, "Hello" (character type). When we printed the elements of y, all the elements appeared within double quotes. In the next line we checked the mode of the vector y and got "character". R has used type coercion: since we provided a character element, it converted all numeric elements to character type so that every element of the vector has the same mode.

Point to remember: All elements of a vector must be of same mode.
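Coercion follows a fixed order: logical gives way to numeric, and numeric gives way to character. A short sketch of both steps, plus the way back with as.numeric(); the vector names are mine:

```r
# Logical values become numbers when mixed with numbers (TRUE -> 1, FALSE -> 0)
v1 <- c(1, TRUE, FALSE)
print(mode(v1))
print(v1)

# A single character element turns everything into character
v2 <- c(1, TRUE, "Hello")
print(mode(v2))
print(v2)

# Converting back: text that is not a number becomes NA (R also issues a warning)
back <- suppressWarnings(as.numeric(c("1", "2", "Hello")))
print(back)
```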

We can refer to elements of a vector in the following way:
> y[1]
Output: [1] "1"

We can change elements in the following way:
> y[1] = "New Value"
> y
Output: [1] "New Value" "2" "3" "4" "5" "Hello"

As expected, the first element of the vector y has been changed to "New Value". You might have noticed that we have changed the assignment operator to "=" which just works fine. This time let's change the value the old way:

> y[2] <- "Another Value"
> y
Output: [1] "New Value" "Another Value" "3" "4" "5" "Hello"

We can perform all sorts of operations on vectors.

> vect = 1:10
> vect
Output: [1] 1 2 3 4 5 6 7 8 9 10
> vect = vect + 2
> vect
Output: [1] 3 4 5 6 7 8 9 10 11 12
You can see that every element has been incremented by 2.

> vect = vect * 2
> vect
Output: [1] 6 8 10 12 14 16 18 20 22 24

Every element of the vector has been multiplied by 2.

> vect = 1:10
> vect = sqrt(vect)
> vect
Output: [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
[9] 3.000000 3.162278

In the last example we reset "vect" to 1:10, ran the square root operation on all of its elements, and assigned the results back to "vect".

Similarly, we can add 2 vectors:

> x = 1:10
> y = 11:20
> z = x + y
> z
Output: [1] 12 14 16 18 20 22 24 26 28 30

Missing Values:
Missing value is represented by NA in R

>x=1:10

Let's replace 9th element with NA
>x[9]=NA

>sum(x)
[1] NA

We need a way to tackle this by performing sum function excluding the NA
> sum(x, na.rm=T)
[1] 46

Try mean
> mean(x, na.rm=T)
[1] 5.111111

You can see the total has been divided by 9. This is what you want.
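Before dropping missing values it helps to know where they are and how many there are. A small sketch using is.na(); the name n_valid is mine:

```r
x <- 1:10
x[9] <- NA

# is.na() flags each element; which() gives the positions of the flagged ones
print(is.na(x))
print(which(is.na(x)))

# Number of usable (non-missing) values
n_valid <- sum(!is.na(x))
print(n_valid)

# The mean with na.rm = TRUE is the sum of valid values divided by n_valid
print(sum(x, na.rm = TRUE) / n_valid)
```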

Factor
Another important aspect of R is the factor. This provides a convenient way of handling categorical (nominal) variables. Factors have levels that determine the possible values the variable can take. Let's see an example:

> coffee = c("cold", "right", "hot", "hot", "right", "cold")
> factor(coffee)

Output: [1] cold right hot hot right cold
Levels: cold hot right

> table(factor(coffee))

Output: cold hot right
2 2 2

We can use gl() function to generate sequences involving factors:
> lab=gl(3, 5, labels = c("child", "adult", "old"))
> lab
[1] child child child child child adult adult adult adult adult old old old old old
Levels: child adult old

gl(n, k, length = n*k, labels = 1:n, ordered = FALSE)

Arguments

`n`	an integer giving the number of levels.
`k`	an integer giving the number of replications.
`length`	an integer giving the length of the result.
`labels`	an optional vector of labels for the resulting factor levels.
`ordered`	a logical indicating whether the result should be ordered or not.

> table(lab)
lab
child adult old
5 5 5
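Notice that factor(coffee) sorted the levels alphabetically (cold, hot, right), while gl() kept the order of the labels. If the order matters, it can be set explicitly with the levels argument. A sketch; the chosen order cold, right, hot is just an example:

```r
coffee <- c("cold", "right", "hot", "hot", "right", "cold")

# Set the level order explicitly instead of accepting alphabetical order
f <- factor(coffee, levels = c("cold", "right", "hot"))
print(levels(f))
print(table(f))

# Internally a factor stores integer codes that point at the levels
print(as.integer(f))
```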

Generating Random Numbers
R has several functions that can be used to generate random sequences according to different probability density functions. The functions have the generic structure rfunc(n, par1, par2, ...), where func is the name of the probability distribution, n is the number of data points to generate, and par1, par2, ... are the values of any parameters the density function requires. For instance, if you want five randomly generated numbers from a normal distribution with zero mean and unit standard deviation, type:

> rnorm(5)
[1] -0.4335585 -0.1092160 0.1082784 -0.5065135 -0.5878001

> rt(5, df=7)

[1] -1.07614048 -0.02142847 0.88955231 -1.42091564 1.29517603
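The numbers change on every call. When you need the same "random" numbers again, for example to share a result, set the seed first. A sketch; the seed 123 is an arbitrary choice:

```r
# Same seed, same numbers
set.seed(123)
a <- rnorm(5)
set.seed(123)
b <- rnorm(5)
print(identical(a, b))

# The same rfunc(n, par1, par2, ...) pattern for the uniform distribution
u <- runif(3, min = 0, max = 10)
print(u)
```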

Indexes
Will come back later

Arrays and Matrices
> mat = 1:20
> mat=matrix(mat,4,5)
> mat
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20

> mat=1:20
> mat = matrix(mat,2,10)
> mat
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 3 5 7 9 11 13 15 17 19
[2,] 2 4 6 8 10 12 14 16 18 20
> rownames(mat)<-c("one","two")

> rownames(mat)
[1] "one" "two"

> mat["one",]
[1] 1 3 5 7 9 11 13 15 17 19

The same can be achieved with following command:
> mat[1,]
[1] 1 3 5 7 9 11 13 15 17 19

What if you want the 6th column?

> mat[,6]

one two
11 12

Arrays are similar to matrices, but an array can have more than 2 dimensions.
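A quick sketch of a 3-dimensional array built with array(); the dimensions 2 x 3 x 4 are just an example:

```r
# 24 numbers arranged as 2 rows, 3 columns and 4 "layers"
arr <- array(1:24, dim = c(2, 3, 4))
print(dim(arr))

# One index per dimension: row, column, layer
print(arr[1, 1, 2])
print(arr[2, 3, 4])

# Leaving the first two indexes empty returns a whole layer as a matrix
print(arr[, , 1])
```

As with matrices, the values are filled column by column, layer after layer.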

List

List elements need not be of same mode or length.

> student = list(id=67,

+ name='Jamal',

+ marks = c(77,88,99)

+ )

Output:
> student
$id
[1] 67

$name
[1] "Jamal"

$marks
[1] 77 88 99

You may check the mode:
> mode(student)
Output:
[1] "list"

You may extract individual elements by using [n] notation where n is the subscript.
> student[1]
$id
[1] 67

R returns a list which is a sub-list of the list object 'student'. You can verify this:
> mode(student[1])
[1] "list"

In order to extract the value of 'id':
> student[[1]]
[1] 67

You have to use double square brackets along with the element subscript to extract the value.

You can verify the mode:
> mode(student[[1]])
[1] "numeric"

Try these:
> student[[2]]
[1] "Jamal"
> mode(student[[2]])
[1] "character"
> student[[3]]
[1] 77 88 99

> mode(student[[3]])

[1] "numeric"
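Since the elements of this list have names, they can also be reached by name, which is usually easier to read than a position. A sketch; the extra element passed is a made-up addition:

```r
student <- list(id = 67, name = "Jamal", marks = c(77, 88, 99))

# $ and [["..."]] both return the value itself, like [[n]]
print(student$name)
print(student[["marks"]])

# The extracted value can be used directly in a function call
print(mean(student$marks))

# Assigning to a new name adds an element (hypothetical field)
student$passed <- TRUE
print(names(student))
```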

Dataframe

Data frame is a versatile data object in R. Data frame object is like a spreadsheet. Each column of the data frame is a vector. All data elements in the column must be of the same mode. However, different vectors can be of different modes. All vectors in a data frame must be of the same length.

Create a dataframe:
> myset = data.frame(id = c(916,917,918), names = c("Dilir","Tr","Arif"))
> myset
id names
1 916 Dilir
2 917 Tr
3 918 Arif

You can refer to a variable in the following manner (the Levels line appears because versions of R before 4.0.0 convert strings to factors by default):
> myset$names
[1] Dilir Tr Arif
Levels: Arif Dilir Tr

You can use the table function as well.
> table(myset$names)

Arif Dilir Tr
1 1 1

> myset$id
[1] 916 917 918
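A few more things to try on the same data frame. This is a sketch: stringsAsFactors = TRUE is passed explicitly so that the names column is a factor on newer versions of R as well, and the name later is mine.

```r
myset <- data.frame(id = c(916, 917, 918),
                    names = c("Dilir", "Tr", "Arif"),
                    stringsAsFactors = TRUE)

# str() shows the mode of every column at a glance
str(myset)

# Rows can be filtered with a logical condition, columns are kept with the empty index
later <- myset[myset$id > 916, ]
print(later)

print(nrow(myset))
print(levels(myset$names))
```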

Topic 2: Help and Finding Resources on R

To search for a topic, type:
> ? plot
This will bring up help topic on plot function.

If you are connected to the Internet, you can use the RSiteSearch() function that searches for key words or
phrases in the mailing list archives, R manuals, and help pages. For example, type the following:

> RSiteSearch('decision tree')

This will bring up tons of resources on Decision Tree.

Another place to look for help is: http://www.rseek.org

Customizing R

At the R command prompt, type
> options(prompt = "R> ", continue = " ")

This does 2 things: it changes the prompt to R>, and the second and subsequent lines of a multi-line command no longer start with the + symbol.

Decision Tree with R

For decision tree analysis we will use iris data set.

We need ctree() function to perform the analysis. This function is available in package "party".

Fire up R and check if "party" is available on your system. type:
> library("party")

If you don't get the following error, you are good to go:
Error in library("party") : there is no package called ‘party’
Else, you have to install party package by typing:
> install.packages("party")

This may take few minutes. Once done type the following at R prompt:
> library("party")

Then
> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)
> iris_ctree
You will get the following output:
inference tree with 4 terminal nodes

Response: Species
Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations: 150

1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
2)* weights = 50
1) Petal.Length > 1.9
3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
5)* weights = 46
4) Petal.Length > 4.8
6)* weights = 8
3) Petal.Width > 1.7
7)* weights = 46

Type:
> plot(iris_ctree)

You get the following output:

Again type:
> plot(iris_ctree, type="simple")

You get the following output:
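To judge how well the tree classifies the flowers, its predictions can be compared with the actual species. A sketch that assumes the party package is installed and only runs if it is; the names fit, pred and confusion are mine:

```r
if (requireNamespace("party", quietly = TRUE)) {
  suppressPackageStartupMessages(library(party))

  # Species ~ . uses all remaining columns as inputs, same model as above
  fit <- ctree(Species ~ ., data = iris)

  # Predicted class for every row of the training data
  pred <- predict(fit)

  # Rows: predicted species, columns: actual species
  confusion <- table(predicted = pred, actual = iris$Species)
  print(confusion)

  # Share of flowers classified correctly
  print(mean(pred == iris$Species))
} else {
  message("Package 'party' is not installed; run install.packages(\"party\") first.")
}
```

Keep in mind that this measures accuracy on the same data the tree was built from, so it tends to look better than it would on new flowers.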

Correlation Coefficient

The correlation coefficient is standardized covariance, in other words, unitless covariance.
Correlation is used to measure the size of an effect. Values of +/-0.1 represent a small effect, +/-0.3 a medium effect, and +/-0.5 a large effect. (Page 112, Discovering Statistics Using SPSS by Andy Field.)

Now, what is a covariance? Some other time, but let's see how simple it is to run a correlation test in R.

Let's use the iris dataset and plot it for a visual inspection:
>plot(iris$Petal.Length,iris$Petal.Width,main="Correlation: iris$Petal.Length,iris$Petal.Width")

We do see a relationship between petal length and petal width. Now let's quantify it with the correlation:

> cor(iris$Petal.Length,iris$Petal.Width)

R spits out with one statistic: [1] 0.9628654
A correlation coefficient must lie between -1 and +1, where the magnitude signifies the strength of the relationship and the sign represents its direction (+ or -). In the iris case, the relationship is positive, which means petal width increases as petal length increases. The strength of the relationship is 0.96, which is very close to the upper limit of 1, hence we say that the relationship is very strong.

So, we conclude: there is a strong positive correlation between petal length and width.
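The statement that correlation is standardized covariance can be checked directly: divide the covariance by the product of the two standard deviations. A sketch; the short names pl, pw, r_manual and r_builtin are mine. cor.test() is added because it also gives a p-value for the correlation.

```r
pl <- iris$Petal.Length
pw <- iris$Petal.Width

# Covariance carries the units of both variables; dividing by both
# standard deviations removes them
r_manual <- cov(pl, pw) / (sd(pl) * sd(pw))
r_builtin <- cor(pl, pw)
print(r_manual)
print(all.equal(r_manual, r_builtin))

# Significance test for the correlation
res <- cor.test(pl, pw)
print(res$p.value)
```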

Friday, September 20, 2013

R : Packages

Packages are collections of R functions, data, and compiled code in a well-defined format.
The directory where packages are stored on your computer is called the library. The function
.libPaths() will show you where your library is located, while the function library()
will show you what packages you have saved in your library.

R comes with a standard set of packages, while others are available for
download and installation. Once installed, they have to be loaded into the session in order to
be used. The command search() will tell you which packages are loaded and ready to use.
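A small sketch of these commands, plus a way to check for a package without triggering an error. requireNamespace() is my addition here, and nortest is used only as an example package name:

```r
# Where packages are stored on this computer
print(.libPaths())

# What is attached in the current session
print(search())

# TRUE if the package is installed, FALSE otherwise (no error either way)
has_nortest <- requireNamespace("nortest", quietly = TRUE)
print(has_nortest)

if (!has_nortest) {
  message("Install it with install.packages(\"nortest\")")
}
```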

Getting data from standard normal distribution

Creating a vector of 30 numbers from a standard normal distribution:
> x=rnorm(30)

Check the shape of the distribution:

> plot(density(x))

See the output. Result will be different every time you use the rnorm function

http://www.ats.ucla.edu/stat/r/library/matrix_alg.htm

Monday, September 16, 2013

hold-out sample = testing data

Text classification : KNN is a good algorithm

R : Linear Regression

Grab the list of top 100 chess players from http://ratings.fide.com/toparc.phtml?cod=273

Save it as txt file. Read the data into R by typing at the R command prompt:

> chess=read.table(file.choose(), header=TRUE, sep="\t")

R will open up the file dialog. Choose the txt file from the location where you saved it.

Type the following to see the data
> chess

and the following to display the Rating variable
> chess$Rating

To create a plot:
> plot(chess$Rating)

To create a fitting line:
Create an x variable with 100 data points
> x=1:100

set y to chess$Rating to make it clean (not mandatory)

>y=chess$Rating

Now the line:
> abline(lm(y~x))

If you didn't create y,
> abline(lm(chess$Rating~x))
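abline() only draws the line. To see the numbers behind it, keep the fitted model in an object. Since the chess file is not reproduced here, this sketch uses the cars dataset that ships with R (stopping distance against speed), which is a substitution of mine and not the data from this post:

```r
# Fit a straight line: dist = intercept + slope * speed
fit <- lm(dist ~ speed, data = cars)

# Intercept and slope of the fitted line
print(coef(fit))

# Predicted stopping distance at a speed of 20
new_speed <- data.frame(speed = 20)
print(predict(fit, newdata = new_speed))

# Scatter plot with the fitted line on top
if (interactive()) {
  plot(cars$speed, cars$dist)
  abline(fit)
}
```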

Sunday, September 15, 2013

Change the Working Directory in R

> setwd("e:/r")

See the current working directory:

> getwd()

How to Install a package in R

> install.packages('car')

Topic 1: Introduction to R from a Newbie's Perspective

Install R
Download R from http://cran.r-project.org/bin/windows/
I downloaded this and installed on my Windows 7 PC. Working just fine

Start R and type the following at the prompt:

> R.version

You get the following output: _
platform i386-w64-mingw32
arch i386
os mingw32
system i386, mingw32
status
major 3
minor 0.1
year 2013
month 05
day 16
svn rev 62743
language R
version.string R version 3.0.1 (2013-05-16)
nickname Good Sport

Lets generate some data.
> x <- 1:100
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
[58] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100

We may wish to edit the data. Fire up the data editor:

> data.entry(x)

Close the editor after you are done

If you want to quit R, type

> q()

In order to save the command history:
> setwd("e:/R")
> savehistory(file="history-9-26-2013.Rhistory")

You can load the history with the following command:
> loadhistory(file="history-9-26-2013.Rhistory")

Saving workspace image

The option to save the workspace saves only the objects you have created, not any output you have produced using them. You can save the workspace at any time using the save.image() command (also available as "Save Workspace" under the File menu), or at the end of a session, when R will ask you if you want to save the workspace.
type:
> save.image()
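save.image() writes every object in the workspace. To save just one or two objects, save() and load() can be used instead. A sketch that writes to a temporary file so nothing on your disk is touched; the object scores and the use of tempfile() are my choices:

```r
scores <- c(77, 88, 99)

# Save a single object to a file
path <- tempfile(fileext = ".RData")
save(scores, file = path)

# Remove it from memory, then bring it back under its original name
rm(scores)
load(path)
print(scores)

# saveRDS()/readRDS() store one object and let you pick the name on reading
rds_path <- tempfile(fileext = ".rds")
saveRDS(scores, rds_path)
again <- readRDS(rds_path)
print(identical(again, scores))
```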

To list all the packages installed on your system, type:
> installed.packages()

You can achieve a less complete list by typing:

> library()

To check whether there are newer versions of your installed packages at CRAN:

> old.packages()

You can use the following command to update all your installed packages:

> update.packages()

Sample Data in R

Start R
R comes with sample data. To see what datasets are available, type the following at the R command prompt:
> data()

R shows the datasets available.

Take a look at the data

Type the following at R prompt:
> iris

R prints out the iris data

To see a summary of the data type the following at the R prompt:
> summary(iris)

R prints out the summary.

There are several variables in iris. You can look at the data by typing at the R prompt:
> iris[1]

R prints out the Sepal.Length variable.

Similarly, to see Sepal.Width, type:
> iris[2]

Another way of referring to a variable within a data frame (what we have been calling a dataset so far):
> iris$Sepal.Width

R prints out the data, this time horizontally.

Intuitively, you would refer to Sepal.Length as iris$Sepal.Length. Type this at R prompt, you get the data back.

To summarize iris$Sepal.Length, you can type:
> fivenum(iris$Sepal.Length)

Do not try:
> fivenum(iris[1])

To summarize the Sepal.Width variable
> summary(iris$Sepal.Width)

To plot Sepal.Width
> plot(iris$Sepal.Width)

Histogram of iris$Sepal.Width
> hist(iris$Sepal.Width)


Check the shape of iris$Sepal.Width

> plot(density((iris$Sepal.Width)))

See the image : the shape of the variable

Other commands to explore:
> qqnorm(iris$Sepal.Width)

> qqline(iris$Sepal.Width)

Shapiro-Wilk normality test
> shapiro.test(iris$Sepal.Width)

> quantile(iris$Sepal.Width)

For Standard Deviation
> sd(iris$Sepal.Width)

For Variance
> var(iris$Sepal.Width)

To do a box plot of Sepal.Width
> boxplot(iris$Sepal.Width, ylab="Sepal Width (iris data)",
names="Sepal Width",
main="Sepal Width Boxplot")

To see two boxplots of Sepal.Length and Sepal.Width side by side:
> boxplot(iris$Sepal.Length, iris$Sepal.Width, ylab="Sepal Length/Width (iris data)",
names=c("Sepal Length", "Sepal Width"),
main="Sepal Length & Width Boxplot")
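Boxplots become more useful when one variable is split by a group. A sketch using the formula interface, with Sepal.Width split by Species; plot = FALSE is used first to get the numbers behind the boxes without drawing anything, and the name bp is mine:

```r
# Numbers behind the boxes: one column per species
bp <- boxplot(Sepal.Width ~ Species, data = iris, plot = FALSE)
print(bp$names)
print(bp$stats)   # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
print(bp$n)       # number of observations per group

# The plot itself
if (interactive()) {
  boxplot(Sepal.Width ~ Species, data = iris,
          ylab = "Sepal Width (iris data)",
          main = "Sepal Width by Species")
}
```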

library(help = "datasets")

Package foreign

Also of note is an R package called foreign. This package contains functionality for importing data into R that is formatted by most other statistical software packages, including SAS, SPSS, Stata and others. Package foreign is available for download and installation from the CRAN site.