dataminer: September 2013

Saturday, September 21, 2013

Topic 3: Objects, Data Types etc.

There are two aspects of the R language that we need to understand: objects and functions.

Object

An object can be thought of as a named storage space for a value, for example:

> x <- 916
Here we have created an object which has stored the value 916. "<-" is the assignment operator in R. It is good to remember that everything is stored as an object in R.

Type x at R prompt:

> x

You get the following output,
[1] 916

The 1 within square brackets tells us that this is the first element of the object x (in this case, the only element). As we shall see, an object can contain several elements, and then the numbers within square brackets become helpful.

Function
A function is a special type of R object designed to carry out some operation. A function usually takes some arguments and produces a result by executing a set of operations. R comes with a set of functions for our use, but we can also create our own.
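To make the idea concrete, here is a minimal sketch of a user-defined function. The name add_two and its argument v are made up for this illustration.

```r
# add_two is a hypothetical function: it takes one argument and returns it plus 2
add_two <- function(v) {
  v + 2
}

result <- add_two(916)   # call it like any built-in function
print(result)            # [1] 918
print(mode(add_two))     # [1] "function"
```

If you try this at the prompt, run rm(add_two, result) afterwards so that the ls() output shown next still lists only x.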

You can take a look at what objects are available in the current R session by typing:
> ls()

Since we have created one object, x, we get the following output from R:
[1] "x"

Objects you create stay in memory until you delete them. You can delete an object to free up memory with:
> rm(x)

Now type ls() to see the list of objects again. R outputs the following:

character(0)

This is an empty character vector, which means no objects remain.

Object names may consist of any upper- and lower-case letters, the digits 0 to 9 (except at the beginning of the name), and also the period, ".", which behaves like a letter. Note that names in R are case sensitive, meaning that Color and color are two distinct objects. This is a frequent cause of frustration for beginners who keep getting "object not found" errors. If you face this type of error, start by checking the spelling and case of the object name that caused it.
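A small sketch of the case-sensitivity pitfall follows. The object names are invented for the example, and tryCatch() is used only so that the error message can be captured instead of stopping the script.

```r
Petal.Count <- 5          # a valid name: letters plus a period

# The same letters in lower case refer to a different (here, undefined) object
msg <- tryCatch(petal.count,
                error = function(e) conditionMessage(e))
print(msg)                # the "object ... not found" error message

found_exact <- exists("Petal.Count")   # TRUE: the name matches exactly
found_lower <- exists("petal.count")   # FALSE: the case differs
```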

The most basic data object in R is a vector. When we created the object x, we created a vector of length one holding the value 916. Every object has a length and a mode.

The mode tells you the kind of data stored in the object. Vectors are used to store a set of elements of the same atomic data type. The main atomic types are character, logical, numeric, and complex. Hence, you may have vectors of characters, logical values (TRUE or FALSE, which can be abbreviated to T or F), numbers, and complex numbers.

Let's create another vector (object):
> y <- 1:10
> y
Output: [1] 1 2 3 4 5 6 7 8 9 10
> length(y)
Output: [1] 10
> mode(y)
Output: [1] "numeric"

All elements of a vector must be of the same mode, that is, of the same type. Try the following:
> y <- c(1:5,"Hello")
> y
[1] "1" "2" "3" "4" "5" "Hello"

> mode(y)
[1] "character"

> length(y)
[1] 6

First, we created the vector y using the c() function, which combines its arguments into a vector. Within the c() function we used 1:5, which is just an alternative to typing 1,2,3,4,5.
Then we added another argument, "Hello" (character type). When we printed the elements of y, all the elements appeared within double quotes. In the next line we checked the mode of the vector y and got "character". R has used type coercion: since we provided a character element, it converted all the numeric elements to character type so that every element of the vector has the same mode.

Point to remember: All elements of a vector must be of same mode.
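The coercion follows a fixed order: logical values give way to numbers, and numbers give way to characters. A short sketch of this, with made-up variable names:

```r
a <- c(TRUE, 2)          # logical + numeric   -> numeric (TRUE becomes 1)
b <- c(1.5, "Hello")     # numeric + character -> character
d <- c(TRUE, "Hello")    # logical + character -> character

print(mode(a))           # [1] "numeric"
print(mode(b))           # [1] "character"
print(mode(d))           # [1] "character"

# Going back is explicit: text that is not a number becomes NA (with a warning)
back <- suppressWarnings(as.numeric(c("1", "2", "Hello")))
print(back)              # [1]  1  2 NA
```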

We can refer to elements of a vector in the following way:
> y[1]
Output: [1] "1"

We can change elements in the following way:
> y[1] = "New Value"
> y
Output: [1] "New Value" "2" "3" "4" "5" "Hello"

As expected, the first element of the vector y has been changed to "New Value". You might have noticed that we switched the assignment operator to "=", which works just fine here. This time let's change a value the old way:

> y[2] <- "Another Value"
> y
Output: [1] "New Value" "Another Value" "3" "4" "5" "Hello"

We can perform all sorts of operations on vectors.

> vect = 1:10
> vect
Output: [1] 1 2 3 4 5 6 7 8 9 10
> vect = vect + 2
> vect
Output: [1] 3 4 5 6 7 8 9 10 11 12
You can see that every element has been incremented by 2.

> vect = vect * 2
> vect
Output: [1] 6 8 10 12 14 16 18 20 22 24

Every element of the vector has been multiplied by 2.

Now let's reset vect and take the square root of every element:

> vect = 1:10
> vect = sqrt(vect)
> vect
Output: [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
[9] 3.000000 3.162278

In the last example we reset "vect" to 1:10, ran the square root operation on all of its elements, and assigned the results back to "vect".

Similarly, we can add two vectors element by element:

> x = 1:10
> y = 11:20
> z = x + y
> z
Output: [1] 12 14 16 18 20 22 24 26 28 30

Missing Values:
A missing value is represented by NA in R.

> x = 1:10

Let's replace the 9th element with NA:
> x[9] = NA

> sum(x)
[1] NA

Any sum that includes an NA is itself NA, so we need to tell sum() to exclude the missing value:
> sum(x, na.rm=T)
[1] 46

Try mean():
> mean(x, na.rm=T)
[1] 5.111111

You can see the total has been divided by 9, the number of non-missing values. This is what you want.
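To check this by hand, is.na() can be used to find and count the missing values. A minimal sketch, with made-up variable names:

```r
x <- 1:10
x[9] <- NA

missing_flags <- is.na(x)          # TRUE where a value is missing
n_missing <- sum(missing_flags)    # number of missing values: 1
where <- which(missing_flags)      # position of the missing value: 9

# mean(x, na.rm = TRUE) is the sum of the non-missing values divided by their count
manual_mean <- sum(x, na.rm = TRUE) / sum(!missing_flags)
print(manual_mean)                 # [1] 5.111111
```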

Factor
Another important aspect of R is the factor. It provides a convenient way of handling categorical (nominal) variables. Factors have levels that determine the possible values the variable can take. Let's see an example:

> coffee = c("cold", "right", "hot", "hot", "right", "cold")
> factor(coffee)

Output: [1] cold right hot hot right cold
Levels: cold hot right

> table(factor(coffee))

Output:
 cold   hot right
    2     2     2

We can use gl() function to generate sequences involving factors:
> lab=gl(3, 5, labels = c("child", "adult", "old"))
> lab
[1] child child child child child adult adult adult adult adult old old old old old
Levels: child adult old

The general form of gl(), as given on its help page, is:

gl(n, k, length = n*k, labels = 1:n, ordered = FALSE)

Arguments

`n`	an integer giving the number of levels.
`k`	an integer giving the number of replications.
`length`	an integer giving the length of the result.
`labels`	an optional vector of labels for the resulting factor levels.
`ordered`	a logical indicating whether the result should be ordered or not.

> table(lab)
lab
child adult   old
    5     5     5
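By default factor() sorts the levels alphabetically, which is why the coffee example lists them as cold, hot, right. Here is a sketch of setting the order explicitly with the levels argument of factor(); the variable names temp and default_levels are made up.

```r
coffee <- c("cold", "right", "hot", "hot", "right", "cold")

# Without levels=, the levels are sorted alphabetically
default_levels <- levels(factor(coffee))     # "cold" "hot" "right"

# With levels=, we choose the order ourselves
temp <- factor(coffee, levels = c("cold", "right", "hot"))
print(levels(temp))        # [1] "cold"  "right" "hot"
print(as.integer(temp))    # [1] 1 2 3 3 2 1  (the internal integer codes)
```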

Generating Random Numbers
R has several functions that can be used to generate random sequences according to different probability density functions. The functions have the generic structure rfunc(n, par1, par2, ...), where func is the name of the probability distribution, n is the number of values to generate, and par1, par2, ... are the values of any parameters the density function requires. For instance, if you want five randomly generated numbers from a normal distribution with zero mean and unit standard deviation, type:

> rnorm(5)
[1] -0.4335585 -0.1092160 0.1082784 -0.5065135 -0.5878001

Or, for five numbers from a Student's t distribution with 7 degrees of freedom:

> rt(5, df=7)

[1] -1.07614048 -0.02142847 0.88955231 -1.42091564 1.29517603
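Because these numbers are random, your output will differ from the one shown. If reproducible results are needed, the seed of the generator can be fixed with set.seed(); the seed value 42 below is arbitrary. The sketch also uses runif(), which follows the same rfunc(n, par1, par2, ...) pattern for the uniform distribution.

```r
set.seed(42)                 # fix the seed (42 is an arbitrary choice)
first <- rnorm(5)

set.seed(42)                 # the same seed again ...
second <- rnorm(5)           # ... gives exactly the same numbers

print(identical(first, second))   # [1] TRUE

u <- runif(3, min = 0, max = 10)  # three uniform numbers between 0 and 10
```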

Indexes
Will come back later

Arrays and Matrices
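The matrix() function arranges the elements of a vector into rows and columns, filling column by column:
> mat = 1:20
> mat=matrix(mat,4,5)
> mat
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20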

> mat=1:20
> mat = matrix(mat,2,10)
> mat
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    3    5    7    9   11   13   15   17    19
[2,]    2    4    6    8   10   12   14   16   18    20
> rownames(mat)<-c("one","two")

> rownames(mat)
[1] "one" "two"

> mat["one",]
[1] 1 3 5 7 9 11 13 15 17 19

The same can be achieved with following command:
> mat[1,]
[1] 1 3 5 7 9 11 13 15 17 19

What if you want the 6th column?

> mat[,6]
one two
 11  12

Arrays are similar to matrices, but an array can have more than 2 dimensions.
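As an illustration, here is a sketch of an array with three dimensions built with array(); the name arr and the sizes are chosen only for the example.

```r
# 24 values arranged as 2 rows x 3 columns x 4 layers, filled column by column
arr <- array(1:24, dim = c(2, 3, 4))

print(dim(arr))          # [1] 2 3 4
first_layer <- arr[, , 1]   # a 2 x 3 matrix holding the values 1 to 6
one_value <- arr[1, 2, 3]   # row 1, column 2, layer 3
print(one_value)         # [1] 15
```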

List

List elements need not be of the same mode or length.

> student = list(id=67,
+ name='Jamal',
+ marks = c(77,88,99)
+ )
> student
Output:
$id
[1] 67

$name
[1] "Jamal"

$marks
[1] 77 88 99

You may check the mode:
> mode(student)
Output:
[1] "list"

You may extract individual elements by using the [n] notation, where n is the subscript.
> student[1]
$id
[1] 67

R returns a list which is a sub-list of the list object 'student'. You can verify this:
> mode(student[1])
[1] "list"

In order to extract the value of 'id':
> student[[1]]
[1] 67

You have to use double square brackets along with the element subscript to extract the value itself.

You can verify the mode:
> mode(student[[1]])
[1] "numeric"

Try these:
> student[[2]]
[1] "Jamal"
> mode(student[[2]])
[1] "character"
> student[[3]]
[1] 77 88 99

> mode(student[[3]])

[1] "numeric"
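Since the elements of this list are named, they can also be reached by name rather than by position. A brief sketch; the extra element average is added only for illustration.

```r
student <- list(id = 67, name = "Jamal", marks = c(77, 88, 99))

by_dollar <- student$name          # same as student[[2]]
by_name <- student[["marks"]]      # same as student[[3]]

# A new element can be added simply by assigning to a new name
student$average <- mean(student$marks)

print(names(student))     # [1] "id"      "name"    "marks"   "average"
print(student$average)    # [1] 88
```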

Dataframe

A data frame is a versatile data object in R, much like a spreadsheet. Each column of the data frame is a vector. All data elements in a column must be of the same mode. However, different columns can be of different modes. All columns in a data frame must be of the same length.

Create a dataframe:
> myset = data.frame(id = c(916,917,918), names = c("Dilir","Tr","Arif"))
> myset
   id names
1 916 Dilir
2 917    Tr
3 918  Arif

You can refer to a variable in the following manner:
> myset$names
[1] Dilir Tr Arif
Levels: Arif Dilir Tr

The Levels line appears because data.frame() converted the strings to a factor, which was the default before R 4.0.0. In newer versions, pass stringsAsFactors = TRUE to data.frame() to get the same result.

You can use the table function as well.
> table(myset$names)

 Arif Dilir    Tr
    1     1     1

> myset$id
[1] 916 917 918
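A few more ways of looking into a data frame are sketched below. The names column is kept as plain text here by passing stringsAsFactors = FALSE, which is a choice made for this example.

```r
myset <- data.frame(id = c(916, 917, 918),
                    names = c("Dilir", "Tr", "Arif"),
                    stringsAsFactors = FALSE)

n_rows <- nrow(myset)       # number of rows: 3
n_cols <- ncol(myset)       # number of columns: 2

second_row <- myset[2, ]    # row 2, all columns (still a data frame)

# Rows where id is greater than 916, "names" column only
later_names <- myset[myset$id > 916, "names"]
print(later_names)          # [1] "Tr"   "Arif"
```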

Topic 2: Help and Finding Resources on R

To look up the help for a function, type:
> ?plot
This will bring up the help topic for the plot function.

If you are connected to the Internet, you can use the RSiteSearch() function, which searches for key words or phrases in the mailing list archives, R manuals, and help pages. For example, type the following:

> RSiteSearch('decision tree')

This will bring up plenty of resources on decision trees.

Another place to look for help is: http://www.rseek.org

Customizing R

At the R command prompt, type
> options(prompt = "R> ", continue = " ")

This does two things. First, it changes the prompt to R>. Second, the second and subsequent lines of a multi-line command no longer begin with a + symbol.

Decision Tree with R

For decision tree analysis we will use the iris data set.

We need the ctree() function to perform the analysis. This function is available in the package "party".

Fire up R and check whether "party" is available on your system. Type:
> library("party")

If you don't get the following error, you are good to go:
Error in library("party") : there is no package called ‘party’
Otherwise, you have to install the party package by typing:
> install.packages("party")

This may take a few minutes. Once done, type the following at the R prompt:
> library("party")
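The check-then-install steps can also be written as code. This is a sketch using requireNamespace(), which returns FALSE instead of stopping with an error when a package is missing. The helper name has_package is made up, and the install line is left as a comment because it needs an Internet connection.

```r
# Returns TRUE if the package can be loaded, FALSE otherwise (no error is raised)
has_package <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

has_stats <- has_package("stats")   # "stats" ships with R, so this is TRUE
print(has_stats)

# Install "party" only when it is missing:
# if (!has_package("party")) install.packages("party")
```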

Then
> iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)
> iris_ctree
You will get the following output:
Conditional inference tree with 4 terminal nodes

Response: Species
Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
Number of observations: 150

1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
  2)* weights = 50
1) Petal.Length > 1.9
  3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
    4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
      5)* weights = 46
    4) Petal.Length > 4.8
      6)* weights = 8
  3) Petal.Width > 1.7
    7)* weights = 46

Type:
> plot(iris_ctree)

R draws the tree in the graphics window.

Again type:
> plot(iris_ctree, type="simple")

R draws a simplified version of the tree.

Correlation Coefficient

The correlation coefficient is standardized covariance; in other words, unitless covariance.
Correlation is used to measure the size of an effect. Values of +/-0.1 represent a small effect, +/-0.3 a medium effect, and +/-0.5 a large effect (page 112, Discovering Statistics Using SPSS by Andy Field).

What is covariance? We will leave that for another time; for now, let's see how simple it is to compute a correlation in R.

Let's use the iris dataset and plot it for a visual inspection:
> plot(iris$Petal.Length, iris$Petal.Width, main="Correlation: iris$Petal.Length,iris$Petal.Width")

We do see a relationship between petal length and petal width. Now let's quantify it with the correlation:

> cor(iris$Petal.Length,iris$Petal.Width)

R returns a single statistic: [1] 0.9628654
A correlation coefficient must lie between -1 and +1, where the magnitude signifies the strength of the relationship and the sign represents its direction (+ or -). In the iris case, the relationship is positive, which means petal width increases as petal length increases. The strength of the relationship is 0.96, which is very close to the limit of 1, hence we say that the relationship is very strong.

So, we conclude: there is a strong positive correlation between petal length and width.
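The statement that correlation is standardized covariance can be checked directly: dividing the covariance by the product of the two standard deviations should give the same number as cor(). The sketch below does this with cov() and sd(), and also calls cor.test(), which additionally reports a p-value; the variable names are made up.

```r
x <- iris$Petal.Length
y <- iris$Petal.Width

# Covariance divided by both standard deviations = correlation
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)
print(all.equal(r_manual, r_builtin))   # [1] TRUE

# cor.test() gives the same estimate together with a significance test
ct <- cor.test(x, y)
print(ct$estimate)
print(ct$p.value)
```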

Friday, September 20, 2013

R : Packages

Packages are collections of R functions, data, and compiled code in a well-defined format.
The directory where packages are stored on your computer is called the library. The function
.libPaths() will show you where your library is located, while the function library()
will show you what packages you have saved in your library.

R comes with a standard set of packages, while others are available for
download and installation. Once installed, they have to be loaded into the session in order to
be used. The command search() will tell you which packages are loaded and ready to use.

Getting data from standard normal distribution

Create a vector of 30 numbers drawn from a standard normal distribution:
> x=rnorm(30)

Check the shape of the distribution:

> plot(density(x))

See the output. The result will be different every time you use the rnorm function, because the numbers are drawn at random.

http://www.ats.ucla.edu/stat/r/library/matrix_alg.htm

Monday, September 16, 2013

hold-out sample = testing data

Text classification : KNN is a good algorithm

R : Linear Regression

Grab the list of top 100 chess players from http://ratings.fide.com/toparc.phtml?cod=273

Save it as a tab-delimited txt file. Read the data into R by typing at the R command prompt:

> chess=read.table(file.choose(), header=TRUE, sep="\t")

R will open up the file dialog. Choose the txt file from the location where you saved it.

Type the following to see the data
> chess

and the following to display the Rating variable
> chess$Rating

To create a plot:
> plot(chess$Rating)

To add a fitted line:
Create an x variable with 100 data points
> x=1:100

Set y to chess$Rating to keep the code clean (not mandatory)

> y=chess$Rating

Now the line:
> abline(lm(y~x))

If you didn't create y,
> abline(lm(chess$Rating~x))
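lm() fits a straight line y = a + b*x by least squares, and abline() draws that line on the current plot. Since the chess file may not be at hand, the sketch below uses made-up numbers (not real FIDE ratings) to show how to read the fitted intercept and slope.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility
x <- 1:100
y <- 2800 - 1.2 * x + rnorm(100, sd = 5)     # simulated values, NOT real ratings

fit <- lm(y ~ x)                             # least-squares line through the points
intercept <- coef(fit)[["(Intercept)"]]      # estimate of a, close to 2800
slope <- coef(fit)[["x"]]                    # estimate of b, close to -1.2
print(c(intercept, slope))
```

After plot(x, y), calling abline(fit) draws this fitted line, just like abline(lm(y~x)).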

Sunday, September 15, 2013

Change the Working Directory in R

> setwd("e:/r")

See the current working directory:

> getwd()

How to Install a package in R

> install.packages('car')

Topic 1: Introduction to R from a Newbie's Perspective

Install R
Download R from http://cran.r-project.org/bin/windows/
I downloaded it and installed it on my Windows 7 PC. It works just fine.

Start R and type the following at the prompt:

> R.version

You get the following output:
               _
platform       i386-w64-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          3
minor          0.1
year           2013
month          05
day            16
svn rev        62743
language       R
version.string R version 3.0.1 (2013-05-16)
nickname       Good Sport

Let's generate some data.
> x <- 1:100
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
[58] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100

We may wish to edit the data. Fire up the data editor:

> data.entry(x)

Close the editor after you are done.

If you want to quit R, type

> q()

To save the command history:
> setwd("e:/R")
> savehistory(file="history-9-26-2013.Rhistory")

You can load the history back with the following command:
> loadhistory("history-9-26-2013.Rhistory")

Saving workspace image

The option to save the workspace saves only the objects you have created, not any output you have produced using them. You can save the workspace at any time using the save.image() command (also available as “Save Workspace” under the File menu), or at the end of a session, when R will ask you if you want to save the workspace.
Type:
> save.image()
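save.image() writes every object in the workspace. To save only selected objects, save() and load() can be used instead. The sketch below writes to a temporary file, chosen here so that nothing is left behind on disk; in practice you would give a file name of your own.

```r
x <- 1:100
tmp <- tempfile(fileext = ".RData")   # temporary file name for this example

save(x, file = tmp)     # write the object x to the file
rm(x)                   # remove it from memory

loaded <- load(tmp)     # restore it; load() returns the names it restored
print(loaded)           # [1] "x"
print(length(x))        # [1] 100
```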

To list all installed packages along with their details, type:
> installed.packages()

You can get a less detailed list by typing:

> library()

To check whether there are newer versions of your installed packages at CRAN:

> old.packages()

You can use the following command to update all your installed packages:

> update.packages()

Sample Data in R

Start R
R comes with sample data. To see what datasets are available, type the following at the R command prompt:
> data()

R shows the datasets available.

Take a look at the data

Type the following at R prompt:
> iris

R prints out the iris data

To see a summary of the data type the following at the R prompt:
> summary(iris)

R prints out the summary.

There are several variables in iris. You can look at the data by typing at the R prompt:
> iris[1]

R prints out the Sepal.Length variable.

Similarly, to see Sepal.Width, type:
> iris[2]

Another way of referring to a variable within a data frame (what we have been calling a dataset so far):
> iris$Sepal.Width

R prints out the data, this time horizontally, because iris$Sepal.Width is a plain vector rather than a data frame.

Likewise, you can refer to Sepal.Length as iris$Sepal.Length. Type this at the R prompt and you get the data back.

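To get a five-number summary (minimum, lower hinge, median, upper hinge, maximum) of Sepal.Length, you can type:
> fivenum(iris$Sepal.Length)

Do not try the following; iris[1] is a one-column data frame rather than a numeric vector, and fivenum() expects a vector:
> fivenum(iris[1])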
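The difference between the two forms can be seen by checking their classes. Single brackets keep the data frame wrapper, while double brackets (or the $ notation) pull out the column itself. A small sketch, with made-up variable names:

```r
class_single <- class(iris[1])      # "data.frame": still a table with one column
class_double <- class(iris[[1]])    # "numeric": the column as a plain vector

# Double brackets therefore work where a vector is expected
five <- fivenum(iris[[1]])
print(five)
print(identical(five, fivenum(iris$Sepal.Length)))   # [1] TRUE
```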

To summarize the Sepal.Width variable:
> summary(iris$Sepal.Width)

To plot Sepal.Width
> plot(iris$Sepal.Width)

Histogram of iris$Sepal.Width
> hist(iris$Sepal.Width)

Check the shape of iris$Sepal.Width

> plot(density(iris$Sepal.Width))

The density curve shows the shape of the variable's distribution.

Other commands to explore:
> qqnorm(iris$Sepal.Width)

> qqline(iris$Sepal.Width)

Shapiro-Wilk normality test
> shapiro.test(iris$Sepal.Width)

For quantiles
> quantile(iris$Sepal.Width)

For Standard Deviation
> sd(iris$Sepal.Width)

For Variance
> var(iris$Sepal.Width)

To do a box plot of Sepal.Width
> boxplot(iris$Sepal.Width, ylab="Sepal Width (iris data)",
names="Sepal Width",
main="Sepal Width Boxplot")

To see two boxplots of Sepal.Length and Sepal.Width side by side:
> boxplot(iris$Sepal.Length, iris$Sepal.Width, ylab="Sepal Length/Width (iris data)",
names=c("Sepal Length", "Sepal Width"),
main="Sepal Length & Width Boxplot")

To see the full list of datasets that ship with R:
> library(help = "datasets")

Package foreign

Also of note is an R package called foreign. This package contains functions for importing data written by most other statistical software packages, including SAS, SPSS, Stata and others. Package foreign is available for download and installation from the CRAN site.