R Iris Dataset
Iris DataSet
The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter are not linearly separable from each other.
Predicted attribute: class of iris plant. Source: http://archive.ics.uci.edu/ml/datasets/Iris
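The separability claim can be checked directly for a single feature. The following is a minimal sketch, assuming that petal length alone is enough to split setosa from the other two species; the variable names are illustrative and not part of the original analysis.

```r
data(iris)

# Petal length of setosa versus the other two species
is_setosa  <- iris$Species == "setosa"
setosa_max <- max(iris$Petal.Length[is_setosa])
others_min <- min(iris$Petal.Length[!is_setosa])

# If the longest setosa petal is shorter than the shortest non-setosa petal,
# any threshold between the two values separates setosa perfectly.
separable <- setosa_max < others_min
threshold <- (setosa_max + others_min) / 2
print(c(setosa_max = setosa_max, others_min = others_min, threshold = threshold))
```

A gap between the two values means a single cut on Petal.Length isolates setosa. The same check fails for versicolor versus virginica, whose ranges overlap.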
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Including Plots
Scatter Plot
plot(x=iris$Petal.Length, y=iris$Petal.Width, col=iris$Species)
We can use the pch argument (plotting character) to give each species its own marker. pch=21 draws filled circles, pch=22 filled squares, pch=23 filled diamonds, and pch=24 and pch=25 up- and down-pointing triangles.
plot(iris$Petal.Length, iris$Petal.Width, pch=c(23,24,25)[unclass(iris$Species)], main="Iris Data")
Alternatively, we can keep one marker shape and set the fill color by species with bg:
plot(iris$Petal.Length, iris$Petal.Width,pch=21, bg=c("red","green3","blue")[unclass(iris$Species)], main="Iris Data")
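Neither plot says which marker belongs to which species. The sketch below combines shape and fill color and adds a legend. The particular shapes, the legend position, and the temporary PDF device are assumptions made for illustration.

```r
data(iris)

# One marker shape and fill color per species level (same colors as above)
shapes <- c(21, 22, 24)
fills  <- c("red", "green3", "blue")
sp     <- as.integer(iris$Species)   # 1 = setosa, 2 = versicolor, 3 = virginica

# Draw to a temporary PDF so the sketch also runs non-interactively
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)
plot(iris$Petal.Length, iris$Petal.Width,
     pch = shapes[sp], bg = fills[sp],
     xlab = "Petal.Length", ylab = "Petal.Width", main = "Iris Data")
legend("topleft", legend = levels(iris$Species),
       pch = shapes, pt.bg = fills, title = "Species")
invisible(dev.off())
```

In an interactive session the pdf() and dev.off() calls can be dropped so the plot appears on screen.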
Scatter Plot Matrix
This shows the possible two-dimensional projections of multidimensional data
pairs(iris[1:4], col=iris$Species, pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
In the plot, Petal.Length and Petal.Width separate the species most clearly, so they are the most useful measurements for classifying Species.
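The visual impression can be backed by a number. As a rough sketch, the one-way ANOVA F statistic of each measurement against Species compares the spread between species means to the spread within species. Using it as a ranking of features is an assumption of this example, not something the original analysis does.

```r
data(iris)

# One-way ANOVA F statistic of each measurement against Species:
# a larger value means the species means differ more relative to
# the variation within each species.
f_stat <- sapply(names(iris)[1:4], function(v) {
  fit <- lm(iris[[v]] ~ iris$Species)
  anova(fit)[["F value"]][1]
})
print(round(sort(f_stat, decreasing = TRUE), 1))
```

Both petal measurements score well above both sepal measurements, which agrees with the scatter plot matrix.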
We can also print the correlation coefficient and p-value in the upper panels.
panel.cor <- function(x, y, digits = 2, cex.cor, ...)
{
  # temporarily switch the panel to unit coordinates and restore them on exit
  usr <- par("usr"); on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  # correlation coefficient
  r <- cor(x, y)
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste("r= ", txt, sep = "")
  text(0.5, 0.6, txt)
  # p-value calculation
  p <- cor.test(x, y)$p.value
  txt2 <- format(c(p, 0.123456789), digits = digits)[1]
  txt2 <- paste("p= ", txt2, sep = "")
  if(p<0.01) txt2 <- paste("p= ", "<0.01", sep = "")
  text(0.5, 0.4, txt2)
}
pairs(iris[1:4], col=iris$Species, pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)],
upper.panel=panel.cor)
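The panels pool all 150 flowers, so each r mixes the differences between species with the association inside each species. A small sketch, using base R only, compares the pooled correlation of the two petal measurements with the per-species values; the variable names are illustrative.

```r
data(iris)

# Correlation over all flowers, as printed by the panel function
r_all <- cor(iris$Petal.Length, iris$Petal.Width)

# The same correlation computed separately within each species
r_by_species <- sapply(split(iris, iris$Species),
                       function(d) cor(d$Petal.Length, d$Petal.Width))

print(round(c(pooled = r_all, r_by_species), 3))
```

The pooled value is higher than any within-species value, because much of the pooled correlation comes from the species occupying different regions of the plot.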
Using SVM for Classifying the Species
library("e1071")
index <- 1:nrow(iris)
testindex <- sample(index, trunc(length(index)/4))
testset <- iris[testindex,]
trainset <- iris[-testindex,]
svm_model <- svm(Species ~ ., data=trainset)
summary(svm_model)
##
## Call:
## svm(formula = Species ~ ., data = trainset)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.25
##
## Number of Support Vectors: 44
##
## ( 8 18 18 )
##
##
## Number of Classes: 3
##
## Levels:
## setosa versicolor virginica
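The split above calls sample() without a seed, so every run draws a different test set and the class counts in the test set vary. A possible alternative, sketched with base R only, fixes the seed and samples the same share of rows within each species. The seed value, the 25% share, and the strat_ names are assumptions for illustration.

```r
data(iris)
set.seed(42)          # arbitrary seed, chosen only for reproducibility
test_frac <- 0.25     # roughly the 1/4 share used above

# Sample the same share of row indices within each species
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
strat_testindex <- unlist(lapply(rows_by_species, function(idx) {
  sample(idx, floor(length(idx) * test_frac))
}), use.names = FALSE)

strat_testset  <- iris[strat_testindex, ]
strat_trainset <- iris[-strat_testindex, ]
print(table(strat_testset$Species))
```

With 50 flowers per species this gives 12 test rows per class, so no species is over- or under-represented in the evaluation.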
Predict the test data with the model
prediction <- predict(svm_model,testset[,-5])
tab <- table(pred = prediction, true = testset[,5])
tab
## true
## pred setosa versicolor virginica
## setosa 11 0 0
## versicolor 0 9 2
## virginica 0 0 15
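The confusion table can be reduced to a few summary numbers. The sketch below re-enters the counts printed above by hand (so it does not depend on the random split) and computes overall accuracy and per-class recall.

```r
# Counts copied from the table above: rows are predictions, columns are truth
species <- c("setosa", "versicolor", "virginica")
tab_svm <- matrix(c(11, 0,  0,
                     0, 9,  2,
                     0, 0, 15),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(pred = species, true = species))

# Overall accuracy: correctly classified flowers over all test flowers
accuracy <- sum(diag(tab_svm)) / sum(tab_svm)

# Recall per species: correct predictions over the true count of that species
recall <- diag(tab_svm) / colSums(tab_svm)

print(round(accuracy, 4))
print(round(recall, 4))
```

Here 35 of the 37 test flowers are classified correctly, and both errors are virginica flowers predicted as versicolor.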
Tuning SVM for best cost and gamma
svm_tune <- tune.svm(Species ~ ., data=trainset,
kernel="radial", cost=10^(-1:2), gamma=c(.5,1,2))
print(svm_tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 0.5 1
##
## - best performance: 0.05378788
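To see how much work the tuning call does, the grid can be written out explicitly. This sketch assumes that every cost and gamma combination is evaluated in each of the 10 folds; the names svm_grid, n_folds and n_fits are illustrative.

```r
# The candidate values passed to tune.svm above
svm_grid <- expand.grid(cost = 10^(-1:2), gamma = c(0.5, 1, 2))
print(svm_grid)

# With 10-fold cross-validation, each combination is fitted once per fold
n_folds <- 10
n_fits  <- nrow(svm_grid) * n_folds
print(n_fits)
```

The reported best performance is a cross-validated error rate, so lower is better. The selected gamma of 0.5 is the smallest value in the grid, so a grid that extends to smaller values (the untuned model used 0.25) may be worth trying.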
Run SVM after tuning
svm_model_after_tune <- svm(Species ~ ., data=iris, kernel="radial", cost=1, gamma=0.5)
summary(svm_model_after_tune)
##
## Call:
## svm(formula = Species ~ ., data = iris, kernel = "radial", cost = 1,
## gamma = 0.5)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.5
##
## Number of Support Vectors: 59
##
## ( 11 23 25 )
##
##
## Number of Classes: 3
##
## Levels:
## setosa versicolor virginica
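Predict the test data with the tuned model. Note that svm_model_after_tune was fit on the full iris data, which includes the test rows, so this table gives an optimistic estimate.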
prediction <- predict(svm_model_after_tune,testset[,-5])
tab <- table(pred = prediction, true = testset[,5])
tab
## true
## pred setosa versicolor virginica
## setosa 11 0 0
## versicolor 0 9 1
## virginica 0 0 16
K-means clustering
set.seed(20)
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)
irisCluster
## K-means clustering with 3 clusters of sizes 50, 52, 48
##
## Cluster means:
## Petal.Length Petal.Width
## 1 1.462000 0.246000
## 2 4.269231 1.342308
## 3 5.595833 2.037500
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 3 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
## [106] 3 2 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 2 3
## [141] 3 3 3 3 3 3 3 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 2.02200 13.05769 16.29167
## (between_SS / total_SS = 94.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
Tabulate the clusters by species
table(irisCluster$cluster, iris$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 48 4
## 3 0 2 46
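The table shows where the clusters disagree with the species labels: 4 virginica flowers fall into cluster 2 and 2 versicolor flowers fall into cluster 3, so 6 of the 150 flowers sit in a cluster dominated by another species.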
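Because k-means is unsupervised, the cluster numbers are arbitrary, and the agreement with the species labels is better summarised by a label-free measure. A minimal sketch of cluster purity, which credits each cluster with its most frequent species, is shown below; the names km, tab_km and purity are illustrative.

```r
data(iris)
set.seed(20)

# Same clustering as above, on the two petal measurements
km <- kmeans(iris[, 3:4], centers = 3, nstart = 20)
tab_km <- table(cluster = km$cluster, species = iris$Species)

# Purity: share of flowers that belong to the majority species of their cluster
majority <- apply(tab_km, 1, max)
purity   <- sum(majority) / sum(tab_km)
print(purity)
```

For the table printed above, purity works out to (50 + 48 + 46) / 150 = 0.96.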
Plot the clusters
cluster <- as.factor(irisCluster$cluster)
plot(iris$Petal.Length, iris$Petal.Width,pch=21, bg=c("red","green3","blue")[unclass(cluster)], main="Iris Data")
You can compare it with the plot of the original data.
Using RandomForest for classification
library(randomForest)
model.rf <- randomForest(Species ~ . , data=trainset)
model.rf
##
## Call:
## randomForest(formula = Species ~ ., data = trainset)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 5.31%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 39 0 0 0.00000000
## versicolor 0 38 3 0.07317073
## virginica 0 3 30 0.09090909
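The OOB error rate and the class.error column follow directly from the confusion matrix, in which rows are the true species and columns are the out-of-bag predictions. The sketch below re-enters the printed counts by hand and recomputes both.

```r
# Counts copied from the OOB confusion matrix above (rows: true, columns: predicted)
species <- c("setosa", "versicolor", "virginica")
oob <- matrix(c(39,  0,  0,
                 0, 38,  3,
                 0,  3, 30),
              nrow = 3, byrow = TRUE,
              dimnames = list(true = species, pred = species))

# Per-class error: share of each true species that was predicted wrongly
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: all off-diagonal counts over all training rows
oob_error <- 1 - sum(diag(oob)) / sum(oob)

print(round(class_error, 8))
print(round(100 * oob_error, 2))
```

Six of the 113 training flowers are misclassified out of bag, which gives the 5.31% reported above.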
pred.rf <- predict(model.rf,testset[,-5])
tab <- table(pred = pred.rf, true = testset[,5])
tab
## true
## pred setosa versicolor virginica
## setosa 11 0 0
## versicolor 0 9 2
## virginica 0 0 15
Tuning Random Forest
tuneRF(trainset[,-5],trainset[,5],ntreeTry=100,
stepFactor=1.5,improve=0.01, trace=TRUE, plot=TRUE, doBest=FALSE)
## mtry = 2 OOB error = 5.31%
## Searching left ...
## Searching right ...
## mtry = 3 OOB error = 5.31%
## 0 0.01
## mtry OOBError
## 2.OOB 2 0.05309735
## 3.OOB 3 0.05309735
# note: randomForest() has no 'test' argument, so test=testset is passed through '...' and ignored
model.rf.tune <-randomForest(Species ~.,data=trainset, mtry=2, ntree=1000,
keep.forest=TRUE, importance=TRUE,test=testset)
model.rf.tune
##
## Call:
## randomForest(formula = Species ~ ., data = trainset, mtry = 2, ntree = 1000, keep.forest = TRUE, importance = TRUE, test = testset)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 5.31%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 39 0 0 0.00000000
## versicolor 0 38 3 0.07317073
## virginica 0 3 30 0.09090909
pred.rf.tune <- predict(model.rf.tune,testset[,-5])
tab <- table(pred = pred.rf.tune, true = testset[,5])
tab
## true
## pred setosa versicolor virginica
## setosa 11 0 0
## versicolor 0 9 2
## virginica 0 0 15
Neural Network
library(caret)
library(nnet)
model <- train(Species ~ ., trainset, method='nnet', trace = FALSE) # train
# we could also pass preProc = c("center", "scale") to train() to center and scale the data
model
## Neural Network
##
## 113 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 113, 113, 113, 113, 113, 113, ...
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa Accuracy SD Kappa SD
## 1 0e+00 0.8049272 0.6910480 0.16023075 0.26135095
## 1 1e-04 0.8423737 0.7573686 0.19609549 0.29686460
## 1 1e-01 0.9244082 0.8856883 0.04276652 0.06368417
## 3 0e+00 0.8907742 0.8302890 0.10494926 0.16537702
## 3 1e-04 0.9548395 0.9311542 0.03295088 0.04937277
## 3 1e-01 0.9568830 0.9342232 0.02642218 0.03999543
## 5 0e+00 0.9343500 0.9005662 0.06765619 0.10017960
## 5 1e-04 0.9511903 0.9256171 0.03516218 0.05280885
## 5 1e-01 0.9568830 0.9342232 0.02642218 0.03999543
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 3 and decay = 0.1.
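The code comment above mentions centering and scaling. As a rough base R illustration of what that step does (not caret's own implementation), the sketch below estimates the column means and standard deviations on the training rows only and then applies the same transformation to the test rows. The seed and the split used here are assumptions made so that the example is self-contained.

```r
data(iris)
set.seed(1)                                   # arbitrary seed for this sketch
test_idx <- sample(nrow(iris), 37)            # same test size as above
x_train  <- iris[-test_idx, 1:4]
x_test   <- iris[test_idx, 1:4]

# Estimate centering and scaling constants on the training rows only
ctr <- colMeans(x_train)
sds <- apply(x_train, 2, sd)

# Apply the same constants to both sets so the test data stays unseen
x_train_scaled <- scale(x_train, center = ctr, scale = sds)
x_test_scaled  <- scale(x_test,  center = ctr, scale = sds)

print(round(colMeans(x_train_scaled), 10))
print(round(apply(x_train_scaled, 2, sd), 10))
```

Neural networks tend to train more reliably when the inputs are on a comparable scale, which is why this step is commonly paired with nnet.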
prediction <- predict(model, testset[-5]) # predict
tab <- table(prediction, testset[,5]) # compare
tab
##
## prediction setosa versicolor virginica
## setosa 11 0 0
## versicolor 0 9 1
## virginica 0 0 16
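The caret output reports Kappa next to Accuracy. Cohen's kappa corrects the observed agreement for the agreement expected by chance from the row and column totals. The sketch below computes both for the test table printed above, with the counts re-entered by hand.

```r
# Counts copied from the table above: rows are predictions, columns are truth
species <- c("setosa", "versicolor", "virginica")
tab_nn <- matrix(c(11, 0,  0,
                    0, 9,  1,
                    0, 0, 16),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(prediction = species, true = species))

n     <- sum(tab_nn)
p_obs <- sum(diag(tab_nn)) / n                          # observed agreement (accuracy)
p_exp <- sum(rowSums(tab_nn) * colSums(tab_nn)) / n^2   # agreement expected by chance
cohen_kappa <- (p_obs - p_exp) / (1 - p_exp)

print(round(c(accuracy = p_obs, kappa = cohen_kappa), 4))
```

With one error in 37 test flowers, accuracy is 36/37 and kappa is slightly lower, because some agreement would occur by chance alone.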
Tuning nnet over the size and decay parameters. The train function from the caret package is a good starting point for this.
my.grid <- expand.grid(.decay = c(0.5, 0.1), .size = c(2,3,4,5))
model.nnet.tune <- train(Species ~ ., trainset, method='nnet', maxit = 1000, tuneGrid = my.grid, trace = FALSE)
model.nnet.tune
## Neural Network
##
## 113 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 113, 113, 113, 113, 113, 113, ...
## Resampling results across tuning parameters:
##
## decay size Accuracy Kappa Accuracy SD Kappa SD
## 0.1 2 0.9596507 0.9386234 0.03293291 0.05011685
## 0.1 3 0.9587417 0.9371725 0.03207521 0.04877295
## 0.1 4 0.9597173 0.9386446 0.03296815 0.05012780
## 0.1 5 0.9597173 0.9386446 0.03296815 0.05012780
## 0.5 2 0.9465429 0.9189847 0.04205641 0.06292420
## 0.5 3 0.9479592 0.9212251 0.04220599 0.06335961
## 0.5 4 0.9507754 0.9254962 0.03779758 0.05668289
## 0.5 5 0.9496569 0.9236362 0.03818369 0.05755772
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 4 and decay = 0.1.
prediction <- predict(model.nnet.tune, testset[-5]) # predict
tab <- table(prediction, testset[,5]) # compare
tab
##
## prediction setosa versicolor virginica
## setosa 11 0 0
## versicolor 0 9 1
## virginica 0 0 16
Generalised Boosted Models
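This is available as the method gbm in the caret package. Note that the model below is fit on the full iris data (data=iris), which includes the test rows, so the final test-set table is not an independent estimate of accuracy.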
fitControl <- trainControl(method="repeatedcv",
number=5,
repeats=1,
verboseIter=TRUE)
set.seed(25)
model.gbm <- train(Species ~ ., data=iris,
method="gbm",
trControl=fitControl,
verbose=FALSE)
## + Fold1.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150
## - Fold1.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150
## + Fold1.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150
## - Fold1.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150
## + Fold1.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150
## - Fold1.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150
## + Fold2.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150
## - Fold2.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150
## + Fold2.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150
## - Fold2.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150
## + Fold2.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150
## - Fold2.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150
## + Fold3.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150
## - Fold3.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150
## + Fold3.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150
## - Fold3.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150
## + Fold3.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150
## - Fold3.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150
## + Fold4.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150
## - Fold4.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150
## + Fold4.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150
## - Fold4.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150
## + Fold4.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150
## - Fold4.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150
## + Fold5.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150
## - Fold5.Rep1: shrinkage=0.1, interaction.depth=1, n.minobsinnode=10, n.trees=150
## + Fold5.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150
## - Fold5.Rep1: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150
## + Fold5.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150
## - Fold5.Rep1: shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, n.trees=150
## Aggregating results
## Selecting tuning parameters
## Fitting n.trees = 50, interaction.depth = 3, shrinkage = 0.1, n.minobsinnode = 10 on full training set
model.gbm
## Stochastic Gradient Boosting
##
## 150 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 120, 120, 120, 120, 120
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa Accuracy SD Kappa SD
## 1 50 0.9533333 0.93 0.05055250 0.07582875
## 1 100 0.9533333 0.93 0.03800585 0.05700877
## 1 150 0.9533333 0.93 0.03800585 0.05700877
## 2 50 0.9400000 0.91 0.04944132 0.07416198
## 2 100 0.9533333 0.93 0.03800585 0.05700877
## 2 150 0.9400000 0.91 0.04944132 0.07416198
## 3 50 0.9666667 0.95 0.02357023 0.03535534
## 3 100 0.9600000 0.94 0.02788867 0.04183300
## 3 150 0.9533333 0.93 0.02981424 0.04472136
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 50, interaction.depth
## = 3, shrinkage = 0.1 and n.minobsinnode = 10.
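The "Summary of sample sizes: 120, ..." line follows from the 5-fold setup: each fold holds out one fifth of the 150 rows. The base R sketch below assigns rows to folds at random to show the arithmetic. It is a simplified assumption and does not reproduce caret's own fold assignment, which also balances the classes within folds.

```r
data(iris)
set.seed(25)
n <- nrow(iris)
k <- 5

# Assign each row to one of k folds of equal size, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold, the model is trained on all rows outside that fold
train_sizes <- sapply(seq_len(k), function(f) sum(fold_id != f))
print(train_sizes)
```

Each of the five fits therefore trains on 120 rows and is scored on the remaining 30, and the Accuracy column averages those five scores.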
prediction <- predict(model.gbm, testset[-5])
tab <- table(prediction, testset[,5])
tab
##
## prediction setosa versicolor virginica
## setosa 11 0 0
## versicolor 0 9 0
## virginica 0 0 17