In this blog, you will understand what K-means clustering is and how it can be implemented on criminal data collected across the US states. The data contains, for each of the 50 US states in 1973, the number of arrests per 100,000 residents for murder, assault, and rape, along with the percentage of the population living in urban areas. Along with analyzing the data, you will also learn how to evaluate the quality of a clustering, how to choose the number of clusters using the elbow curve, and how the K-means algorithm itself works.
Let us start with the analysis. The data looks as follows:
Need this dataset? Click on the image above to download it.
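As a side note, this is the built-in USArrests dataset that ships with base R (the code below loads it by name), so you can also preview it directly without downloading anything:
> head(USArrests)   # first few rows of the built-in dataset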
First let’s prepare the data for the analysis. In order to do so, we should remove any NA values that might be present in the data and convert the data into a matrix.
> crime0 <- na.omit(USArrests)
> crime <- data.matrix(crime0)
> str(crime)
 num [1:50, 1:4] 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
  ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
Let us take the number of clusters to be 5. The kmeans() function takes the input data and the number of clusters into which the data is to be grouped. The syntax is kmeans(data, k), where k is the number of cluster centers.
> cl <- kmeans(crime, 5)
> class(cl)
[1] "kmeans"
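A quick aside, since kmeans() picks its initial cluster centers at random: if you want results that are reproducible across runs, you can fix the random seed and request several random starts. This is only a sketch; the object name cl.seeded, the seed value 123 and nstart = 25 are arbitrary illustrative choices, not part of the original post.
> set.seed(123)                                          # fix the random starting centers
> cl.seeded <- kmeans(crime, centers = 5, nstart = 25)   # keep the best of 25 random starts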
Analyzing the Clustering:
> str(cl)
List of 9
 $ cluster     : Named int [1:50] 5 3 3 5 3 5 4 5 3 5 ...
  ..- attr(*, "names")= chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
 $ centers     : num [1:5, 1:4] 2.95 6.11 12.14 5.59 11.3 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:5] "1" "2" "3" "4" ...
  .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
 $ totss       : num 355808
 $ withinss    : num [1:5] 4548 2286 16272 1480 3653
 $ tot.withinss: num 28240
 $ betweenss   : num 327568
 $ size        : int [1:5] 10 9 14 10 7
 $ iter        : int 3
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
The str() function gives the structure of the kmeans object, which includes components like withinss, betweenss, etc.; by analyzing these you can assess how well the clustering performed.
betweenss : the between-cluster sum of squares, i.e. a measure of inter-cluster separation
withinss : the within-cluster sum of squares of each cluster, i.e. a measure of intra-cluster compactness
tot.withinss : the sum of the withinss values of all the clusters, i.e. the total intra-cluster variation
A good clustering will have a lower value of tot.withinss and a higher value of betweenss, both of which depend on the number of clusters 'k' chosen initially. Let us see how we can find the optimal value of 'k'.
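One quick, informal check of this trade-off (a sketch using the fitted object cl from above, not something from the original post) is the share of the total sum of squares that the clustering explains, i.e. betweenss divided by totss; the closer it is to 1, the better. With the values shown above this works out to roughly 327568 / 355808, about 0.92.
> cl$betweenss / cl$totss   # proportion of total variance explained by the clustering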
An optimal value of 'k' is the value which gives us a converged set of clusters with minimum distortion. The greater the distortion, the worse the clusters formed.
Distortion:
The distortion can be calculated in terms of the 'withinss' of each of the clusters. The lower the 'withinss' value of a particular cluster, the more densely populated it is, and hence the lower the distortion.
kmeans.wss.k <- function(crime, k){
  km = kmeans(crime, k)
  return (km$tot.withinss)
}
This function takes the data and the value of k and returns 'km$tot.withinss' for it. 'km$tot.withinss' is the total within-cluster sum of squares, i.e. the sum of the withinss values of all the k clusters created, i.e. sum(withinss). The higher the value of 'km$tot.withinss', the greater the distortion.
For k=5, tot.withinss is 24417.02. (Since kmeans() starts from random centers, this value can differ between runs, which is why it does not match the tot.withinss shown in the str(cl) output above.)
> kmeans.wss.k(crime,5)
[1] 24417.02
Let’s increase the value of k from 5 to 10, and observe the difference.
> kmeans.wss.k(crime,10)
[1] 11083.04
It can be seen that as the value of k increases, the distortion decreases.
We can collect the different values of 'km$tot.withinss' and plot them on a graph to find the relationship between the distortion and the value of k. The following function does that for us:
> kmeans.dis <- function(crime, maxk){
+   dis=(nrow(crime)-1)*sum(apply(crime,2,var))
+   dis[2:maxk]=sapply(2:maxk, kmeans.wss.k, crime=crime)
+   return(dis)
+ }
> maxk = 10
> dis = kmeans.dis(crime, maxk);
> plot(1:maxk, dis, type='b', xlab="Number of Clusters",
+      ylab="Distortion",
+      col="blue")
Ta Da!!! Thus we have the famous elbow curve with us.
Elbow Curve:
This is the plot between 'k', the number of clusters, and 'tot.withinss' (the distortion) for each value of k. You can see that when the number of clusters is small, adding clusters reduces the distortion sharply, but as we keep increasing the value of k, the reduction in distortion becomes marginal and roughly constant.
The value of k beyond which the distortion decreases only marginally, the "elbow" of the curve, is the optimal value. Here k=4.
Let us apply some animation to understand how R gave us the clustered results.
> library(animation)
> cl <- kmeans.ani(crime, 4)
K-means Clustering Algorithm:
Let us understand the algorithm on which k-means clustering works:
Step #1. If k=4, we select 4 random points and assume them to be cluster centers for the clusters to be created.
Step #2. We take each data point in the space and find its distance from all 4 cluster centers. If a data point is closest to the green cluster center, it is colored green, and similarly all the points are categorised among the 4 clusters.
Step #3. Now we calculate the centroid of all the green points and assign that point as the cluster center for that cluster.
Similarly, we calculate the centroids of all 4 colored (clustered) groups of points and assign the new centroids as the cluster centers.
Step #4. Steps 2 and 3 are run iteratively until the cluster centers converge and no longer move.
Thus, we reach the converged cluster centers. A rough sketch of these steps written directly in R is given below.
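To make the four steps above concrete, here is a rough sketch of the same loop written in base R. This is only an illustration, not the actual implementation behind kmeans(); the function name my.kmeans is made up, and for simplicity the sketch assumes every cluster keeps at least one point in each iteration.
# Naive illustration of the k-means loop described in steps 1-4 above
my.kmeans <- function(x, k, max.iter = 100) {
  x <- as.matrix(x)
  # Step 1: pick k random rows of the data as the initial cluster centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in 1:max.iter) {
    # Step 2: assign every point to its nearest center (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    cluster <- apply(d, 1, which.min)
    # Step 3: recompute each center as the centroid of the points assigned to it
    new.centers <- t(sapply(1:k, function(j) colMeans(x[cluster == j, , drop = FALSE])))
    # Step 4: stop once the centers no longer move
    if (all(abs(new.centers - centers) < 1e-10)) break
    centers <- new.centers
  }
  list(cluster = cluster, centers = centers)
}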
It can be seen that the data is divided into 4 clusters. The cluster centers are:
> cl$centers
                  Murder  Assault UrbanPop  Rape
Texas           4.740741 104.8519 62.96296 16.10
Louisiana      10.907143 219.9286 71.71429 25.95
South Carolina 13.375000 284.5000 46.25000 25.05
New Mexico     11.040000 298.0000 77.60000 32.68
Cluster-4, with 'New Mexico' as its cluster center, has a very high crime rate along with the highest urban population.
Cluster-3 and Cluster-2 follow.
Each state is assigned a cluster, depending on which we can now predict its crime ranking. The output looks as follows:
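If you want to list the assignment yourself, here is a small sketch. It assumes cl comes from a plain kmeans(crime, 4) call, in which case cl$cluster is a named integer vector (the names being the states), so grouping its names by cluster number lists the states that fall in each cluster. Because of the random starting centers, the exact grouping can differ slightly from the animated run shown above.
> cl <- kmeans(crime, 4)
> split(names(cl$cluster), cl$cluster)   # states grouped by cluster number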
Got a question for us? Please mention it in the comments section and we will get back to you.
Hello! How can I assign the clusters according to the results?
Sir, I can't find the code for the animations in your blog. Can you please point me to the animation code?
Hey Goutham, thanks for checking out our blog. For the cluster animation, you need to import the package named "animation" into RStudio. The following is the code to do the cluster animation.
> install.packages("animation")
> library(animation)
> cl <- kmeans.ani(crime, 4)
You will be able to see the animation running in the plot window of RStudio.
Hope this helps. Cheers!
The code works correctly on my data set up to the distortion step, but I'm stuck on how to do the cluster assignment after that. Can you please provide instructions on how to implement it from the k-means algorithm?
So the provided dataset is a labelled one with cluster IDs. Am I right? We can use it for calculating the accuracy of the clustering.
Hey Satish, thanks for checking out our blog.
The clustering accuracy can be inspected visually with the animation chart of K-means clustering using the "animation" package, which shows the clustering process. If the cluster groups or centres are not overlapping with each other, we can conclude that the clustering is reasonable.
We can also measure the accuracy of the new labelling by comparing it with the original labelling (the original labelling being the ground truth), i.e. using the table() function.
Hope this helps. Cheers!
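For reference, a minimal sketch of the table() comparison mentioned above, assuming a hypothetical vector true_labels holding ground-truth cluster IDs in the same order as the rows of the data (this vector is not part of the USArrests data and is only for illustration):
> table(true_labels, cl$cluster)   # cross-tabulate assumed ground truth vs. fitted clusters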
How can we know the characteristics of each cluster?
Hey Septianusa, thanks for checking out our tutorial! In order to determine the characteristics of each cluster, you will have to analyse the variables separately for each cluster.
One of the things you can do is calculate the mean/median of the variables for every cluster and then determine which variable has a higher value in which cluster (see the sketch after this reply).
Or you can build classification models to understand characteristics of clusters.
For cluster k, build a classification model with two classes: one is 'k'; the other is 'others'. The training data consists of the data points in cluster k and data points randomly sampled from all the other clusters. By examining the variable importance of the classification model, you can tell which features have a bigger impact on the classification performance; in other words, which features better distinguish class k from class 'others'.
Hope this helps. Cheers!
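A small sketch of the mean-per-cluster idea mentioned above, using the crime0 data frame from the blog and the cluster vector from a kmeans fit cl (swap mean for median if you prefer):
> aggregate(crime0, by = list(cluster = cl$cluster), FUN = mean)   # per-cluster means of every variable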
Where can we download this dataset?
Click on the image with the sample data set (at the top).
There is an error in the function (kmeans.dis) code on line 3:
it should be "sapply(2:maxk, kmeans.wss.k, train3=train3)" instead of "sapply(2:maxk, kmeans.dis.k, train3=train3)".
The function "kmeans.wss.k" is the one created earlier and the same has to be used here.
Please correct me if I am wrong?
Thanks for highlighting this Vishwa. We have made the necessary changes.