Before we proceed with the analysis of the bank data using R, let me give a quick introduction to R. R is an integrated suite of software facilities for data manipulation, calculation and graphical display.
We call R an environment within which many classical and modern statistical techniques have been integrated. About 25 packages are supplied with R, and more than 3,000 are available through the Comprehensive R Archive Network (CRAN) family of Internet sites (via http://CRAN.R-project.org) and elsewhere.
Though R is great software, it isn’t the right tool for every problem. You should understand your problem, and the limitations of R, before you use it. R is very good at plotting graphics, analyzing data, and fitting statistical models using data that fits in the computer’s memory. It is not as good at storing data in complicated structures, efficiently querying data, or working with data that doesn’t fit in memory.
That’s the reason I used Hadoop to pre-process the large files before moving them into R. You can do the same task in R using regular expressions and binding operations, but it would be quite new and complex for a beginner. Also, R holds datasets in RAM, so you can’t work with large datasets unless you have a machine with plenty of memory. To hold large data files, I usually use a database like MySQL, or a framework like Hadoop.
Capability
There are thousands of statistical and data analysis algorithms in R. None of its counterparts offers anywhere near the variety of functionality that is available through CRAN.
Community
There are millions of R users worldwide, and the number is growing rapidly thanks to R’s capabilities. You can always share your knowledge, doubts or suggestions with them through various forums.
Performance
R’s performance is excellent compared to other commercial analysis packages. R loads datasets into memory before processing, so the only thing you need is a machine with a good configuration to use its functionality to the maximum extent. Higher-memory machines are within everyone’s reach now, since memory is far cheaper today than when R was first developed. That is probably one of the biggest reasons why the R user base is growing at this pace.
The overall process of finding and interpreting patterns from data involves the repeated application of the following steps:
Step 1: Loading and developing an understanding of the data
We import the files that Pig saved into HDFS under the directory named “combined_out”.
## setting Hadoop environment variables in the R environment
Sys.setenv(JAVA_HOME="/home/abhay/java")
Sys.setenv(HADOOP_HOME="/home/abhay/hadoop")
Sys.setenv(HADOOP_CMD="/home/abhay/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/home/abhay/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar")
##loading RHadoop packages
library(rmr2)
library(rhdfs)
hdfs.init()
##setting the Hadoop root path and reading the file from HDFS
hdfs.root <- '/bank_project'
hdfs.data <- file.path(hdfs.root, 'combined_out/part-r-00000')
content <- hdfs.read.text.file(hdfs.data)
clickpath <- read.table(textConnection(content), sep = ",")
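Before going further, it is worth a quick sanity check that the import worked; this snippet is an addition, not part of the original post:
## quick sanity check on the imported data
dim(clickpath)
head(clickpath)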
Step 2: Creating a target dataset
##naming all the columns fetched from HDFS
colnames(clickpath) <- c("ac_id","disposal_type","age","sex","card_type","dist","avg_sal","unemp_rate","entrepreneur_no","trans_sum","loan_amount","loan_status")
Step 3: Data cleaning and pre-processing
##checking structure of the fetched data
str(clickpath)
##list of rows with missing values
clickpath[!complete.cases(clickpath),]
##list of columns with missing values
clickpath[, colSums(is.na(clickpath)) > 0]
## omit any rows with missing values
clickpath <- na.omit(clickpath)
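As a quick verification (an addition to the original flow), we can confirm that no incomplete rows remain:
## count rows that still have missing values; this should print 0
sum(!complete.cases(clickpath))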
Step 4: Data reduction and projection
##keeping only the numeric columns: age, avg_sal, unemp_rate, entrepreneur_no, trans_sum and loan_amount
mydata <- clickpath[,c(3,7:11)]
# First check the complete set of components for outliers
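The screening plots appeared as images in the original post; a minimal sketch to reproduce them, assuming side-by-side boxplots were used:
## draw a boxplot for every numeric attribute to screen for outliers
boxplot(mydata, las = 2, main = "Numeric attributes")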
## As we can see from the plots above, avg_sal, unemp_rate and loan_amount have some outliers in the data. Let’s analyze all three individually.
# outlier in avg_sal
# Since avg_sal is one of the most informative attributes, we can’t simply discard values without investigation. Hence we check the scatterplot of this attribute for more clarity.
plot(mydata[,c(2)])
## from this plot we observe that a large number of entries have values in the outlier range. Hence, it would not be a good idea to remove these outliers.
#outlier in unemp_rate
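The graph discussed below was an image in the original post; a minimal sketch to reproduce it, assuming a boxplot was used:
## boxplot of unemp_rate to inspect its outliers
boxplot(mydata$unemp_rate, main = "unemp_rate")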
## From this graph we can see that a few entries lie as outliers. Since this attribute is not that critical, we can decide to remove its outliers. There are many ways of handling outliers; I chose to replace them with the maximum non-outlier value, which is ~1.5.
## defining function to replace outliers
library(data.table)
outlierReplace = function(dataframe, cols, rows, newValue = NA) {
  ## set() from data.table updates the given rows/columns by reference
  if (length(rows) > 0) {
    set(dataframe, rows, cols, newValue)
  }
}
#calling the outlierReplace function on unemp_rate to replace all the outliers with the maximum non-outlier value
outlierReplace(clickpath, "unemp_rate", which(clickpath$unemp_rate > 1.5), 1.5)
## mydata is a copy of these columns, so apply the same replacement there before re-checking
outlierReplace(mydata, "unemp_rate", which(mydata$unemp_rate > 1.5), 1.5)
## now checking the five-number summary of the attribute to verify that the outliers have been replaced
fivenum(mydata$unemp_rate)
#outlier in loan_amount
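As before, the plot below was an image in the original post; a minimal sketch, assuming a boxplot:
## boxplot of loan_amount to inspect its outliers
boxplot(mydata$loan_amount, main = "loan_amount")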
## from this plot we observe that a large number of entries have values in the outlier range. Considering the sensitivity of this attribute, it would not be a good idea to remove these outliers.
##Since the data attributes are of different kinds, their scales also differ. To bring them onto a uniform scale, we standardize the columns.
mydata <- scale(mydata)
Step 5: Choosing the data mining task
## Calculating the total sum of squares of the data (the within-cluster sum of squares for a single cluster) and storing it at the first index of wss
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
Step 6: Choosing the data mining algorithm(s)
##We are going to use the k-means algorithm for this clustering.
Clustering Analysis
Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is the main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and error.
K-Means Algorithm Properties
There are always K clusters.
There is always at least one item in each cluster.
The clusters are non-hierarchical and they do not overlap.
Every member of a cluster is closer to the center of its own cluster than to the center of any other cluster.
The K-Means Algorithm Process
The choice of initial partition can greatly affect the final clusters that result, in terms of inter-cluster and intra-cluster distances and cohesion; the sketch below shows one common safeguard.
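One standard way to reduce this sensitivity (an addition, not shown in the original) is to run k-means with several random initial partitions via the nstart argument and keep the best run; centers = 3 here is purely for illustration:
## run k-means with 25 random starts; kmeans keeps the run with the
## lowest total within-cluster sum of squares
set.seed(42)
fit_demo <- kmeans(mydata, centers = 3, nstart = 25)
fit_demo$tot.withinss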
## for k = 2..15, run k-means and store the total within-cluster sum of squares in wss
for (i in 2:15) {
  wss[i] <- sum(kmeans(mydata, centers = i, iter.max = 15)$withinss)
}
## plot wss for each number of clusters to display the elbow graph
plot(1:15, wss, type = "b", main = "15 clusters", xlab = "number of clusters", ylab = "within-cluster sum of squares")
Step 7: Searching for patterns of interest in a particular representational form
##As we can see from the output above, the slope of the graph changes most sharply at 3 clusters, hence we take the optimal number of clusters to be 3.
fit <- kmeans(mydata,3)
## Let’s check the summary of the kmeans objects
kmeans returns an object of class "kmeans", which has a print and a fitted method. It is a list with at least the following components:
cluster – A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers – A matrix of cluster centres.
totss – The total sum of squares.
withinss – Vector of within-cluster sum of squares, one component per cluster.
tot.withinss – Total within-cluster sum of squares, i.e. sum(withinss).
betweenss – The between-cluster sum of squares, i.e. totss - tot.withinss.
size – The number of points in each cluster.
iter – The number of (outer) iterations.
ifault – Integer: indicator of a possible algorithm problem – for experts.
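A quick way to see all of these components on our fitted object at once (an addition to the original):
## inspect the structure of the kmeans object
str(fit)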
## checking withinss, i.e. the intra-cluster sum of squares (cohesion) of each cluster
fit$withinss
## checking betweenss, i.e. the inter-cluster sum of squares (separation between clusters)
fit$betweenss
## checking the number of points in each cluster
fit$size
Step 8: Interpreting mined patterns
## plot the data coloured by cluster and overlay the cluster centers
plot(mydata, col = fit$cluster, pch = 15)
points(fit$centers, col = 1:8, pch = 3)
## discriminant-projection plot of the clusters using the fpc package
library(cluster)
library(fpc)
plotcluster(mydata, fit$cluster)
points(fit$centers, col = 1:8, pch = 16)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
## checking the mean of each attribute within each cluster
Usually, as the result of a k-means clustering analysis, we would examine the means for each cluster on each dimension to assess how distinct our k clusters are.
mydata <- clickpath[, c(3, 7:12)]
mydata <- data.frame(mydata, fit$cluster)
cluster_mean <- aggregate(mydata[, 1:7], by = list(cluster = fit$cluster), FUN = mean)
cluster_mean
K-means clustering, particularly when using heuristics such as Lloyd’s algorithm, is rather easy to implement and apply even on large datasets. It has been used successfully in various fields, including market segmentation, computer vision, geostatistics, astronomy and agriculture. It is often used as a preprocessing step for other algorithms, for example to find a starting configuration.
In the next blog, we will learn about the application of logistic regression and market basket analysis to bank data.
Got a question for us? Please mention it in the comments section and we will get back to you.