Cluster analysis is a fundamental modelling technique for grouping similar observations together. The steps involved apply regardless of which clustering technique is used.
Here are the steps for Cluster Analysis:
1. Choose the Right Variables – Identify which attributes matter and how much each one is worth. Select the variables that you believe are important for identifying and understanding differences among groups of observations within the data.
2. Scale the Data – Data from different sources may be measured on different scales. For example, personal data might include age ranging from 0 to 100, weight from 40 to 180, and height from 1 to 6 feet. When the variables in the analysis vary in range, the variable with the largest range will have the greatest impact on the results.
3. Calculate Distances – Clusters are formed by measuring how far apart the observations are. If the variables still vary in range at this point, the variable with the largest range will dominate the distances, which is why scaling comes first.
Note that each attribute has its own scale. If we want to combine the attributes in a single equation, we must normalize them, that is, bring all attributes and variables to a common scale. For example, suppose we are analyzing weather and evaluating sample data from India and the US. The scales differ because one uses the metric system and the other uses US customary units. Our objective is therefore to bring them to the same standard. Also, the basic purpose of cluster analysis is to calculate distances.
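The effect of scaling on distances can be seen in a minimal sketch. The four people below are made-up values, not taken from any real data set, and z-score standardization (subtract the mean, divide by the standard deviation) is assumed here as one common way to bring variables to the same standard; other normalization schemes would serve as well.

```python
import math
from statistics import mean, pstdev

# Hypothetical observations: (age in years, weight, height in feet).
person_a = (25, 60.0, 5.0)
person_b = (26, 62.0, 6.0)   # similar age and weight to A, very different height
person_c = (30, 75.0, 5.1)   # similar height to A, somewhat different age and weight
person_d = (55, 150.0, 5.9)
people = [person_a, person_b, person_c, person_d]


def euclidean(a, b):
    """Straight-line distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1 (z-scores)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) for col in columns]  # assumes no column is constant
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Distances on the raw values: weight and age dominate, height barely counts.
raw_ab = euclidean(person_a, person_b)
raw_ac = euclidean(person_a, person_c)

# Distances after standardizing: every variable contributes on the same scale.
scaled = standardize(people)
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

print(f"raw:    A-B = {raw_ab:.2f}, A-C = {raw_ac:.2f}")
print(f"scaled: A-B = {scaled_ab:.2f}, A-C = {scaled_ac:.2f}")
```

On the raw values, A looks closest to B because a one-foot height gap is tiny next to the age and weight ranges. After standardizing, that height gap counts as much as any other variable, and A ends up closer to C.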
Calculation of Distance between Points in a Cluster
Here, the objective is to group similar points together into one cluster. To decide how far apart two clusters are, there are several options:
1) Take the center of one cluster and the center of the next, and calculate the distance between the centers.
2) Or take the closest pair of points, one from each cluster, and find the distance between them.
3) Or take the farthest pair of points, one from each cluster, and find the distance between them.
These options correspond to the following linkage methods:
Single linkage – shortest distance between a point in one cluster and a point in the other cluster. It tends to produce elongated clusters.
Complete linkage – longest distance between a point in one cluster and a point in the other cluster
Average linkage – average distance between each point in one cluster and each point in the other cluster
Centroid – distance between the centroids (mean vectors over the variables) of the two clusters
Ward – combines the clusters whose merger leads to the smallest increase in the within-cluster sum of squares, summed over all variables
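The linkage definitions above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a library implementation: the function names and the two small example clusters are assumptions made up for this example, and Euclidean distance is assumed as the point-to-point measure.

```python
import math
from itertools import product


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(cluster):
    """Mean vector over the variables."""
    return tuple(sum(col) / len(col) for col in zip(*cluster))


def single_linkage(c1, c2):
    """Shortest distance between a point in c1 and a point in c2."""
    return min(euclidean(a, b) for a, b in product(c1, c2))


def complete_linkage(c1, c2):
    """Longest distance between a point in c1 and a point in c2."""
    return max(euclidean(a, b) for a, b in product(c1, c2))


def average_linkage(c1, c2):
    """Average distance over every cross-cluster pair of points."""
    distances = [euclidean(a, b) for a, b in product(c1, c2)]
    return sum(distances) / len(distances)


def centroid_linkage(c1, c2):
    """Distance between the two cluster centroids."""
    return euclidean(centroid(c1), centroid(c2))


def within_ss(cluster):
    """Sum of squared deviations from the centroid over all variables."""
    center = centroid(cluster)
    return sum((x - m) ** 2 for point in cluster for x, m in zip(point, center))


def ward_cost(c1, c2):
    """Increase in within-cluster sum of squares if c1 and c2 were merged."""
    return within_ss(c1 + c2) - within_ss(c1) - within_ss(c2)


# Two hypothetical clusters of 2-D points.
left = [(0.0, 0.0), (0.0, 2.0)]
right = [(3.0, 0.0), (5.0, 0.0)]

for name, measure in [
    ("single", single_linkage),
    ("complete", complete_linkage),
    ("average", average_linkage),
    ("centroid", centroid_linkage),
    ("ward", ward_cost),
]:
    print(f"{name:>8}: {measure(left, right):.3f}")
```

Single linkage never exceeds average linkage, which never exceeds complete linkage, since they are the minimum, mean and maximum of the same set of pairwise distances. Ward is not a distance in the same sense: it scores a candidate merge by how much it would inflate the within-cluster spread.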
Note: These concepts apply across multiple techniques, and every technique offers multiple options to choose from. In cluster analysis, this approach is called hierarchical cluster analysis, where one can use any of several methods. Each method has its own advantages, disadvantages and properties.
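To see how the choice of method changes the outcome, here is a minimal sketch of bottom-up (agglomerative) hierarchical clustering with the linkage passed in as a parameter. The function names and the one-dimensional sample points are assumptions chosen for illustration, and the brute-force search is written for readability, not speed.

```python
import math
from itertools import combinations, product


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single(c1, c2):
    """Shortest cross-cluster distance."""
    return min(euclidean(a, b) for a, b in product(c1, c2))


def complete(c1, c2):
    """Longest cross-cluster distance."""
    return max(euclidean(a, b) for a, b in product(c1, c2))


def agglomerate(points, linkage, n_clusters):
    """Start with one cluster per point, then repeatedly merge the closest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters (i < j) with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical points on a line, with gaps that widen slowly from left to right.
points = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,), (6.0,)]

by_single = agglomerate(points, single, n_clusters=2)
by_complete = agglomerate(points, complete, n_clusters=2)

print("single:  ", by_single)
print("complete:", by_complete)
```

With these points, single linkage keeps chaining neighbours onto one long cluster and leaves the last point on its own, while complete linkage produces a more balanced split. The data is the same in both runs; only the linkage differs.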