KNN Algorithm using Python | K Nearest Neighbors Algorithm

With so much of the business world now revolving around data, Data Science has become one of the most sought-after fields, hence the heavy demand for Data Science certification. In this blog on the KNN algorithm, you will understand how the algorithm works and how it can be implemented in Python.

What is KNN Algorithm?

K nearest neighbors, or KNN, is a simple algorithm that keeps the entire dataset as its model: there is no separate training step. Whenever a prediction is required for an unseen data instance, it searches through the entire training dataset for the k most similar instances, and the most common class among those k instances is returned as the prediction.

kNN is often used in search applications where you are looking for similar items, for example “find items similar to this one”.

The algorithm suggests that if you’re similar to your neighbours, then you are one of them. For example, if an apple looks more similar to a peach, a pear, and a cherry (fruits) than to a monkey, a cat, or a rat (animals), then the apple is most likely a fruit.

How does a KNN Algorithm work?

The k-nearest neighbors algorithm uses a very simple approach to perform classification. When given a new example, it looks through the training data and finds the k training examples that are closest to the new example. It then assigns the most common class label (among those k training examples) to the new example.

What does ‘k’ in kNN Algorithm represent?

k in kNN algorithm represents the number of nearest neighbor points which are voting for the new test data’s class.

If k=1, then test examples are given the same label as the closest example in the training set.

If k=3, the labels of the three closest training examples are checked and the most common label (i.e., the one occurring at least twice) is assigned, and so on for larger values of k.
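The effect of k can be seen in a small sketch. The labels below are hypothetical and are assumed to be already sorted from the nearest training example to the farthest; `vote` is an illustrative helper, not part of the implementation later in this blog.

```python
# Hypothetical labels of the training examples, sorted nearest first.
labels_nearest_first = ["b", "a", "a", "b", "b"]

def vote(labels, k):
    """Return the most common label among the k nearest neighbors."""
    nearest = labels[:k]
    # Pick the label that appears most often among the k nearest.
    return max(set(nearest), key=nearest.count)

print(vote(labels_nearest_first, 1))  # b: the single closest example decides
print(vote(labels_nearest_first, 3))  # a: two of the three closest are "a"
```

With only two classes, choosing an odd k avoids tied votes.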

kNN Algorithm Manual Implementation

Let’s consider this example:

Suppose we have the height, weight, and corresponding T-shirt size of several customers. Your task is to predict the T-shirt size of Anna, whose height is 161 cm and whose weight is 61 kg.

Step 1: Calculate the Euclidean distance between the new point and the existing points

For example, the Euclidean distance between points P1(1,1) and P2(5,4) is:

sqrt((5 - 1)^2 + (4 - 1)^2) = sqrt(16 + 9) = sqrt(25) = 5
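The distance between P1 and P2 can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are chosen for illustration.

```python
import math

p1 = (1, 1)
p2 = (5, 4)

# Square the difference along each axis, add the squares, take the root.
squared_sum = (p2[0] - p1[0]) ** 2 + (p2[1] - p1[1]) ** 2  # 16 + 9 = 25
distance = math.sqrt(squared_sum)

print(distance)  # 5.0
```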

Step 2: Choose the value of K and select the K neighbors closest to the new point.

In this case, take K = 5 and select the 5 points with the smallest Euclidean distance.

Step 3: Count the votes of all the K neighbors / Predicting Values

Since 4 of the K = 5 nearest neighbors wear size M, the kNN algorithm predicts that Anna, with a height of 161 cm and a weight of 61 kg, will fit into a T-shirt of size M.
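The three manual steps can be sketched in Python. The customer list below is made-up sample data for illustration only, not the table used in this blog; only Anna's measurements and K = 5 come from the example above.

```python
import math
from collections import Counter

# Made-up sample data for illustration only: (height_cm, weight_kg, size).
customers = [
    (158, 58, "M"), (160, 59, "M"), (160, 60, "M"), (163, 60, "M"),
    (163, 61, "M"), (160, 64, "L"), (165, 61, "L"), (168, 62, "L"),
]
anna = (161, 61)
k = 5

# Step 1: sort the customers by their Euclidean distance to Anna.
by_distance = sorted(
    customers,
    key=lambda c: math.hypot(c[0] - anna[0], c[1] - anna[1]),
)
# Step 2: keep the k closest customers.
nearest = by_distance[:k]
# Step 3: count the votes per size and take the most common one.
votes = Counter(size for _, _, size in nearest)
prediction = votes.most_common(1)[0][0]

print(prediction)  # M
```

With this sample data, four of the five nearest customers wear size M, so the vote goes to M.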

Implementation of kNN Algorithm using Python

Handling the data
Calculate the distance
Find k nearest point
Predict the class
Check the accuracy

Don’t just read it, practise it!

Step 1: Handling the data

The very first step will be handling the iris dataset. Open the dataset using the open function and read the data lines with the reader function available under the csv module.

[python]
import csv

# Adjust the path to wherever you saved the iris data file.
with open(r'C:\Users\Atul Harsha\Documents\iris.data.txt') as csvfile:
    lines = csv.reader(csvfile)
    for row in lines:
        print(', '.join(row))
[/python]

Now you need to split the data into a training dataset (for making the prediction) and a testing dataset (for evaluating the accuracy of the model).

Before you continue, convert the flower measurements, which are loaded as strings, to numbers. Next, randomly split the dataset into train and test datasets. A train/test ratio of 67/33 is a common choice.

Putting it all together, let’s define a function handleDataset that loads the CSV file with the given filename and splits it randomly into train and test datasets using the provided split ratio.

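[python]
import csv
import random

def handleDataset(filename, split, trainingSet, testSet):
    # Rows are appended to the trainingSet and testSet lists passed in by the caller.
    with open(filename, 'r') as csvfile:
        dataset = list(csv.reader(csvfile))
    for row in dataset:
        if not row:
            continue  # skip blank lines, such as a trailing one at the end of the file
        # The first four columns are measurements; the fifth is the class label.
        for y in range(4):
            row[y] = float(row[y])
        # Send roughly `split` of the rows to the training set.
        if random.random() < split:
            trainingSet.append(row)
        else:
            testSet.append(row)
[/python]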

Let’s check that the function above works:

Testing handleDataset function

[python]
trainingSet = []
testSet = []
handleDataset(r'iris.data', 0.66, trainingSet, testSet)
print('Train: ' + repr(len(trainingSet)))
print('Test: ' + repr(len(testSet)))
[/python]
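Because the split is random, the two counts printed by this test change from run to run. The sketch below shows one way to make a split repeatable and to try it without a file on disk. It is not the blog's handleDataset: `split_rows`, `SAMPLE`, and the `seed` parameter are names assumed here for illustration, and the five rows are just a small sample in the iris.data layout.

```python
import csv
import io
import random

# Five rows in the iris.data layout, kept in memory so no file is needed.
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

def split_rows(text, split, seed):
    """Parse CSV text and split it into (train, test) with a seeded generator."""
    rng = random.Random(seed)  # private generator: same seed, same split
    train, test = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        # Convert the four measurements to floats and keep the label as text.
        parsed = [float(value) for value in row[:4]] + [row[4]]
        (train if rng.random() < split else test).append(parsed)
    return train, test

train, test = split_rows(SAMPLE, 0.67, seed=1)
print(len(train), len(test))
```

Calling `split_rows` again with the same seed returns the same two lists, which makes later accuracy figures comparable between runs.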

Step 2: Calculate the distance

In order to make any predictions, you have to calculate the distance between the new point and the existing points, as you will be needing k closest points.

In this case, we will use the Euclidean distance, defined as the square root of the sum of the squared differences between the two arrays of numbers.

Specifically, we need only the first 4 attributes (features) for the distance calculation, as the last attribute is the class label. One approach is to limit the Euclidean distance to a fixed length, thereby ignoring the final dimension.

Summing it up, let’s define the euclideanDistance function as follows:

[python]
import math

def euclideanDistance(instance1, instance2, length):
    # Compare only the first `length` values so the class label is ignored.
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)
[/python]

Testing the euclideanDistance function,

[python]
data1 = [2, 2, 2, 'a']
data2 = [4, 4, 4, 'b']
distance = euclideanDistance(data1, data2, 3)
print('Distance: ' + repr(distance))
[/python]
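As a cross-check, the same distance can be computed with the standard library, assuming Python 3.8 or newer, where `math.dist` is available. Slicing off the class label plays the role of the `length` argument. This is an alternative sketch, not the function used in the rest of this blog.

```python
import math

data1 = [2, 2, 2, 'a']
data2 = [4, 4, 4, 'b']

# Drop the class label with a slice, then let math.dist do the work.
distance = math.dist(data1[:3], data2[:3])

print(distance)  # square root of 12, roughly 3.46
```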

Step 3: Find k nearest point

Now that you can calculate the distance between any two points, you can use it to collect the k most similar points/instances for a given test instance.

This is a straightforward process: calculate the distance to every training instance and select the k instances with the smallest Euclidean distance.

Let’s create a getKNeighbors function that returns the k most similar neighbors from the training set for a given test instance.

[python]
import operator

def getKNeighbors(trainingSet, testInstance, k):
    distances = []
    # The last item of testInstance is its class label, so leave it out.
    length = len(testInstance) - 1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    # Sort by distance, nearest first, and keep the first k rows.
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors
[/python]

Testing getKNeighbors function

[python]
trainSet = [[2, 2, 2, 'a'], [4, 4, 4, 'b']]
# The last item stands in for the unknown class label, which getKNeighbors drops.
testInstance = [5, 5, 5, '?']
k = 1
neighbors = getKNeighbors(trainSet, testInstance, k)
print(neighbors)
[/python]
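Sorting every distance is fine for a small dataset. When only the k nearest rows are needed, `heapq.nsmallest` from the standard library can pick them without sorting the whole list. The sketch below is an alternative, not the blog's function; `k_nearest` is a name assumed for illustration, and it assumes every row, including the test instance, ends with a class label.

```python
import heapq
import math

def k_nearest(training_set, test_instance, k):
    """Return the k training rows closest to test_instance, nearest first."""
    n_features = len(test_instance) - 1  # the last item is the class label

    def distance_to_test(row):
        return math.sqrt(sum((row[i] - test_instance[i]) ** 2
                             for i in range(n_features)))

    # nsmallest returns the k rows with the smallest key, in ascending order.
    return heapq.nsmallest(k, training_set, key=distance_to_test)

train_set = [[2, 2, 2, 'a'], [4, 4, 4, 'b'], [9, 9, 9, 'b']]
print(k_nearest(train_set, [5, 5, 5, '?'], 2))  # [[4, 4, 4, 'b'], [2, 2, 2, 'a']]
```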

Step 4: Predict the class

Now that you have the k nearest points/neighbors for the given test instance, the next task is to predict the response based on those neighbors.

You can do this by allowing each neighbor to vote for their class attribute, and take the majority vote as the prediction.

Let’s create a getResponse function for getting the majority voted response from a number of neighbors.

[python]
import operator

def getResponse(neighbors):
    classVotes = {}
    # Each neighbor votes for its own class label (the last item of the row).
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    # Sort the labels by vote count, highest first, and return the winner.
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]
[/python]

Testing getResponse function

[python]
neighbors = [[1, 1, 1, 'a'], [2, 2, 2, 'a'], [3, 3, 3, 'b']]
print(getResponse(neighbors))
[/python]
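The same vote can be written more compactly with `collections.Counter`. This is an alternative sketch rather than the function used in the rest of this blog, and `majority_vote` is a name assumed for illustration.

```python
from collections import Counter

def majority_vote(neighbors):
    """Return the most common class label (last item) among the neighbors."""
    votes = Counter(row[-1] for row in neighbors)
    return votes.most_common(1)[0][0]

neighbors = [[1, 1, 1, 'a'], [2, 2, 2, 'a'], [3, 3, 3, 'b']]
print(majority_vote(neighbors))  # a
```

On Python 3.7 and newer, `most_common` lists labels with equal counts in the order they were first seen, so if the neighbors are passed nearest first, a tie goes to the label that appears closest to the test instance.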

Step 5: Check the accuracy

Now that we have all of the pieces of the kNN algorithm in place, let’s check how accurate our predictions are!

An easy way to evaluate the accuracy of the model is to calculate a ratio of the total correct predictions out of all predictions made.

Let’s create a getAccuracy function which sums the total correct predictions and returns the accuracy as a percentage of correct classifications.

[python]
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        # Compare labels with ==; `is` checks object identity, not equality.
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0
[/python]

Testing getAccuracy function

[python]
testSet = [[1, 1, 1, 'a'], [2, 2, 2, 'a'], [3, 3, 3, 'b']]
predictions = ['a', 'a', 'a']
accuracy = getAccuracy(testSet, predictions)
print(accuracy)
[/python]
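For comparison, here is a compact sketch of the same ratio that works on plain label lists and rejects an empty test set instead of dividing by zero. `accuracy_percent` is a hypothetical helper, not part of the blog's code. Labels are compared with `==`, because `is` tests whether two names refer to the same object rather than whether the values are equal.

```python
def accuracy_percent(actual_labels, predictions):
    """Percentage of predictions that equal the actual labels."""
    if not actual_labels:
        raise ValueError("need at least one label")
    # True counts as 1 and False as 0, so sum() counts the matches.
    correct = sum(a == p for a, p in zip(actual_labels, predictions))
    return 100.0 * correct / len(actual_labels)

test_set = [[1, 1, 1, 'a'], [2, 2, 2, 'a'], [3, 3, 3, 'b']]
actual = [row[-1] for row in test_set]
print(accuracy_percent(actual, ['a', 'a', 'a']))  # two of three correct
```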

Since we have created all the pieces of the KNN algorithm, let’s tie them together in a main function.

[python]
# Example of kNN implemented from scratch in Python

import csv
import random
import math
import operator

def handleDataset(filename, split, trainingSet, testSet):
    with open(filename, 'r') as csvfile:
        dataset = list(csv.reader(csvfile))
    for row in dataset:
        if not row:
            continue  # skip blank lines
        for y in range(4):
            row[y] = float(row[y])
        if random.random() < split:
            trainingSet.append(row)
        else:
            testSet.append(row)

def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

def getKNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance) - 1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

def main():
    # prepare data
    trainingSet = []
    testSet = []
    split = 0.67
    handleDataset('iris.data', split, trainingSet, testSet)
    print('Train set: ' + repr(len(trainingSet)))
    print('Test set: ' + repr(len(testSet)))
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getKNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')

main()
[/python]

This was all about the kNN algorithm using Python. If you still have a question, feel free to post it in the blog’s comment section.
