Machine Learning has become one of the most in-demand skills in the market, and it is essential to know the various Machine Learning algorithms and how they work. In this blog on Naive Bayes In R, I intend to help you learn how Naive Bayes works and how it can be implemented using the R language.
To get in-depth knowledge on Data Science, you can enroll for live Data Science Certification Training by Edureka with 24/7 support and lifetime access.
Naive Bayes is a Supervised Machine Learning algorithm based on the Bayes Theorem that is used to solve classification problems by following a probabilistic approach. It rests on the idea that the predictor variables in a Machine Learning model are independent of each other, meaning that the outcome of the model depends on a set of independent variables that have nothing to do with one another.
In real-world problems, predictor variables aren’t always independent of each other; there are usually some correlations between them. Since Naive Bayes considers each predictor variable to be independent of every other variable in the model, it is called ‘Naive’.
Now let’s understand the logic behind the Naive Bayes algorithm.
The principle behind Naive Bayes is the Bayes Theorem, also known as the Bayes Rule. The Bayes Theorem is used to calculate conditional probability, which is nothing but the probability of an event occurring given that another, related event has already occurred. Mathematically, the Bayes Theorem is represented as:
P(A|B) = P(B|A) * P(A) / P(B)
In the above equation:
- P(A|B) is the conditional probability of event A occurring, given that event B has occurred
- P(B|A) is the conditional probability of event B occurring, given that event A has occurred
- P(A) and P(B) are the probabilities of events A and B occurring on their own

Formally, the terminologies of the Bayesian Theorem are as follows:
- P(A|B) is called the posterior probability
- P(B|A) is called the likelihood
- P(A) is called the prior probability of the proposition
- P(B) is called the prior probability of the evidence

Therefore, the Bayes theorem can be summed up as:

Posterior = (Likelihood * Proposition prior probability) / Evidence prior probability
It can also be considered in the following manner:
Given a Hypothesis H and evidence E, Bayes Theorem states that the relationship between the probability of Hypothesis before getting the evidence P(H) and the probability of the hypothesis after getting the evidence P(H|E) is:
P(H|E) = P(E|H) * P(H) / P(E)
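To see the rule in action with numbers (a made-up illustration, not part of the original example): suppose 1% of patients have a disease, so P(H) = 0.01; a test flags 90% of sick patients, P(E|H) = 0.9, and 10% of healthy ones, so P(E) = 0.9 * 0.01 + 0.1 * 0.99 = 0.108. A quick check in R:

#Bayes Rule on the hypothetical disease-testing numbers above
p_h <- 0.01                        #prior probability of the hypothesis
p_e_given_h <- 0.9                 #likelihood of the evidence given the hypothesis
p_e <- 0.9 * 0.01 + 0.1 * 0.99    #total probability of the evidence
p_e_given_h * p_h / p_e            #posterior: ~0.083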
Now that you know what the Bayes Theorem is, let’s see how it can be derived.
The main aim of the Bayes Theorem is to calculate the conditional probability. The Bayes Rule can be derived from the following two equations:
The below equation represents the conditional probability of A, given B:
P(A|B) = P(A ∩ B) / P(B)
The below equation represents the conditional probability of B, given A:
P(B|A) = P(A ∩ B) / P(A)
Therefore, on combining the above two equations we get the Bayes Theorem:
P(A|B) = P(B|A) * P(A) / P(B)
The above equation was for a single predictor variable; however, in real-world applications there is more than one predictor variable, and for a classification problem there is more than one output class. The classes can be represented as C1, C2, …, Ck, and the predictor variables can be represented as a vector x1, x2, …, xn.
The objective of the Naive Bayes algorithm is to measure the conditional probability of an observation with feature vector x1, x2, …, xn belonging to a particular class Ci:

P(Ci | x1, x2, …, xn) = P(x1, x2, …, xn | Ci) * P(Ci) / P(x1, x2, …, xn)

On expanding the joint probability in the numerator using the chain rule, we get:

P(x1, x2, …, xn | Ci) = P(x1 | x2, …, xn, Ci) * P(x2 | x3, …, xn, Ci) * … * P(xn | Ci)
However, since each predictor variable is assumed to be independent in Naive Bayes, the conditional probability P(xj | xj+1, …, xn, Ci) reduces to P(xj | Ci).

The final equation comes down to:

P(Ci | x1, x2, …, xn) = P(Ci) * P(x1 | Ci) * P(x2 | Ci) * … * P(xn | Ci) / P(x1, x2, …, xn)

Here, the denominator P(x1, x2, …, xn) is constant for all the classes, therefore we simply pick the class Ci that maximises the numerator:

C = argmax over Ci of P(Ci) * P(x1 | Ci) * P(x2 | Ci) * … * P(xn | Ci)
To get a better understanding of how Naive Bayes works, let’s look at an example.
Consider a data set with 1500 observations and the following output classes:

- Cat
- Parrot
- Turtle

The predictor variables are categorical in nature, i.e., they store one of two values, either True or False:

- Swim
- Wings
- Green Color
- Sharp Teeth
Naive Bayes Example – Naive Bayes In R – Edureka

From the above table, we can summarise the two predictors that matter for the upcoming observation (the Wings and Sharp Teeth columns are omitted here):

Type   | Swim       | Green Color | Total
Cat    | 450 (90%)  | 0 (0%)      | 500
Parrot | 50 (10%)   | 400 (80%)   | 500
Turtle | 500 (100%) | 100 (20%)   | 500

The class of type Cat shows that out of 500 cats, 450 (90%) can swim and none are green in colour.

The class of type Parrot shows that out of 500 parrots, 50 (10%) can swim and 400 (80%) are green in colour.

The class of type Turtle shows that all 500 turtles (100%) can swim and 100 (20%) are green in colour.

Since each class accounts for 500 of the 1500 observations, the prior probability of each class is 500/1500 ≈ 0.333.
Now, with the available data, let’s classify the following observation into one of the output classes (Cats, Parrot or Turtle) by using the Naive Bayes Classifier.
The goal here is to predict whether the animal is a Cat, Parrot or a Turtle based on the defined predictor variables (swim, wings, green, sharp teeth).
To solve this, we will use the Naive Bayes approach (here C1, C2, …, Cn denote the individual pieces of evidence):

P(H | Multiple Evidences) = P(C1|H) * P(C2|H) * … * P(Cn|H) * P(H) / P(Multiple Evidences)
In the observation, the variables Swim and Green are true and the outcome can be any one of the animals (Cat, Parrot, Turtle).
To check if the animal is a cat:
P(Cat | Swim, Green) = P(Swim|Cat) * P(Green|Cat) * P(Cat) / P(Swim, Green)
= 0.9 * 0 * 0.333 / P(Swim, Green)
= 0
To check if the animal is a Parrot:
P(Parrot| Swim, Green) = P(Swim|Parrot) * P(Green|Parrot) * P(Parrot) / P(Swim, Green)
= 0.1 * 0.80 * 0.333 / P(Swim, Green)
= 0.0266 / P(Swim, Green)
To check if the animal is a Turtle:
P(Turtle| Swim, Green) = P(Swim|Turtle) * P(Green|Turtle) * P(Turtle) / P(Swim, Green)
= 1 * 0.2 * 0.333 / P(Swim, Green)
= 0.0666/ P(Swim, Green)
For all the above calculations, the denominator is the same, i.e., P(Swim, Green). The value of P(Turtle | Swim, Green) is the greatest of the three, therefore we can correctly predict the class of the animal as Turtle.
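To make this concrete, here is a minimal R sketch that reproduces the hand calculation above. The probability values are taken straight from the worked example; the scores are unnormalised because the common denominator P(Swim, Green) is omitted:

#Class priors: 500 observations of each class out of 1500
prior <- c(Cat = 500/1500, Parrot = 500/1500, Turtle = 500/1500)

#Class-conditional probabilities read off the example table
p_swim  <- c(Cat = 0.9, Parrot = 0.1, Turtle = 1.0)
p_green <- c(Cat = 0.0, Parrot = 0.8, Turtle = 0.2)

#Unnormalised posterior scores for an observation with Swim = TRUE and Green = TRUE
scores <- p_swim * p_green * prior
scores                      #Cat: 0, Parrot: ~0.0266, Turtle: ~0.0666
names(which.max(scores))    #"Turtle"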
Now let’s see how you can implement Naive Bayes using the R language.
Problem Statement: To study a Diabetes data set and build a Machine Learning model that predicts whether or not a person has Diabetes.
Data Set Description: The given data set contains 768 observations of patients along with their health details. Here’s a list of the predictor variables that will help us classify a patient as either Diabetic or Normal:

- Pregnancies: number of times pregnant
- Glucose: plasma glucose concentration
- BloodPressure: diastolic blood pressure
- SkinThickness: triceps skin fold thickness
- Insulin: serum insulin level
- BMI: body mass index
- DiabetesPedigreeFunction: a score of diabetes likelihood based on family history
- Age: age of the patient in years

The response variable or the output variable is:

- Outcome: whether or not the patient has diabetes (coded 0 or 1 in the raw data)
Logic: To build a Naive Bayes model in order to classify patients as either Diabetic or normal by studying their medical records such as Glucose level, age, BMI, etc.
Now that you know the objective of this demo, let’s get our brains working and start coding. For this demo, I’ll be using the R language in order to build the model.
If you wish to learn more about R programming, you can go through this video recorded by our R Programming Experts.
Now, let’s begin.
Step 1: Install and load the required packages
#Loading required packages
install.packages('tidyverse')
library(tidyverse)
install.packages('ggplot2')
library(ggplot2)
install.packages('caret')
library(caret)
install.packages('caretEnsemble')
library(caretEnsemble)
install.packages('psych')
library(psych)
install.packages('Amelia')
library(Amelia)
install.packages('mice')
library(mice)
install.packages('GGally')
library(GGally)
install.packages('rpart')
library(rpart)
install.packages('randomForest')
library(randomForest)
Step 2: Import the data set
#Reading data into R
data <- read.csv("/Users/Zulaikha_Geer/Desktop/NaiveBayesData/diabetes.csv")
Before we study the data set, let’s convert the output variable (‘Outcome’) into a categorical variable. This is necessary because our output will be in the form of two classes, True or False, where True denotes that a patient has diabetes and False denotes that a patient is diabetes-free.
#Setting outcome variables as categorical
data$Outcome <- factor(data$Outcome, levels = c(0,1), labels = c("False", "True"))
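As a quick sanity check (this line is an addition, not part of the original walkthrough), you can tabulate the freshly converted factor to see how the two classes are distributed:

#Count observations in each Outcome class
table(data$Outcome)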
Step 3: Studying the Data Set
#Studying the structure of the data
str(data)
Understanding the data set – Naive Bayes In R – Edureka
#Display the first few observations
head(data)
Understanding the data set – Naive Bayes In R – Edureka
#Summary statistics via the psych package
describe(data)
Understanding the data set – Naive Bayes In R – Edureka
Step 4: Data Cleaning
While analyzing the structure of the data set, we can see that the minimum values for Glucose, BloodPressure, SkinThickness, Insulin, and BMI are all zero. This is not ideal, since no one can have a value of zero for glucose, blood pressure, and so on. Therefore, such values are treated as missing observations.
In the below code snippet, we’re setting the zero values to NA’s:
#Convert '0' values into NA
data[, 2:7][data[, 2:7] == 0] <- NA
To check how many missing values we have now, let’s visualize the data:
#visualize the missing data
missmap(data)
Missing Data Plot – Naive Bayes In R – Edureka
The above illustration shows that our data set has plenty of missing values, and removing all of them would leave us with an even smaller data set; therefore, we can perform imputations by using the mice package in R.
#Use mice package to predict missing values
mice_mod <- mice(data[, c("Glucose","BloodPressure","SkinThickness","Insulin","BMI")], method='rf')
mice_complete <- complete(mice_mod)

 iter imp variable
  1   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  1   2  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  1   3  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  1   4  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  1   5  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  2   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  2   2  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  2   3  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  2   4  Glucose  BloodPressure  SkinThickness  Insulin  BMI
  2   5  Glucose  BloodPressure  SkinThickness  Insulin  BMI

#Transfer the predicted missing values into the main data set
data$Glucose <- mice_complete$Glucose
data$BloodPressure <- mice_complete$BloodPressure
data$SkinThickness <- mice_complete$SkinThickness
data$Insulin <- mice_complete$Insulin
data$BMI <- mice_complete$BMI
To check if there are still any missing values, let’s use the missmap plot:
missmap(data)
Using Mice Package In R – Naive Bayes In R – Edureka
The output looks good; there is no missing data.
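If you prefer a numeric check over a plot, a one-liner (an addition, not in the original demo) confirms the same thing:

#Count remaining NAs per column; every count should be zero after imputation
colSums(is.na(data))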
Step 5: Exploratory Data Analysis
Now let’s perform a couple of visualizations to take a better look at each variable; this stage is essential for understanding the significance of each predictor variable.
#Data Visualization
#Visual 1
ggplot(data, aes(Age, colour = Outcome)) +
  geom_freqpoly(binwidth = 1) +
  labs(title="Age Distribution by Outcome")
Data Visualization – Naive Bayes In R – Edureka
#visual 2
c <- ggplot(data, aes(x=Pregnancies, fill=Outcome, color=Outcome)) +
  geom_histogram(binwidth = 1) +
  labs(title="Pregnancy Distribution by Outcome")
c + theme_bw()
Data Visualization – Naive Bayes In R – Edureka
#visual 3
P <- ggplot(data, aes(x=BMI, fill=Outcome, color=Outcome)) +
  geom_histogram(binwidth = 1) +
  labs(title="BMI Distribution by Outcome")
P + theme_bw()
Data Visualization – Naive Bayes In R – Edureka
#visual 4
ggplot(data, aes(Glucose, colour = Outcome)) +
  geom_freqpoly(binwidth = 1) +
  labs(title="Glucose Distribution by Outcome")
Data Visualization – Naive Bayes In R – Edureka
#visual 5
ggpairs(data)
Data Visualization – Naive Bayes In R – Edureka
Step 6: Data Modelling
This stage begins with a process called Data Splicing, wherein the data set is split into two parts:

- a training set, which holds 75% of the observations and is used to build the model
- a testing set, which holds the remaining 25% and is used to evaluate the model
#Building a model
#split data into training and test data sets
indxTrain <- createDataPartition(y = data$Outcome, p = 0.75, list = FALSE)
training <- data[indxTrain,]
testing <- data[-indxTrain,]

#Check dimensions of the split
> prop.table(table(data$Outcome)) * 100

    False     True
 65.10417 34.89583

> prop.table(table(training$Outcome)) * 100

    False     True
 65.10417 34.89583

> prop.table(table(testing$Outcome)) * 100

    False     True
 65.10417 34.89583
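One caveat worth knowing: createDataPartition draws a random (stratified) sample, so your split, and every number downstream of it, will differ slightly from run to run. To make the demo reproducible, set a seed before partitioning; the seed value here is arbitrary:

#Fix the random number generator so the partition is reproducible
set.seed(123)
indxTrain <- createDataPartition(y = data$Outcome, p = 0.75, list = FALSE)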
For comparing the outcomes of the training and testing phases, let’s create separate objects: x, which holds the predictor variables, and y, which holds the response variable:

#create objects x which holds the predictor variables and y which holds the response variable
x = training[,-9]
y = training$Outcome
Now it’s time to load the e1071 package, which provides a Naive Bayes implementation. Note that e1071 is a contributed CRAN package rather than a function built into base R.
library(e1071)
After loading the package, the below code snippet will create a Naive Bayes model by using the training data set:
model = train(x, y, 'nb', trControl = trainControl(method = 'cv', number = 10))

> model
Naive Bayes

576 samples
  8 predictor
  2 classes: 'False', 'True'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 518, 518, 519, 518, 519, 518, ...
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa
  FALSE      0.7413793  0.4224519
   TRUE      0.7622505  0.4749285

Tuning parameter 'fL' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0, usekernel = TRUE and adjust = 1.
We thus created a predictive model by using the Naive Bayes Classifier.
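As an aside, since e1071 is already loaded, you could also fit the model directly with its naiveBayes() function, skipping caret’s cross-validation wrapper. A minimal sketch (the nb_direct name is just for illustration):

#Fit a Naive Bayes model directly with e1071 (no cross-validation)
nb_direct <- naiveBayes(x, y)
#Predict on the test predictors, dropping the Outcome column
head(predict(nb_direct, testing[,-9]))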
Step 7: Model Evaluation
To check the efficiency of the model, we are now going to run the model on the testing data set, after which we will evaluate its accuracy by using a confusion matrix.
#Model Evaluation
#Predict testing set
Predict <- predict(model, newdata = testing)

#Get the confusion matrix to see accuracy value and other parameter values
> confusionMatrix(Predict, testing$Outcome)
Confusion Matrix and Statistics

          Reference
Prediction False True
     False    91   18
     True     34   49

               Accuracy : 0.7292
                 95% CI : (0.6605, 0.7906)
    No Information Rate : 0.651
    P-Value [Acc > NIR] : 0.01287

                  Kappa : 0.4352
 Mcnemar's Test P-Value : 0.03751

            Sensitivity : 0.7280
            Specificity : 0.7313
         Pos Pred Value : 0.8349
         Neg Pred Value : 0.5904
             Prevalence : 0.6510
         Detection Rate : 0.4740
   Detection Prevalence : 0.5677
      Balanced Accuracy : 0.7297

       'Positive' Class : False
The final output shows that we built a Naive Bayes classifier that can predict whether a person is diabetic or not, with an accuracy of approximately 73%.
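If you want to verify that figure by hand rather than reading it off the confusion matrix, comparing the predictions against the true labels gives the same number:

#Proportion of test-set predictions that match the true labels
mean(Predict == testing$Outcome)   #~0.73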
To summarise the demo, let’s draw a plot that shows how each predictor variable is independently responsible for predicting the outcome.
#Plot Variable performance
X <- varImp(model)
plot(X)
Variable Performance Plot – Naive Bayes In R – Edureka
From the above illustration, it is clear that ‘Glucose’ is the most significant variable for predicting the outcome.
Now that you know how Naive Bayes works, I’m sure you’re curious to learn more about the various Machine Learning algorithms. Here’s a list of blogs on Machine Learning Algorithms; do give them a read:
So, with this, we come to the end of this blog. I hope you all found this blog informative. If you have any thoughts to share, please comment them below. Stay tuned for more blogs like these!
If you are looking for online structured training in Data Science, edureka! has a specially curated Data Science course which helps you gain expertise in Statistics, Data Wrangling, Exploratory Data Analysis, Machine Learning Algorithms like K-Means Clustering, Decision Trees, Random Forest, Naive Bayes. You’ll learn the concepts of Time Series, Text Mining and an introduction to Deep Learning as well. New batches for this course are starting soon!!