Machine learning made easy with Python

Solve real-world machine learning problems with Naïve Bayes classifiers.

Opensource.com

Naïve Bayes is a classification technique that serves as the basis for implementing several classifier modeling algorithms. Naïve Bayes-based classifiers are considered some of the simplest, fastest, and easiest-to-use machine learning techniques, yet are still effective for real-world applications.

Naïve Bayes is based on Bayes' theorem, formulated by 18th-century statistician Thomas Bayes. This theorem assesses the probability that an event will occur based on conditions related to the event. For example, an individual with Parkinson's disease typically has voice variations; hence such symptoms are considered related to the prediction of a Parkinson's diagnosis. The original Bayes' theorem provides a method to determine the probability of a target event, and the Naïve variant extends and simplifies this method.

Solving a real-world problem

This article demonstrates a Naïve Bayes classifier's capabilities to solve a real-world problem (as opposed to a complete business-grade application). I'll assume you have basic familiarity with machine learning (ML), so some of the steps that are not primarily related to ML prediction, such as data shuffling and splitting, are not covered here. If you are an ML beginner or need a refresher, see An introduction to machine learning today and Getting started with open source machine learning.

The Naïve Bayes classifier is supervised, generative, non-linear, parametric, and probabilistic.

In this article, I'll demonstrate using Naïve Bayes with the example of predicting a Parkinson's diagnosis. The dataset for this example comes from the UCI Machine Learning Repository. The data includes several speech signal measurements used to assess the likelihood of the medical condition; this example uses the first eight of them:

  • MDVP:Fo(Hz): Average vocal fundamental frequency
  • MDVP:Fhi(Hz): Maximum vocal fundamental frequency
  • MDVP:Flo(Hz): Minimum vocal fundamental frequency
  • MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, and Jitter:DDP: Five measures of variation in fundamental frequency

The dataset used in this example, shuffled and split for use, is available in my GitHub repository.
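The article does not say how the data was shuffled and split, so the following is only a minimal sketch of one common way to do it, using scikit-learn's train_test_split. The toy DataFrame, the 80/20 ratio, and the random_state value are assumptions for illustration, not the settings used to produce the files in the repository.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the full dataset (hypothetical values, not real measurements).
data = pd.DataFrame({
    "feature": [0.1 * i for i in range(10)],
    "status": [0, 1] * 5,
})

# shuffle=True is the default; random_state makes the split repeatable,
# and stratify keeps the class ratio the same in both parts.
train_data, test_data = train_test_split(
    data, test_size=0.2, random_state=42, stratify=data["status"]
)

print(len(train_data), len(test_data))
```

Each part could then be written out with DataFrame.to_csv and read back the same way the script below reads its train and test files.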

ML with Python

I'll use Python to implement the solution. The software I used for this application is:

  • Python 3.8.2
  • Pandas 1.1.1
  • scikit-learn 0.22.2.post1

There are several open source Naïve Bayes classifier implementations available in Python, including:

  • NLTK Naïve Bayes: Based on the standard Naïve Bayes algorithm for text classification
  • NLTK Positive Naïve Bayes: A variant of NLTK Naïve Bayes that performs binary classification with partially labeled training sets
  • Scikit-learn Gaussian Naïve Bayes: Assumes continuous features that follow a Gaussian distribution; supports partial fitting to handle a data stream or a very large dataset
  • Scikit-learn Multinomial Naïve Bayes: Optimized for discrete features, such as counts or frequencies
  • Scikit-learn Bernoulli Naïve Bayes: Designed for binary/Boolean features
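To illustrate the partial-fit feature mentioned for Gaussian Naïve Bayes, here is a small sketch that trains a model in batches with partial_fit. The data is synthetic and the batch size of 50 is an arbitrary assumption; a real stream would supply the batches itself.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic two-class data (hypothetical): class 1 is shifted away from class 0.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(3.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Shuffle so every batch contains a mix of both classes.
order = rng.permutation(len(y))
X, y = X[order], y[order]

gnb_stream = GaussianNB()
for start in range(0, len(y), 50):
    batch = slice(start, start + 50)
    # The full list of classes must be known on the first call to partial_fit;
    # passing it on every call is allowed as long as it stays the same.
    gnb_stream.partial_fit(X[batch], y[batch], classes=[0, 1])

print(gnb_stream.class_count_)
```

Because the model only keeps running per-class statistics, the full dataset never has to be in memory at once.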

I will use sklearn Gaussian Naive Bayes for this example.

Here is my Python implementation, naive_bayes_parkinsons.py:

import pandas as pd

# Feature columns we use
x_rows=['MDVP:Fo(Hz)','MDVP:Fhi(Hz)','MDVP:Flo(Hz)',
        'MDVP:Jitter(%)','MDVP:Jitter(Abs)','MDVP:RAP','MDVP:PPQ','Jitter:DDP']
y_rows=['status']

# Train

# Read train data
train_data = pd.read_csv('parkinsons/Data_Parkinsons_TRAIN.csv')
train_x = train_data[x_rows]
train_y = train_data[y_rows]
print("train_x:\n", train_x)
print("train_y:\n", train_y)

# Load sklearn Gaussian Naive Bayes and fit
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
# fit() expects a 1-D label array, so flatten the single-column DataFrame
gnb.fit(train_x, train_y.values.ravel())

# Prediction on train data
predict_train = gnb.predict(train_x)
print('Prediction on train data:', predict_train) 

# Accuracy score on train data
from sklearn.metrics import accuracy_score
accuracy_train = accuracy_score(train_y, predict_train)
print('Accuracy score on train data:', accuracy_train)

# Test

# Read test data
test_data = pd.read_csv('parkinsons/Data_Parkinsons_TEST.csv')
test_x = test_data[x_rows]
test_y = test_data[y_rows]

# Prediction on test data
predict_test = gnb.predict(test_x)
print('Prediction on test data:', predict_test) 

# Accuracy score on test data
accuracy_test = accuracy_score(test_y, predict_test)
print('Accuracy score on test data:', accuracy_test)

Run the Python application:

$ python naive_bayes_parkinsons.py

train_x:
      MDVP:Fo(Hz)  MDVP:Fhi(Hz) ...  MDVP:RAP  MDVP:PPQ  Jitter:DDP
0        152.125       161.469  ...   0.00191   0.00226     0.00574
1        120.080       139.710  ...   0.00180   0.00220     0.00540
2        122.400       148.650  ...   0.00465   0.00696     0.01394
3        237.323       243.709  ...   0.00173   0.00159     0.00519
..           ...           ...           ...  ...       ...       ...         
155      138.190       203.522  ...   0.00406   0.00398     0.01218

[156 rows x 8 columns]

train_y:
      status
0         1
1         1
2         1
3         0
..      ...
155       1

[156 rows x 1 columns]

Prediction on train data: [1 1 1 0 ... 1]
Accuracy score on train data: 0.6666666666666666

Prediction on test data: [1 1 1 1 ... 1
 1 1]
Accuracy score on test data: 0.6666666666666666

The accuracy scores reported for the train and test sets are about 67% in this example, which leaves room for improvement. Do you want to give it a try? If so, share your approach in the comments below.
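As one possible starting point, the sketch below compares a plain GaussianNB with a pipeline that first reshapes each feature toward a Gaussian using PowerTransformer, scoring both with 5-fold cross-validation. The data here is synthetic and deliberately skewed; whether this preprocessing helps on the Parkinson's data is an assumption that has not been tested in this article.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-in data (hypothetical, not the Parkinson's set).
X = np.vstack([rng.lognormal(0.0, 1.0, size=(80, 3)),
               rng.lognormal(1.0, 1.0, size=(80, 3))])
y = np.array([0] * 80 + [1] * 80)

candidates = {
    "plain GaussianNB": GaussianNB(),
    "PowerTransformer + GaussianNB": make_pipeline(PowerTransformer(), GaussianNB()),
}

cv_scores = {}
for name, model in candidates.items():
    # Mean accuracy over 5 folds; folds are stratified for classifiers.
    cv_scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {cv_scores[name]:.3f}")
```

Cross-validation gives a steadier estimate than a single train/test split, which matters on a dataset with fewer than 200 rows.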

Under the hood

The Naïve Bayes classifier is based on Bayes' rule (or theorem), which computes conditional probability: the likelihood that an event occurs given that another, related event has occurred. Stated in simple terms, it answers the question: If we know how often x has been observed together with y, and how common x and y each are, what is the probability of y when x is observed again? The rule starts from a prior probability and updates it with the observed evidence to arrive at a posterior probability. A fundamental assumption of Naïve Bayes is that all features are independent of each other and of equal importance.
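A small worked example may make the rule concrete. The function below applies Bayes' rule to a yes/no event; the function name and the three input probabilities are hypothetical and chosen only to show the arithmetic.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Bayes' rule for a binary event y given evidence x.

    prior                -- P(y)
    likelihood           -- P(x | y)
    likelihood_given_not -- P(x | not y)
    """
    # P(x): total probability of seeing the evidence at all
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    # P(y | x)
    return likelihood * prior / evidence


# Hypothetical numbers, chosen only to illustrate the arithmetic.
p = posterior(prior=0.01, likelihood=0.90, likelihood_given_not=0.10)
print(f"P(y | x) = {p:.4f}")
```

With these inputs the posterior is about 0.083: even strong evidence leaves the probability low when the prior is only 1%.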

At a high level, the steps involved in Bayes' computation are:

  1. Compute the overall prior probability of each outcome ("Has Parkinson's" and "Doesn't have Parkinson's")
  2. Compute the likelihood of the observed feature values, taken jointly across all features, for each outcome
  3. Compute the final posterior probability of each outcome (up to a normalizing constant) by multiplying the results of #1 and #2

Step 2 can be computationally quite arduous. Naïve Bayes simplifies it:

  1. Compute the overall prior probability of each outcome ("Has Parkinson's" and "Doesn't have Parkinson's")
  2. Compute the likelihood of each observed feature value separately for each outcome, treating the features as independent
  3. Compute the final posterior probability of each outcome (up to a normalizing constant) by multiplying the result of #1 by the per-feature results of #2
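The three Naïve Bayes steps above can be sketched in plain Python for Gaussian features. This is an illustrative toy, not scikit-learn's implementation: the function names, the toy data, and the small constant added to each variance are all assumptions. It adds logarithms instead of multiplying probabilities, which is equivalent and avoids numeric underflow.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Return {label: (prior, [(mean, variance), ...])} learned from the data."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    model = {}
    for label, members in by_class.items():
        prior = len(members) / len(rows)  # step 1: class prior
        stats = []
        for column in zip(*members):  # one feature at a time
            mean = sum(column) / len(column)
            # Assumed small constant keeps the variance away from zero.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (prior, stats)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the largest log posterior (up to a constant)."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            # step 2: per-feature Gaussian log likelihood
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:  # step 3: keep the most probable class
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: two features, two classes.
toy_rows = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_model = fit_gaussian_nb(toy_rows, toy_labels)
print(predict_gaussian_nb(toy_model, [1.1, 2.0]), predict_gaussian_nb(toy_model, [3.1, 4.0]))
```

The first query point sits in the middle of the class 0 rows and the second in the middle of the class 1 rows, so the sketch assigns them to classes 0 and 1 respectively.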

This is a very basic explanation, and several other factors must be considered, such as data types, sparse data, missing data, and more.

Hyperparameters

Naïve Bayes, being a simple and direct algorithm, does not need hyperparameters. However, specific implementations may provide advanced features. For example, GaussianNB has two:

  • priors: Prior probabilities can be specified instead of the algorithm taking the priors from data.
  • var_smoothing: A fraction of the largest feature variance that is added to every variance for numerical stability. Larger values widen the fitted Gaussian curves, which smooths the model.
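Both hyperparameters are passed to the GaussianNB constructor. In the sketch below, the tiny dataset and the chosen values (priors of 0.9 and 0.1, var_smoothing of 1e-2) are arbitrary assumptions meant only to show the syntax, not recommended settings.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny hypothetical dataset: one feature, two classes.
X = np.array([[1.0], [1.2], [0.8], [3.0], [3.2], [2.8]])
y = np.array([0, 0, 0, 1, 1, 1])

default_model = GaussianNB().fit(X, y)
tuned_model = GaussianNB(priors=[0.9, 0.1], var_smoothing=1e-2).fit(X, y)

print(default_model.class_prior_)  # learned from class frequencies in y
print(tuned_model.class_prior_)    # taken from the priors argument
```

Supplying priors can be useful when the class balance in the training data is known to differ from the balance expected in practice.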

Loss functions

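Maintaining its philosophy of simplicity, Naïve Bayes is usually described in terms of a 0-1 loss function: if the prediction matches the expected outcome, the loss is 0, and it's 1 otherwise. Predicting the class with the highest posterior probability minimizes the expected value of this loss.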
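The 0-1 loss is simple enough to compute by hand. In this sketch the function name and the label lists are hypothetical; it also shows that the average 0-1 loss is just one minus the accuracy.

```python
def zero_one_loss_mean(expected, predicted):
    """Average 0-1 loss: 0 for each correct prediction, 1 for each miss."""
    losses = [0 if e == p else 1 for e, p in zip(expected, predicted)]
    return sum(losses) / len(losses)


# Hypothetical labels, for illustration only.
expected = [1, 1, 0, 1, 0, 1]
predicted = [1, 1, 1, 1, 0, 0]

loss = zero_one_loss_mean(expected, predicted)
print("mean 0-1 loss:", loss)
print("accuracy:", 1 - loss)
```

Two of the six predictions are wrong here, so the mean loss is one third and the accuracy is two thirds.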

Pros and cons

Pro: Naïve Bayes is one of the easiest and fastest algorithms.

Pro: Naïve Bayes gives reasonable predictions even with little training data.

Con: Naïve Bayes predictions are estimates, not precise. It favors speed over accuracy.

Con: A fundamental Naïve Bayes assumption is the independence of all features, but this may not always be true.
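The Parkinson's features illustrate this last point. Using only the five rows of MDVP:RAP and Jitter:DDP visible in the printed train_x output earlier, a quick check suggests the two columns are far from independent. The pearson helper is a hypothetical name written here for illustration, and five rows are of course too few to draw conclusions about the whole dataset.

```python
import math

# The five rows of MDVP:RAP and Jitter:DDP shown in the printed train_x output.
rap = [0.00191, 0.00180, 0.00465, 0.00173, 0.00406]
ddp = [0.00574, 0.00540, 0.01394, 0.00519, 0.01218]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)


ratios = [d / r for d, r in zip(ddp, rap)]
print("DDP / RAP ratios:", [round(q, 2) for q in ratios])
print("correlation:", round(pearson(rap, ddp), 4))
```

In these rows Jitter:DDP is almost exactly three times MDVP:RAP, so the pair carries nearly the same information twice, and Naïve Bayes counts that evidence twice.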

In essence, Naïve Bayes is an extension of Bayes' theorem. It is one of the simplest and fastest machine learning algorithms, intended for easy and quick training and prediction. Naïve Bayes provides good-enough, reasonably accurate predictions. One of its fundamental assumptions is the independence of prediction features. Several open source implementations are available, with features beyond those of the basic algorithm.

Girish has over 20 years' experience in technology and software at a global IT services organization based in India. Girish is the architect of the "I Got" cloud platform, which aims to uplift the bottom of the pyramid and is built with an open source stack and contemporary architectural patterns such as microservices, containerisation, and multi-tenancy. Girish writes on open source and tech topics.


This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.