
Python Code for Evaluation Metrics in ML/AI for Classification Problems


Evaluation of a machine learning model is crucial to measure its performance. Numerous metrics are used in the evaluation of a machine learning model. Selection of the most suitable metrics is important to fine-tune a model based on its performance. In this article, we discuss the mathematical background and application of evaluation metrics in classification problems.

We can start discussing evaluation metrics by building a machine learning classification model. Here, the breast cancer data from sklearn’s built-in datasets is used to build a random forest binary classification model.

Import necessary libraries and packages to prepare the required environment.

 from sklearn.datasets import load_breast_cancer
 from sklearn.ensemble import RandomForestClassifier
 from sklearn.model_selection import train_test_split
 from sklearn import metrics
 import pandas as pd
 import numpy as np
 from matplotlib import pyplot as plt
 import seaborn as sns
 sns.set_style('darkgrid') 

Load data, split it into train-test set, build and train the model, and make predictions on test data.

 # choose a binary classification problem
 data = load_breast_cancer()
 # develop predictors X and target y dataframes
 X = pd.DataFrame(data['data'], columns=data['feature_names'])
 # flip the target so that 1 denotes malignant (positive) and 0 denotes benign (negative)
 y = abs(pd.Series(data['target'])-1)
 # split data into train and test set in 80:20 ratio
 X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=1)
 # build a RF model with default parameters
 model = RandomForestClassifier(random_state=1)
 model.fit(X_train, y_train)
 preds = model.predict(X_test) 

Confusion Matrix

Without a clear understanding of the confusion matrix, it is hard to proceed with any of the classification evaluation metrics. The confusion matrix provides the basis for defining and developing most of the evaluation metrics. Before discussing the confusion matrix, it is important to know the classes in the dataset and their distribution.

 plt.figure(figsize=(7,7))
 y.value_counts().plot.pie(ylabel=' ', autopct = '%0.1f%%')
 plt.title(f'0 - Not cancerous (negative)\n 1 - Cancerous (positive)        ', size=14, c='green')
 plt.tight_layout(); plt.show() 

There are two classes in the dataset. 0 refers to ‘Benign’, a non-cancerous state, which we simply denote as ‘negative’. 1 refers to ‘Malignant’, a cancerous state, which we denote as ‘positive’. In the dataset, there are 357 negative cases and 212 positive cases, so the class distribution is imbalanced.

Knowledge of the following terms is needed to proceed further with the metrics; a short code sketch after these definitions shows how each count can be obtained.

True Positive: Actually positive (ground truth), predicted as positive (correctly classified)

True Negative: Actually negative (ground truth), predicted as negative (correctly classified)

False Positive: Actually negative (ground truth), predicted as positive (misclassified)

False Negative: Actually positive (ground truth), predicted as negative (misclassified)
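These four counts can be computed directly from the ground truth and the predictions. Below is a minimal sketch using NumPy boolean masks; the variable names tn, fp, fn and tp are introduced here only for convenience and are reused in the quick checks later in this article.

 # count each outcome by comparing ground truth (y_test) with predictions (preds)
 tp = np.sum((y_test == 1) & (preds == 1))   # actually positive, predicted positive
 tn = np.sum((y_test == 0) & (preds == 0))   # actually negative, predicted negative
 fp = np.sum((y_test == 0) & (preds == 1))   # actually negative, predicted positive
 fn = np.sum((y_test == 1) & (preds == 0))   # actually positive, predicted negative
 tn, fp, fn, tp   # the resulting counts match the confusion matrix shown next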

To plot a confusion matrix,

# plot_confusion_matrix was removed in recent scikit-learn releases; ConfusionMatrixDisplay replaces it
metrics.ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, display_labels=['Negative', 'Positive'])


Here, out of 114 total test samples, 72 are True Negatives (TN), 37 are True Positives (TP), 5 are False Negatives (FN), and there are no False Positives (FP).

 confusion = metrics.confusion_matrix(y_test, preds)
 confusion.ravel() 

yields the output array([72,  0,  5, 37]), which corresponds to (TN, FP, FN, TP) for binary labels.

Most of the evaluation metrics are defined with the terms found in the confusion matrix. 

Accuracy

Accuracy is defined as the ratio of the number of correctly classified cases (TP + TN) to the total number of cases under evaluation (TP + TN + FP + FN). The best value of accuracy is 1 and the worst value is 0.

In Python, the following code calculates the accuracy of the machine learning model.

 accuracy = metrics.accuracy_score(y_test, preds)
 accuracy 

It gives 0.956 as output. However, care should be taken while using accuracy as a metric because it gives biased results for data with imbalanced classes. Since our data is imbalanced, the accuracy score alone may be misleading.
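As a quick sanity check, the same value follows directly from the confusion-matrix terms; a minimal sketch assuming the tn, fp, fn, tp variables from the earlier snippet:

 # accuracy = correctly classified cases / all cases
 (tp + tn) / (tp + tn + fp + fn)   # -> 0.956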

Precision

Precision can be defined with respect to either of the classes. The precision of the positive class is intuitively the ability of the classifier not to label as positive a sample that is negative, and equals TP / (TP + FP). The precision of the negative class is the ability of the classifier not to label as negative a sample that is positive, and equals TN / (TN + FN). The best value of precision is 1 and the worst value is 0.

In Python, precision can be calculated using the code,

 precision_positive = metrics.precision_score(y_test, preds, pos_label=1)
 precision_negative = metrics.precision_score(y_test, preds, pos_label=0)
 precision_positive, precision_negative 

which gives (1.000, 0.935) as output.
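The same values can be recovered from the confusion-matrix counts; a short check assuming the tn, fp, fn, tp variables defined earlier:

 # precision of the positive class and of the negative class
 tp / (tp + fp), tn / (tn + fn)   # -> (1.000, 0.935)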

Recall

Recall can also be defined with respect to either of the classes. Recall of the positive class, also termed sensitivity, is defined as the ratio of True Positives to the number of actual positive cases, i.e. TP / (TP + FN). It can intuitively be expressed as the ability of the classifier to capture all the positive cases. It is also called the True Positive Rate (TPR).

Recall of the negative class, also termed specificity, is defined as the ratio of True Negatives to the number of actual negative cases, i.e. TN / (TN + FP). It can intuitively be expressed as the ability of the classifier to capture all the negative cases. It is also called the True Negative Rate (TNR).

In Python, sensitivity and specificity can be calculated as

 recall_sensitivity = metrics.recall_score(y_test, preds, pos_label=1)
 recall_specificity = metrics.recall_score(y_test, preds, pos_label=0)
 recall_sensitivity, recall_specificity 

which gives (0.881, 1.000) as output. The best value of recall is 1 and the worst value is 0. 
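Both values again follow from the confusion-matrix counts; a small sketch assuming the tn, fp, fn, tp variables from earlier:

 # sensitivity (recall of the positive class) and specificity (recall of the negative class)
 tp / (tp + fn), tn / (tn + fp)   # -> (0.881, 1.000)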

F1-score

F1-score is considered one of the best metrics for classification models, even when the classes are imbalanced. It is the harmonic mean of the precision and recall of the respective class. Its best value is 1 and the worst value is 0.

F1-score = 2 × (Precision × Recall) / (Precision + Recall)

In Python, F1-score can be determined for a classification model using

 f1_positive = metrics.f1_score(y_test, preds, pos_label=1)
 f1_negative = metrics.f1_score(y_test, preds, pos_label=0)
 f1_positive, f1_negative 

It gives an output of (0.937, 0.966).
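As a check, the F1-score of the positive class is simply the harmonic mean of the precision and recall computed above; a one-line sketch reusing the precision_positive and recall_sensitivity variables:

 # harmonic mean of precision and recall for the positive class
 2 * precision_positive * recall_sensitivity / (precision_positive + recall_sensitivity)   # -> 0.937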

Accuracy, precision, recall, and F1-score can all be obtained at once with the classification_report function in Python:

print(metrics.classification_report(y_test, preds))

outputs

(a table reporting precision, recall, f1-score, and support for classes 0 and 1, along with accuracy and the macro and weighted averages)

Here, the macro average of any metric is the unweighted mean of its values over all classes, giving equal weight to every class. The weighted average, on the other hand, weights each class by the number of data points it contains. In the above output, 0 and 1 denote the negative and positive classes respectively, and the support column gives the number of data points in each class.
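These averages can also be requested directly through the average parameter of the individual metric functions; a short sketch for the F1-score:

 # unweighted mean over classes vs. mean weighted by class support
 f1_macro = metrics.f1_score(y_test, preds, average='macro')
 f1_weighted = metrics.f1_score(y_test, preds, average='weighted')
 f1_macro, f1_weighted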

ROC and AUC score

ROC is short for Receiver Operating Characteristic curve, which helps determine the optimum threshold value for classification. The threshold is the probability cut-off that forms the boundary between the two classes: in our model, any prediction with a probability above the threshold is classified as class 1, and anything below it as class 0.
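To make the idea of a threshold concrete, here is a small sketch (the variable proba_pos is introduced here only for illustration) that applies the default 0.5 cut-off to the predicted probabilities; for this binary model it reproduces model.predict, apart from possible ties at exactly 0.5:

 # probability of the positive class for each test sample
 proba_pos = model.predict_proba(X_test)[:, 1]
 # classify as 1 when the probability exceeds the threshold, otherwise 0
 manual_preds = (proba_pos >= 0.5).astype(int)
 (manual_preds == preds).mean()   # expected to be (close to) 1.0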

ROC is realized by visualizing it in a plot. The area under the ROC curve, known as AUC, is used as a metric to evaluate the classification model. The ROC curve is drawn with the false positive rate on the x-axis and the true positive rate on the y-axis. The best value of AUC is 1 and the worst is 0, while an AUC of 0.5, corresponding to a random classifier, is generally considered the bottom reference for a classification model.

In Python, the ROC curve can be plotted by calculating the true positive rate and false positive rate. The values are calculated in steps as the threshold is gradually varied from 0 to 1.

 sns.set_style('darkgrid')
 # predicted probability of the positive class for the train and test sets
 prob_train = model.predict_proba(X_train)[:, 1]
 prob_test = model.predict_proba(X_test)[:, 1]
 # false positive rate, true positive rate, thresholds
 fpr1, tpr1, thresholds1 = metrics.roc_curve(y_test, prob_test)
 fpr2, tpr2, thresholds2 = metrics.roc_curve(y_train, prob_train)
 # auc score
 auc1 = metrics.auc(fpr1, tpr1)
 auc2 = metrics.auc(fpr2, tpr2)
 plt.figure(figsize=(8,8))
 # plot auc 
 plt.plot(fpr1, tpr1, color='blue', label='Test ROC curve area = %0.2f'%auc1)
 plt.plot(fpr2, tpr2, color='green', label='Train ROC curve area = %0.2f'%auc2)
 plt.plot([0,1],[0,1], 'r--')
 plt.xlim([-0.1, 1.1])
 plt.ylim([-0.1, 1.1])
 plt.xlabel('False Positive Rate', size=14)
 plt.ylabel('True Positive Rate', size=14)
 plt.legend(loc='lower right')
 plt.show() 
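Equivalently, the test AUC can be obtained in a single call from the predicted probabilities; a quick check against the value shown in the plot legend:

 # AUC computed directly from the positive-class probabilities
 metrics.roc_auc_score(y_test, prob_test)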

Tuning ROC to find the optimum threshold value: the following code lists the threshold (cut-off) values at which the true positive rate and the true negative rate are closest to each other, i.e. where tpr - (1 - fpr) is nearest to zero.

 # creating index
 i = np.arange(len(tpr1))
 # extracting roc values against different thresholds 
 roc = pd.DataFrame({'fpr':fpr1, 'tpr':tpr1, 'tf':(tpr1-1+fpr1), 'thresholds':thresholds1}, index=i)
 # 5 thresholds where tpr is closest to 1 - fpr
 roc.iloc[(roc.tf-0).abs().argsort()[:5]] 

Precision-Recall Curve

To find the best threshold value based on the trade-off between precision and recall, the precision-recall curve is drawn.

 pre, rec, thr = metrics.precision_recall_curve(y_test, prob_test)
 plt.figure(figsize=(8,4))
 plt.plot(thr, pre[:-1], label='precision')
 plt.plot(thr, rec[:-1], label='recall')  # recall values aligned with the thresholds
 plt.xlabel('Threshold')
 plt.title('Precision & Recall vs Threshold', c='r', size=16)
 plt.legend()
 plt.show() 

The trade-off our random forest model makes between precision and recall can be visualized with the following code:

 fig, ax = plt.subplots(1, 1, figsize=(8,8))
 # plot_precision_recall_curve was removed in recent scikit-learn releases; PrecisionRecallDisplay replaces it
 metrics.PrecisionRecallDisplay.from_estimator(model, X_test, y_test, ax=ax)

Hamming Loss

Hamming loss is the fraction of targets that are misclassified. The best value of the hamming loss is 0 and the worst value is 1. It can be calculated as 

 hamming_loss = metrics.hamming_loss(y_test, preds)
 hamming_loss 

to give an output of 0.044.
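Since this is a binary, single-label problem, the hamming loss is simply the fraction of misclassified test samples; a one-line check:

 # fraction of test samples where prediction and ground truth disagree
 np.mean(preds != y_test)   # -> 0.044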

Jaccard Score

Jaccard score is defined as the ratio of the size of the intersection to the size of the union between the set of predicted labels and the set of ground-truth labels. It is a similarity coefficient between the predicted classes and the true classes. A value of 1 denotes the best classification and 0 the worst. The Jaccard score is considered a poor choice when the class distribution is imbalanced.

 jaccard = metrics.jaccard_score(y_test, preds)
 jaccard 

gives an output of 0.881.
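For the positive class this is just TP / (TP + FP + FN); a quick check with the counts from the earlier sketch:

 # intersection over union of predicted and actual positive labels
 tp / (tp + fp + fn)   # -> 0.881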

Cross-entropy loss

Cross-entropy loss, also known as log loss, is popular in deep neural networks, partly because, combined with sigmoid or softmax outputs, it avoids the vanishing-gradient problem that squared-error loss suffers from when the outputs saturate. It measures how well the predicted probabilities match the true labels: the loss is the negative mean of the logarithm of the probability the model assigns to the true class of each data point, so confident misclassifications are penalized heavily.

In Python, cross-entropy loss can be calculated using the code,

 # Entropy loss
 cross_entropy_loss = metrics.log_loss(y_test, prob_test)
 cross_entropy_loss 

which gives an output of 0.463, where 0 denotes perfect classification or zero impurity.
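The same value can be reproduced from the definition; a minimal sketch in which the probabilities are clipped to avoid log(0), mirroring the clipping that log_loss applies internally:

 # negative mean log-probability assigned to the true class
 p = np.clip(prob_test, 1e-15, 1 - 1e-15)
 -np.mean(y_test * np.log(p) + (1 - y_test) * np.log(1 - p))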

Other metrics

Though we have covered most of the evaluation metrics for classification in this article, a few metrics meant only for multi-class classification are left untouched. Interested readers can refer to the official metrics documentation of Scikit-Learn, TensorFlow, and PyTorch.

