Getting Started With Computer Vision Using TensorFlow Keras

Computer Vision attempts to perform tasks that the human brain does with the aid of the eyes. It is a branch of Deep Learning that deals with images and videos.

Computer Vision attempts to perform the tasks that a human brain does with the aid of human eyes. Computer Vision is a branch of Deep Learning that deals with images and videos. Computer Vision tasks can be roughly classified into two categories:

  1. Discriminative tasks
  2. Generative tasks

Discriminative tasks, in general, are about predicting the probability of an outcome (e.g. the class of an image) given the input data (e.g. the features of an image). Generative tasks, in general, are about modelling or generating the data distribution itself (e.g. generating an image), given a class label and/or other conditions.

Discriminative Computer Vision finds applications in image classification, object detection, object recognition, shape detection, pose estimation, image segmentation, etc. Generative Computer Vision finds applications in photo enhancement, image synthesis, augmentation, deepfake videos, etc.

This article aims to build a strong foundation in Computer Vision by exploring image classification tasks using Convolutional Neural Networks built with TensorFlow Keras. Importance is given both to the code and to the key theoretical and mathematical concepts behind each operation. Let’s start our Computer Vision journey!

Readers are expected to have a basic understanding of deep learning. This article, “Getting Started With Deep Learning Using TensorFlow Keras”, helps one grasp the fundamentals of deep learning.

Import necessary packages, libraries and modules.

 import tensorflow as tf
 import tensorflow_datasets as tfds
 from tensorflow import keras
 import numpy as np
 import pandas as pd
 import matplotlib.pyplot as plt 

Image Classification with Fashion MNIST dataset

Load the Fashion MNIST dataset from the in-built Keras Datasets.

 fashion_data = keras.datasets.fashion_mnist.load_data()
 (x_train,y_train),(x_val,y_val)= fashion_data 

Let’s have a look at the size of the train data.

x_train.shape, y_train.shape

Output:

 ((60000, 28, 28), (60000,))

There are 60,000 grayscale images in the train data, each of size 28×28. For each image, the corresponding label is available in y_train. According to the official dataset documentation, there are 10 different classes, represented numerically from 0 to 9. The images are low-resolution pictures of fashion items such as shirts, coats, shoes, trousers, pullovers and sandals.
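
For readability, the numeric labels can be mapped to class names. The list below follows the label order documented for Fashion MNIST; it is written out here as a convenience rather than loaded from the dataset itself.

 # Fashion MNIST class names, in label order (0 to 9), as per the dataset documentation
 class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
                'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
 print(class_names[y_train[0]])   # readable label of the first training image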

Similarly, we can have a look at the size of the validation data.

x_val.shape ,y_val.shape

Output:

 ((10000, 28, 28), (10000,))

There are 10,000 validation images and corresponding labels. Let’s sample an image and visualize it.

 plt.imshow(x_train[10])
 plt.colorbar()
 plt.show() 

Output:

(figure: a sample training image displayed with a colourbar)

Pixel values range from 0 to 255, so we scale the data by dividing by 255.0.

 x_train = x_train/255.0
 x_val = x_val/255.0 

Let’s visualize 25 images along with their class labels to get a better feel for the data.

 plt.figure(figsize=(7,7))
 for i in range(1,26):
   plt.subplot(5,5,i)
   plt.imshow(x_train[i])
   plt.title(y_train[i],color='r')
   plt.xticks([])
   plt.yticks([])
 plt.tight_layout()
 plt.show()  

Output:

(figure: 25 training images with their class labels)

We can now model a convolutional neural network to build an image classifier. However, a convolution layer expects each input to be three-dimensional; the usual shape of an input image is (height, width, channels). Since we have grayscale images, their shape is just (height, width). We therefore increase the number of dimensions from 2 to 3 by expanding along the last axis.

 x_train = tf.expand_dims(x_train, axis=-1)
 x_val = tf.expand_dims(x_val, axis=-1)
 x_train.shape 

Output:

 TensorShape([60000, 28, 28, 1])

Let us build a Convolutional neural network.

 classifier = keras.models.Sequential([# convolution layer
                                       keras.layers.Conv2D(64,(3,3), activation='relu',input_shape=(28,28,1)),
                                       # flattening layer
                                       keras.layers.Flatten(),
                                       # dense hidden layer
                                       keras.layers.Dense(128, activation='relu'),
                                       # dense output layer
                                       keras.layers.Dense(10, activation='softmax')
 ]) 

We already know about Dense layers, activation functions and the Sequential model from the deep learning fundamentals article linked above. Here, we discuss the convolution layer, the Flatten layer and the way they work. In the code above, we instantiated a Conv2D layer with a few arguments. The first argument, 64, is the number of kernels (analogous to neuron units in an artificial neural network). The second argument is the size of those kernels. The last argument is the input shape of the images (usually three dimensions: height, width and colour channels).

A grayscale image can be viewed as a matrix of size (height, width), whereas a 3-channel colour image can be viewed as a stack of three matrices, each of size (height, width). In structured data, each input feature is an individual number that is multiplied by a neuron’s weight to contribute to the output. Similarly, an image in matrix form can be viewed as a 2D arrangement of features that must be multiplied by weights arranged as a matrix to produce the output. This matrix-like arrangement of weights is called a kernel.

However, a convolution kernel differs greatly from the weights of a Dense neuron. The number of weights in a Dense neuron equals the number of inputs to that neuron (plus one bias). In contrast, the number of entries in a kernel does not equal the number of entries in the image matrix. For instance, in our Conv2D layer, the image is 28×28, i.e. 784 pixels in total, while the kernel is 3×3, i.e. 9 weights (plus one bias).

Then how do we combine 9 weights with 784 pixels to obtain weighted sums? First, the kernel is positioned at the top-left of the image. The pixels under the kernel are multiplied element-wise with the kernel weights, and the weighted sum becomes the first element of the output matrix. Next, the kernel is slid by one, two or more cells at a time (the stride) to cover the next patch of pixels, and the weighted sum is again computed and written into the output matrix. The kernel is thus moved systematically over the whole image, extracting features to form the output matrix.

The default stride is 1. If the sliding kernel moves over more cells at a time, the output matrix becomes smaller. Further, padding (adding extra zeros around the edges of the image) may optionally be applied to give more importance to the edges of the input image.

Figure: convolution of an input image (5×5 blue matrix) by a sliding kernel (3×3 grey matrix) to extract features (3×3 green matrix). The dotted white entries are zeros added artificially around the input image (padding), which lets the kernel reach the edges and corners effectively. The kernel slides two cells at a time (stride 2×2). (image source)
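
To make the sliding-window idea concrete, here is a minimal NumPy sketch of a single-channel convolution (strictly, a cross-correlation, which is what Conv2D actually computes) with zero padding and a configurable stride. It assumes a square image and kernel and omits the bias; it is only an illustration, not how TensorFlow implements the operation.

 def convolve2d(image, kernel, stride=1, pad=0):
     """Naive single-channel convolution: slide the kernel and take weighted sums."""
     if pad:
         image = np.pad(image, pad)          # zeros added around the edges (padding)
     k = kernel.shape[0]
     out_size = (image.shape[0] - k) // stride + 1
     out = np.zeros((out_size, out_size))
     for i in range(out_size):
         for j in range(out_size):
             patch = image[i*stride:i*stride + k, j*stride:j*stride + k]
             out[i, j] = np.sum(patch * kernel)   # weighted sum at this kernel position
     return out

 image = np.arange(25).reshape(5, 5)      # toy 5x5 "image", as in the figure above
 kernel = np.ones((3, 3))                 # toy 3x3 kernel
 print(convolve2d(image, kernel, stride=2, pad=1).shape)   # (3, 3)

With a 5×5 input, a 3×3 kernel, padding of 1 and a stride of 2, the output is 3×3, matching the figure above.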

It should be noted that the number of parameters in a convolution layer is decided purely by the number of kernels, size of a kernel and number of colour channels. The number of parameters is not affected by the size of the input image.

Therefore, number of parameters = number of kernels × (kernel height × kernel width × number of channels + 1 bias per kernel).

In our Conv2D layer, there are 64 × (3 × 3 × 1 + 1) = 640 parameters.
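
This can be verified directly from the layer object once the model above has been built; a quick check, assuming the layer order shown above:

 conv_layer = classifier.layers[0]
 print(conv_layer.count_params())   # 64 kernels x (3*3*1 weights + 1 bias) = 640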

Finally, the convolution layer is followed by a Flatten layer, the bridge between the multi-dimensional feature maps produced by convolution layers and the 1-dimensional vectors expected by Dense layers. The Dense layers form the classification head that makes the final decision. A Flatten layer simply unrolls the matrix-like features into a 1D feature vector.
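
As a small illustrative sketch of what Flatten does: the Conv2D layer above turns a 28×28×1 image into a 26×26×64 stack of feature maps, and Flatten unrolls that stack into a single vector of 26 × 26 × 64 = 43,264 values.

 features = tf.zeros((1, 26, 26, 64))            # e.g. the Conv2D output for one image
 print(keras.layers.Flatten()(features).shape)   # (1, 43264)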

We have now built a convolutional neural network for our Computer Vision task. Next, we define an optimizer, a loss function and a metric required to train and evaluate the model. We use the Adam optimizer (an SGD variant), the sparse categorical cross-entropy loss function (for multi-class classification with integer labels) and the accuracy metric.

 classifier.compile(optimizer='adam',
                    loss='sparse_categorical_crossentropy',
                    metrics=['accuracy']) 

Perform training for 10 epochs.

history = classifier.fit(x_train,y_train,validation_data=(x_val,y_val),epochs=10)

Output:

(training log for 10 epochs showing loss and accuracy for training and validation)

Visualize losses and accuracies over epochs for both training and evaluation.

 hist = pd.DataFrame(history.history)
 epochs = np.arange(1,11)
 plt.plot(epochs,hist['loss'], label='Train Loss')
 plt.plot(epochs,hist['val_loss'], label='Val Loss')
 plt.legend()
 plt.ylabel('Loss')
 plt.xlabel('Epochs')
 plt.xticks(epochs)
 plt.show() 

Output:

(figure: training and validation loss over 10 epochs)

The training loss keeps decreasing, while the validation loss grows. This is a clear sign of overfitting.

 epochs = np.arange(1,11)
 plt.plot(epochs,hist['accuracy'], label='Train Accuracy')
 plt.plot(epochs,hist['val_accuracy'], label='Val Accuracy')
 plt.legend()
 plt.ylabel('Accuracy')
 plt.xlabel('Epochs')
 plt.xticks(epochs)
 plt.show() 

Output:

(figure: training and validation accuracy over 10 epochs)

The accuracy plot confirms the insight from the loss plot. Steps against overfitting should be taken, such as adding dropout layers, employing kernel regularizers, reducing model complexity, increasing the amount of data through augmentation, or applying early stopping.
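
As a sketch of two of these remedies, the classifier can be rebuilt with a Dropout layer before the output layer and trained with an EarlyStopping callback that halts training once the validation loss stops improving. The dropout rate and patience below are arbitrary illustrative choices, not tuned values.

 # same architecture as above, plus a Dropout layer for regularization
 regularized = keras.models.Sequential([
     keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(28,28,1)),
     keras.layers.Flatten(),
     keras.layers.Dense(128, activation='relu'),
     keras.layers.Dropout(0.5),                 # randomly drop half the units during training
     keras.layers.Dense(10, activation='softmax')
 ])
 regularized.compile(optimizer='adam',
                     loss='sparse_categorical_crossentropy',
                     metrics=['accuracy'])
 # stop when the validation loss has not improved for 3 epochs
 early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3,
                                            restore_best_weights=True)
 history_reg = regularized.fit(x_train, y_train,
                               validation_data=(x_val, y_val),
                               epochs=10, callbacks=[early_stop])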

Image Classification with Beans Dataset

We explore more options and methodologies in Computer Vision with a relatively complex dataset. The Beans dataset available in-built with TensorFlow Datasets has images belonging to three classes.

  1. Healthy bean leaves
  2. Leaves with bean rust (unhealthy)
  3. Leaves with angular leaf spot (unhealthy)

A major advantage of TensorFlow Datasets is that the data comes pre-processed and vectorized, ready to use off the shelf. Load the Beans dataset along with its metadata.

 data, meta = tfds.load('beans',
                  as_supervised=True,
                  with_info=True,
                  )
 train, val, test = data['train'], data['validation'], data['test'] 

The labels for the three classes are encoded as 0, 1 and 2. The human-readable label names can be extracted from the metadata.

label_extractor = meta.features['label'].int2str
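
As a quick sanity check, the three class names can also be listed directly from the metadata:

 print(meta.features['label'].names)              # all class names, in label order
 print(label_extractor(0), label_extractor(1), label_extractor(2))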

Sample an image. Display its label and size, and visualize it.

 for example,label in train.take(1):
   print(label.numpy())
   print(label_extractor(label))
   print(example.shape)
   plt.imshow(example)
   plt.colorbar()
   plt.show() 

Output:

(output: the numeric label, its name, the image shape (500, 500, 3) and the displayed image)

The images are of size 500 by 500 in three colour channels. The pixel values range from 0 to 255. Define a helper function to scale and resize the image to 160 by 160 (for memory efficiency). 

 def scale_image(img, label):
   img = tf.cast(img, tf.float32)
   img = img/255.0
   img = tf.image.resize(img,(160,160))
   return img,label 

Scale pixel values and resize the images.

 train = train.map(scale_image)
 val = val.map(scale_image)
 test = test.map(scale_image)

View nine of the resized, scaled images along with their classes for a better understanding.

 plt.figure(figsize=(7,7))
 i = 1
 for example,label in train.skip(10).take(9):
   plt.subplot(3,3,i)
   plt.title(label_extractor(label),color='r')
   plt.imshow(example)
   plt.xticks([])
   plt.yticks([])
   i += 1
 plt.tight_layout()
 plt.show() 

Output:

(figure: nine bean leaf images with their class labels)

With some of these images, we humans can classify the leaves easily. Let’s see how well our model learns to do the same. Prepare the train and validation data in batches, since model.fit expects batched data when training from a tf.data dataset. Shuffle the training images; leave the validation and test images as they are.

 train_batch = train.shuffle(1000).batch(64)
 val_batch = val.batch(64)
 test_batch = test.batch(64) 
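
Optionally, the input pipeline can be sped up by caching the decoded images and prefetching the next batch while the current one is being processed. This is purely a performance tweak that the rest of the walkthrough does not depend on; it assumes a recent TensorFlow version where tf.data.AUTOTUNE is available.

 # optional input-pipeline tuning: cache decoded images and prefetch the next batch
 AUTOTUNE = tf.data.AUTOTUNE
 train_batch = train.cache().shuffle(1000).batch(64).prefetch(AUTOTUNE)
 val_batch = val.cache().batch(64).prefetch(AUTOTUNE)
 test_batch = test.batch(64).prefetch(AUTOTUNE)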

There are two parts in this convolutional neural network: a base with convolution layers and their associated layers, and a head with Dense layers and their associated layers. Build the convolutional base with three Conv2D layers and two MaxPooling2D layers in between. While a convolution layer extracts features from the input image or feature map, a max pooling layer downsamples the feature map, retaining the most prominent features and discarding the rest.

 base = keras.models.Sequential([
                                keras.layers.Conv2D(64,(3,3), activation='relu',input_shape=[160,160,3]),
                                keras.layers.MaxPooling2D((2,2)),
                                keras.layers.Conv2D(128,(3,3),strides=2, activation='relu', kernel_regularizer='l1_l2'),
                                keras.layers.MaxPooling2D((2,2)),
                                keras.layers.Conv2D(128,(3,3),strides=2, activation='relu', kernel_regularizer='l1_l2'),                               
 ]) 

Build a head with one Flatten layer, three Dense layers and one dropout layer. 

 head = keras.models.Sequential([
                                 keras.layers.Flatten(),
                                 keras.layers.Dense(128,activation='relu'),
                                 keras.layers.Dropout(0.5),
                                 keras.layers.Dense(64,activation='relu'),
                                 keras.layers.Dense(3,activation='softmax')
 ]) 

Stack base and head to form the complete architecture. It should be noted that the base and head can be constructed in a single Sequential model in one go. 

 model = keras.models.Sequential([base, head])

Let’s explore the number of parameters in the architecture.

 base.summary()

Output:

(base model summary: layer output shapes and parameter counts)

 head.summary()

Output:

(head model summary: layer output shapes and parameter counts)

 model.summary()

Output:

(full model summary: total parameter count)

There are around 1.56 million parameters in our architecture. Let’s define our optimizer, loss function and metric to perform training and evaluation.

 model.compile(
     optimizer='adam',
     loss='sparse_categorical_crossentropy',
     metrics=['accuracy']
 ) 

Train the model for 40 epochs.

history = model.fit(train_batch, validation_data=val_batch, epochs=40)

A portion of the output:

(training log excerpt)

Analyze the training performance using the training history. 

 hist = pd.DataFrame(history.history)
 epochs = np.arange(6,41)                                # plot from epoch 6 onward
 plt.plot(epochs,hist['loss'][5:], label='Train Loss')   # the first five epochs are skipped
 plt.plot(epochs,hist['val_loss'][5:], label='Val Loss')
 plt.legend()
 plt.ylabel('Loss')
 plt.xlabel('Epochs')
 plt.xticks(np.arange(5,42,2))
 plt.show() 

Output:

(figure: training and validation loss over epochs 6-40)

 epochs = np.arange(1,41)
 plt.plot(epochs,hist['accuracy'], label='Train Accuracy')
 plt.plot(epochs,hist['val_accuracy'], label='Val Accuracy')
 plt.legend()
 plt.ylabel('Accuracy')
 plt.xlabel('Epochs')
 plt.xticks(np.arange(1,42,2))
 plt.show() 

Output:

(figure: training and validation accuracy over 40 epochs)

The loss keeps decreasing and the accuracy keeps increasing right up to the final epoch, which suggests the model could be trained for more epochs until convergence. The curves are also not smooth, which suggests adding Batch Normalization to stabilize training.
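
One way to add it, shown here only as a sketch of the base: a BatchNormalization layer is placed after each convolution, followed by the ReLU as a separate Activation layer (applying the activation before the normalization is also common; both conventions work).

 # sketch: the same base, with Batch Normalization after each convolution
 base_bn = keras.models.Sequential([
     keras.layers.Conv2D(64, (3,3), input_shape=[160,160,3]),
     keras.layers.BatchNormalization(),
     keras.layers.Activation('relu'),
     keras.layers.MaxPooling2D((2,2)),
     keras.layers.Conv2D(128, (3,3), strides=2, kernel_regularizer='l1_l2'),
     keras.layers.BatchNormalization(),
     keras.layers.Activation('relu'),
     keras.layers.MaxPooling2D((2,2)),
     keras.layers.Conv2D(128, (3,3), strides=2, kernel_regularizer='l1_l2'),
     keras.layers.BatchNormalization(),
     keras.layers.Activation('relu'),
 ])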

Finally, we use our trained model to make predictions on the test data!

preds = model.predict(test_batch)

Let’s inspect the predictions for a batch of test images.

 images,labels = next(iter(test_batch))
 plt.figure(figsize=(7,7))
 for i in range(9):
   plt.subplot(3,3,i+1)
   pred = np.argmax(preds[i])
   plt.title(f'Actual: {label_extractor(labels[i])}' ,color='b',size=12)
   if pred==labels[i]:
     plt.xlabel(f'Predicted: {label_extractor(pred)}', color='b',size=12)
   else:
     plt.xlabel(f'Predicted: {label_extractor(pred)}', color='r',size=12)
   plt.imshow(images[i])
   plt.xticks([])
   plt.yticks([])
 plt.tight_layout()
 plt.show() 

Output:

(figure: nine test images with their actual and predicted labels)

The actual label is shown at the top of each image in blue. The predicted label is shown at the bottom: blue indicates a correct prediction and red an incorrect one.
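
Beyond inspecting individual predictions, the aggregate performance on the whole test set can be measured with model.evaluate, which returns the compiled loss and accuracy:

 # aggregate metrics over the entire test set
 test_loss, test_acc = model.evaluate(test_batch)
 print(f'Test accuracy: {test_acc:.3f}')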

Find here the Google Colab notebook with the above code.

Wrapping Up

In this article, we discussed the different categories of Computer Vision tasks, explored image classification with convolutional neural networks on two datasets using TensorFlow Keras, and covered the convolution operation along with the other operations associated with convolutional neural networks.

It’s now your turn to perform the image classification with your image dataset!
