Last updated February 28, 2024

Getting Started With Image Generation Using TensorFlow Keras

Image Generation is one of the most curious applications in Computer Vision. Variational Autoencoders and GANs are the preferred base models


Published on May 25, 2021

by Rajkumar Lakshmanamoorthy

Computer Vision is a broad field of deep learning with a wide range of applications. Image Generation is one of its most intriguing applications and itself covers a large collection of tasks, on a few of which models can outperform humans. Most image generation tasks also apply to videos, since a video is a sequence of images.

A few popular Image Generation tasks are:

Image-to-Image translation (e.g. grayscale image to colour image)
Text-to-Image translation
Super-resolution
Photo-to-Cartoon/Emoji translation
Image inpainting
Image dataset generation
Medical Image generation
Realistic photo generation
Semantic-to-Photo translation
Image blending
Deepfake video generation
2D-to-3D image translation

One deep learning generative model can perform one or more of these tasks with a few configuration changes. Two well-known families of image generative models are the Variational Autoencoder (VAE) and the Generative Adversarial Network (GAN), each with its original version and numerous variants.

This article discusses the concepts behind image generation and walks through a code implementation of a Variational Autoencoder with a practical example using TensorFlow Keras. TensorFlow is one of the most widely used frameworks for deep learning. Keras is a high-level API built on top of TensorFlow and meant exclusively for deep learning.


How does Image Generation work?

VAEs, GANs and their variants all contain a network that maps a low-dimensional latent vector to an image. In the autoencoder family this network is the decoder, and it is paired with an encoder (a GAN pairs its generator with a discriminator instead). An encoder is a deep neural network that transforms the high-dimensional input image into a low-dimensional latent vector representation. A decoder is a deep neural network that transforms the low-dimensional latent vector representation into a high-dimensional representation that is called the generated image. The encoder and decoder together make up the traditional Autoencoder (AE). The Variational Autoencoder (VAE) modifies the AE architecture to improve its image generation capabilities. Its encoder maps the input image to the parameters of a Gaussian distribution: a mean vector and a variance vector (in practice, the log of the variance). A sampler then draws a latent vector from the Gaussian defined by this mean and variance. Finally, the decoder generates the synthetic image from this latent vector.

Figure: An overview of the VAE architecture

Since the encoder compresses a high-dimensional input image to a low-dimensional representation, the decoder is trained to reconstruct a high-dimensional image from that compact representation. During training, the model compares the generated image with the input image, calculates the loss and back-propagates it to update the network’s weights. Once the model is trained, the encoder is no longer needed for generation. At inference time, latent vectors are sampled directly and fed to the decoder, which generates images from them. Since the decoder is the part that generates the images, it is also called the generator.

Create the Environment

Create the necessary Python environment by importing the required frameworks, libraries and modules.

 import numpy as np
 import matplotlib.pyplot as plt  # used by the plotting cells below
 import tensorflow as tf
 from tensorflow import keras
 from tensorflow.keras import layers

Load an Image Dataset

We use the Fashion MNIST dataset, which ships with Keras Datasets.

 fashion_data = keras.datasets.fashion_mnist.load_data()
 (x_train,y_train),(x_val,y_val)= fashion_data 
 x_train.shape, x_val.shape

Output:

 ((60000, 28, 28), (10000, 28, 28))

There are 60000 images in the train set and 10000 images in the validation set. Each image is a grayscale (1 channel) image of shape 28 by 28. Image generation using VAE follows a self-supervised approach. Therefore, we may delete the y_train and y_val data to save memory.

 del y_train, y_val

Visualize an example from the downloaded image data to get a better insight.

 plt.imshow(x_train[10])
 plt.colorbar()
 plt.show()

Output:

It can be observed that the pixel values range from 0 to 255, so we scale them to the range [0, 1]. Further, Keras convolutional layers expect each image to be three-dimensional (height, width, channels), whereas the available images are two-dimensional. Because the VAE is trained without labels and this walkthrough does not evaluate on held-out data, we can merge the available training and validation sets to get a larger training set.

 # Merge two datasets
 data = tf.concat([x_train, x_val], axis=0)
 # images from 2D to 3D
 data = tf.expand_dims(data, -1)
 # scale the images to [0,1]
 data = tf.cast(data, tf.float32)
 data = data / 255.0
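
To make the effect of these three steps concrete, here is a small NumPy sketch on synthetic stand-in arrays. The helper `preprocess` and the arrays `fake_train` and `fake_val` are illustrative names, not part of the original code; the sketch only assumes the inputs have the same dtype and layout as `x_train` and `x_val`.

```python
import numpy as np

def preprocess(train_images, val_images):
    """Merge two uint8 image stacks, add a channel axis, scale to [0, 1]."""
    merged = np.concatenate([train_images, val_images], axis=0)
    merged = np.expand_dims(merged, -1)          # (N, 28, 28) -> (N, 28, 28, 1)
    return merged.astype("float32") / 255.0

# Synthetic stand-ins for x_train and x_val (assumption: uint8, 28 by 28)
rng = np.random.default_rng(0)
fake_train = rng.integers(0, 256, size=(6, 28, 28), dtype=np.uint8)
fake_val = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

prepared = preprocess(fake_train, fake_val)
print(prepared.shape, prepared.dtype)            # (10, 28, 28, 1) float32
```

The trailing axis of size 1 is the channel dimension that the convolutional layers in the next section expect.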

Build the VAE Architecture

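 class Sampling(layers.Layer):
     """Draws a latent vector z from the Gaussian N(mean, exp(logvar))."""
     def call(self, inputs):
         mean, logvar = inputs
         batch = tf.shape(mean)[0]
         dim = tf.shape(mean)[1]
         # eps ~ N(0, I); tf.random.normal replaces the Keras backend helper,
         # which is not available in every Keras version
         eps = tf.random.normal(shape=(batch, dim))
         # z = mean + sigma * eps, with sigma = exp(0.5 * logvar)
         return mean + tf.exp(0.5 * logvar) * eps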
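
The sampling layer draws standard normal noise and then shifts and scales it, rather than sampling from the Gaussian directly. The following NumPy sketch reproduces the same arithmetic outside of TensorFlow so that the result can be checked numerically. The function name `sample_latent` and the chosen mean and variance are illustrative assumptions.

```python
import numpy as np

def sample_latent(mean, logvar, rng):
    """Draw z = mean + sigma * eps with eps ~ N(0, I), sigma = exp(0.5 * logvar)."""
    eps = rng.standard_normal(size=mean.shape)
    return mean + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(42)
mean = np.full((10000, 2), 3.0)                 # every row has mean 3
logvar = np.full((10000, 2), np.log(0.25))      # variance 0.25, so sigma = 0.5

z = sample_latent(mean, logvar, rng)
print(z.mean(axis=0), z.std(axis=0))            # close to 3.0 and 0.5 per dimension
```

Because the randomness lives entirely in `eps`, the output is a differentiable function of `mean` and `logvar`, which is what lets gradients flow back into the encoder during training.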

Build an encoder that takes an image as input and outputs the mean, the log-variance, and a latent vector sampled from them.

 encoder_inputs = keras.Input(shape=(28, 28, 1))
 x = layers.Conv2D(32, 3, activation="relu", strides=2, padding="same")(encoder_inputs)
 x = layers.Conv2D(64, 3, activation="relu", strides=2, padding="same")(x)
 x = layers.Flatten()(x)
 x = layers.Dense(16, activation="relu")(x)
 mean = layers.Dense(2, name="z_mean")(x)
 logvar = layers.Dense(2, name="z_log_var")(x)
 z = Sampling()([mean, logvar])
 encoder = keras.Model(encoder_inputs, [mean, logvar, z], name="encoder")
 encoder.summary()

Output:
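
The printed summary is not reproduced here. As a rough cross-check, the feature-map sizes and the parameter count it reports can be derived by hand from the layer definitions above. The helper names below are illustrative; the formulas assume `padding="same"` convolutions with square kernels and one bias per output channel or unit.

```python
import math

def conv_same_out(size, stride):
    """Spatial output size of a convolution with padding='same'."""
    return math.ceil(size / stride)

def conv_params(kernel, in_ch, out_ch):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return kernel * kernel * in_ch * out_ch + out_ch

def dense_params(in_units, out_units):
    """Weights plus biases of a fully connected layer."""
    return in_units * out_units + out_units

h1 = conv_same_out(28, 2)          # after the first Conv2D
h2 = conv_same_out(h1, 2)          # after the second Conv2D
flat = h2 * h2 * 64                # length of the flattened vector

encoder_params = (
    conv_params(3, 1, 32)          # first Conv2D
    + conv_params(3, 32, 64)       # second Conv2D
    + dense_params(flat, 16)       # Dense(16)
    + 2 * dense_params(16, 2)      # z_mean and z_log_var heads
)
print(h1, h2, flat, encoder_params)  # 14 7 3136 69076
```

The sampling layer adds no trainable parameters, so it does not appear in the sum.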

Plotting the model is always a great way to ensure shapes and workflow.

 keras.utils.plot_model(encoder, show_shapes=True, dpi=64)

Output:

Build a decoder that takes the latent vector produced by the encoder, upsamples it with transpose convolutions, and produces a synthetic image of size 28 by 28.

 latent_inputs = keras.Input(shape=(2,))
 x = layers.Dense(7 * 7 * 64, activation="relu")(latent_inputs)
 # form 7 by 7 feature map
 x = layers.Reshape((7, 7, 64))(x)
 # form 14 by 14 feature map
 x = layers.Conv2DTranspose(64, 3, activation="relu", strides=2, padding="same")(x)
 # form 28 by 28 feature map
 x = layers.Conv2DTranspose(32, 3, activation="relu", strides=2, padding="same")(x)
 # form the sigmoid output - single image
 decoder_outputs = layers.Conv2DTranspose(1, 3, activation="sigmoid", padding="same")(x)
 decoder = keras.Model(latent_inputs, decoder_outputs, name="decoder")
 decoder.summary()

Output:
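
The printed summary is not reproduced here. The upsampling path can be checked by hand instead: with `padding="same"`, a transpose convolution multiplies the spatial size by its stride. The helper names below are illustrative, and the parameter formula assumes square kernels with one bias per output channel.

```python
def conv_transpose_same_out(size, stride):
    """Spatial output size of a transpose convolution with padding='same'."""
    return size * stride

def conv_transpose_params(kernel, in_ch, out_ch):
    """Weights plus biases of a 2D transpose convolution with a square kernel."""
    return kernel * kernel * in_ch * out_ch + out_ch

sizes = [7]                                              # after Reshape((7, 7, 64))
for stride in (2, 2, 1):                                 # the three Conv2DTranspose layers
    sizes.append(conv_transpose_same_out(sizes[-1], stride))

decoder_params = (
    2 * (7 * 7 * 64) + 7 * 7 * 64                        # Dense(7 * 7 * 64) on a 2-d input
    + conv_transpose_params(3, 64, 64)
    + conv_transpose_params(3, 64, 32)
    + conv_transpose_params(3, 32, 1)
)
print(sizes, decoder_params)                             # [7, 14, 28, 28] 65089
```

The last layer uses the default stride of 1, so it keeps the 28 by 28 size and only reduces the 32 feature maps to a single output channel.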

Plot the decoder to get a better understanding.

 keras.utils.plot_model(decoder, show_shapes=True, dpi=64)

Output:

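Let’s formulate the training procedure by customizing the losses and metrics as in the original research paper. The total loss is the sum of two terms: a reconstruction loss, which is the binary cross-entropy between the original input image and the reconstructed (generated) image, and a KL-divergence loss, which keeps the Gaussian produced by the encoder close to a standard normal distribution.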

Training the Model

 class VAE(keras.Model):
     def __init__(self, encoder, decoder, **kwargs):
         super(VAE, self).__init__(**kwargs)
         self.encoder = encoder
         self.decoder = decoder
         self.total_loss_tracker = keras.metrics.Mean(name="total_loss")
         self.reconstruction_loss_tracker = keras.metrics.Mean(
             name="reconstruction_loss"
         )
         self.kl_loss_tracker = keras.metrics.Mean(name="kl_loss")
     @property
     def metrics(self):
         return [
             self.total_loss_tracker,
             self.reconstruction_loss_tracker,
             self.kl_loss_tracker,
         ]
     def train_step(self, data):
         with tf.GradientTape() as tape:
             mean, logvar, z = self.encoder(data)
             reconstruction = self.decoder(z)
             reconstruction_loss = tf.reduce_mean(
                 tf.reduce_sum(
                     keras.losses.binary_crossentropy(data, reconstruction), axis=(1, 2)
                 )
             )
             kl_loss = -0.5 * (1 + logvar - tf.square(mean) - tf.exp(logvar))
             kl_loss = tf.reduce_mean(tf.reduce_sum(kl_loss, axis=1))
             total_loss = reconstruction_loss + kl_loss
         grads = tape.gradient(total_loss, self.trainable_weights)
         self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
         self.total_loss_tracker.update_state(total_loss)
         self.reconstruction_loss_tracker.update_state(reconstruction_loss)
         self.kl_loss_tracker.update_state(kl_loss)
         return {
             "loss": self.total_loss_tracker.result(),
             "reconstruction_loss": self.reconstruction_loss_tracker.result(),
             "kl_loss": self.kl_loss_tracker.result(),
         }
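
The two loss terms inside `train_step` can be reproduced with plain NumPy, which makes it easier to check their behaviour on small inputs. This is a sketch under stated assumptions, not the Keras implementation: the function names are illustrative, the images have a single channel (so summing over all pixels matches the reduction above), and the clipping constant `eps` is a choice made here to avoid `log(0)`.

```python
import numpy as np

def reconstruction_loss(x, x_hat, eps=1e-7):
    """Binary cross-entropy summed over pixels, averaged over the batch."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)      # avoid log(0)
    bce = -(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))
    return bce.sum(axis=(1, 2, 3)).mean()

def kl_loss(mean, logvar):
    """KL divergence from N(mean, exp(logvar)) to N(0, I), averaged over the batch."""
    kl = -0.5 * (1.0 + logvar - mean**2 - np.exp(logvar))
    return kl.sum(axis=1).mean()

rng = np.random.default_rng(0)
x = (rng.random((4, 28, 28, 1)) > 0.5).astype("float64")   # toy binary images
x_hat = rng.random((4, 28, 28, 1))                          # toy reconstructions

print(reconstruction_loss(x, x_hat))                        # large: random guesses
print(reconstruction_loss(x, x))                            # near zero: perfect copy
print(kl_loss(np.ones((4, 2)), np.zeros((4, 2))))           # penalty for mean != 0
```

The KL term vanishes only when the encoder outputs a mean of 0 and a log-variance of 0, that is, exactly the standard normal distribution.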

We have built our model and defined the losses and metrics required to train it. We can compile the model with Adam optimizer and train it over 30 epochs with a batch size of 128.

 vae = VAE(encoder, decoder)
 vae.compile(optimizer=keras.optimizers.Adam())
 history = vae.fit(data, epochs=30, batch_size=128)

A portion of the output:
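
The training log is not reproduced here. One detail of it that can be checked by hand is the number of steps shown per epoch, assuming the merged dataset of 70000 images and Keras's default behaviour of keeping the final, smaller batch.

```python
import math

num_images = 60000 + 10000          # merged training and validation sets
batch_size = 128

steps_per_epoch = math.ceil(num_images / batch_size)
last_batch = num_images - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, last_batch)  # 547 112
```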

Sample Image Generation

The model is trained with the input data. It is now ready to generate images that look close to the original images. To generate images, we pick points in the two-dimensional latent space and pass them to the decoder.

 def plot_latent_space(vae, n=16, figsize=8):
     # display an n*n 2D manifold of fashion data
     digit_size = 28
     scale = 1.0
     figure = np.zeros((digit_size * n, digit_size * n))
     # linearly spaced coordinates corresponding to the 2D plot
     # of fashion classes in the latent space
     grid_x = np.linspace(-scale, scale, n)
     grid_y = np.linspace(-scale, scale, n)[::-1]
     for i, yi in enumerate(grid_y):
         for j, xi in enumerate(grid_x):
             z_sample = np.array([[xi, yi]])
             x_decoded = vae.decoder.predict(z_sample, verbose=0)
             digit = x_decoded[0].reshape(digit_size, digit_size)
             figure[
                 i * digit_size : (i + 1) * digit_size,
                 j * digit_size : (j + 1) * digit_size,
             ] = digit
     plt.figure(figsize=(figsize, figsize))
     start_range = digit_size // 2
     end_range = n * digit_size + start_range
     pixel_range = np.arange(start_range, end_range, digit_size)
     sample_range_x = np.round(grid_x, 1)
     sample_range_y = np.round(grid_y, 1)
     plt.xticks(pixel_range, sample_range_x)
     plt.yticks(pixel_range, sample_range_y)
     plt.xlabel("z[0]")
     plt.ylabel("z[1]")
     plt.imshow(figure, cmap="jet")
     plt.show()
 plot_latent_space(vae)

Output:

We can interpret the above generation as follows. Each tile is the image that the decoder produces for one point (z[0], z[1]) of the two-dimensional latent space. Moving along one axis while holding the other fixed changes the generated image gradually, so neighbouring points decode to similar images. Thus, image generation is controlled by where we sample in the latent space.
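
Besides sweeping a regular grid, new images can also be generated from randomly drawn latent vectors. The sketch below only builds such a batch of latent vectors; it assumes a standard normal prior (the distribution the KL term pulls the encoder towards), and `sample_prior` is an illustrative helper, not part of the original code.

```python
import numpy as np

def sample_prior(num_samples, latent_dim=2, seed=None):
    """Draw latent vectors from a standard normal prior N(0, I)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(size=(num_samples, latent_dim)).astype("float32")

z_batch = sample_prior(8, seed=0)
print(z_batch.shape)   # (8, 2)

# Assumption: with the trained model from above, the batch could be decoded with
# images = vae.decoder.predict(z_batch)
```

Each row of `z_batch` plays the same role as `z_sample` in `plot_latent_space`, except that its position in the latent space is random rather than taken from a grid.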

Performance Analysis of VAE

Plotting losses will give a better understanding of training performance.

 loss = history.history['loss']
 # plot loss from the 4th epoch onwards (epochs numbered from 1)
 index = np.arange(4, 31)
 plt.plot(index, loss[3:], 'o-r')
 plt.xticks(np.arange(4, 31, 2))
 plt.xlabel('Epochs')
 plt.ylabel('Total Loss')
 plt.show()

Output:

The loss keeps decreasing right up to the end of the 30th epoch. This suggests that training for more epochs could yield better performance.
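
Whether a loss curve is still improving can also be checked programmatically rather than by eye. The helper below is a hypothetical sketch, not part of the original notebook: it compares the mean loss of the most recent epochs with the mean of the epochs just before them, and the window size and threshold are arbitrary choices.

```python
def has_plateaued(losses, window=5, min_rel_improvement=0.001):
    """True if the mean loss over the last `window` epochs improved by less than
    `min_rel_improvement` (relative) compared with the preceding window."""
    if len(losses) < 2 * window:
        return False                      # not enough history to decide
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    if previous == 0:
        return True                       # nothing left to improve
    return (previous - recent) / abs(previous) < min_rel_improvement

still_falling = [100.0 - i for i in range(30)]   # toy curve that keeps decreasing
flat = [50.0] * 30                               # toy curve that has stalled

print(has_plateaued(still_falling), has_plateaued(flat))   # False True
```

Applied to `history.history['loss']`, a result of `False` would support extending the training run.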

This notebook contains the above code implementation.

Wrapping Up

This article discussed Image Generation, its various applications, and the best-known generative models. In particular, we explored the Variational Autoencoder (VAE) architecture, built it with TensorFlow, trained it on Fashion MNIST data, and generated images by sampling points in the latent space. Interested readers may try this implementation with different image data or with deeper encoder and decoder architectures (i.e., with more convolution layers and transpose convolution layers, respectively).

