3D deep learning is gaining importance nowadays, with vital applications in self-driving vehicles, autonomous robots, augmented and virtual reality, 3D graphics, and 3D games. Unlike 2D data, 3D data is complex, with more parameters and features. Collecting 3D data and transforming it from one representation to another is a tedious process; thus, 3D deep learning is more time-consuming and error-prone than 2D computer vision. Though well-performing models, datasets, metrics, graphics tools, and visualization tools have been published in recent years, integrating these different approaches is a non-trivial job for researchers and practitioners.
In this scenario, NVIDIA introduced a PyTorch-based library named Kaolin and has recently released its latest optimized version. Kaolin was published as a one-stop solution for 3D deep learning needs, intended to speed up research and ease deployment. It efficiently supports the 3D deep learning workflow from data preprocessing through model building to model deployment.
Though Kaolin was initially released with a great collection of pre-trained neural architectures, the recent optimized release removed all those pre-trained models to keep the library simple and focused on future research. However, a separate repository with the pre-trained model collection is expected to be officially released soon.
3D Representations in Kaolin
3D data can be represented in different ways, and the choice of representation has a great impact on model performance. Many models are developed in a representation-specific manner: for instance, Point Transformer and PointNet are designed to handle point cloud data, while MeshCNN is designed to handle polygon mesh data. Benchmark datasets are also prepared in a representation-specific manner. Thus, the choices available when picking a pre-trained model and a public dataset are limited.
Kaolin supports all of these 3D representations and provides a module to convert one representation into another in a single line of code. Kaolin thus bridges the gaps between different 3D models and different 3D datasets: with Kaolin, one can process any dataset with any model without hassle.
The 3D representations that Kaolin supports are:
- Polygon meshes
- Point clouds
- Voxel grids
- Signed distance functions
- RGB-D images
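To give a feel for what converting between two of these representations involves, here is a minimal pure-PyTorch sketch that voxelizes a point cloud into a binary occupancy grid. This is for illustration only; Kaolin's ops API provides optimized, batched conversions of this kind, and the function name here is our own.

```python
# Illustrative sketch (not Kaolin's implementation): convert a point cloud
# into a binary voxel occupancy grid.
import torch

def pointcloud_to_voxelgrid(points: torch.Tensor, resolution: int) -> torch.Tensor:
    """Voxelize an (N, 3) point cloud into a binary (R, R, R) occupancy grid."""
    # Normalize points into the unit cube
    mins = points.min(dim=0).values
    maxs = points.max(dim=0).values
    normalized = (points - mins) / (maxs - mins).clamp(min=1e-8)
    # Map each point to a voxel index
    idx = (normalized * resolution).long().clamp(max=resolution - 1)
    grid = torch.zeros(resolution, resolution, resolution, dtype=torch.bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

# Example: the 8 corner points of a cube occupy 8 corner voxels
points = torch.tensor([[float(x), float(y), float(z)]
                       for x in (0, 1) for y in (0, 1) for z in (0, 1)])
grid = pointcloud_to_voxelgrid(points, resolution=4)
print(int(grid.sum()))  # 8 occupied voxels
```

The reverse directions (e.g. voxel grid to mesh) are considerably more involved, which is exactly the tedium that a single library call hides.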
3D Datasets in Kaolin
Kaolin's initial release included a great collection of public datasets optimized for efficient tensor processing, including ShapeNet, PartNet, ModelNet, ScanNet, SHREC, and HumanSeg. However, in its revamped latest release, Kaolin has moved many datasets out to keep the library light and efficient. Presently, Kaolin offers the ShapeNet and ModelNet datasets through its io API. Kaolin datasets can be loaded simply using PyTorch's Dataset and DataLoader classes. Further, custom dataset loading and preprocessing are made easier. To provide a common language for data formats, Kaolin supports data in the Universal Scene Description (USD) format.
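As a rough illustration of this interface, the sketch below wraps synthetic point clouds in a custom PyTorch Dataset and batches them with a DataLoader. The random data and the class name are our own stand-ins for real files such as ShapeNet meshes.

```python
# Minimal sketch of custom 3D dataset loading with PyTorch's Dataset and
# DataLoader. The random point clouds here stand in for real data files.
import torch
from torch.utils.data import Dataset, DataLoader

class RandomPointCloudDataset(Dataset):
    """Yields (point_cloud, label) pairs; each point cloud is (num_points, 3)."""
    def __init__(self, num_samples: int = 16, num_points: int = 1024):
        self.num_samples = num_samples
        self.num_points = num_points

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        torch.manual_seed(idx)          # deterministic per-sample data
        points = torch.randn(self.num_points, 3)
        label = idx % 2                 # dummy binary class
        return points, label

dataset = RandomPointCloudDataset()
loader = DataLoader(dataset, batch_size=4, shuffle=True)
points, labels = next(iter(loader))
print(points.shape)  # torch.Size([4, 1024, 3])
```

Because Kaolin's loaders plug into this same Dataset/DataLoader interface, a model's training loop does not need to change when the data source does.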
Operators and Metrics in Kaolin
Kaolin processes 3D data using its ops API. The modules of this API perform hassle-free batching operations, conversion between different 3D representations, graph convolution layers, and random tensor generation. Further, representation-specific operations for meshes, voxel grids, and point clouds are carried out through the ops API.
In 3D deep learning, metrics are differentiable operators used to compute the losses and evaluation measures needed for training and evaluating neural networks. Kaolin's metrics API provides loss functions, metrics such as IoU (Intersection over Union) and accuracy, and regularizers such as Laplacian smoothing.
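For intuition, here is a minimal plain-PyTorch sketch of IoU on binary voxel grids; Kaolin's metrics API exposes similar measures, so treat this as an illustration rather than the library's implementation.

```python
# Illustrative IoU on binary voxel occupancy grids (not Kaolin's code).
import torch

def voxel_iou(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """IoU between two binary occupancy grids of the same shape."""
    pred = pred.bool()
    target = target.bool()
    intersection = (pred & target).sum().float()
    union = (pred | target).sum().float()
    return intersection / union.clamp(min=1)

a = torch.zeros(4, 4, 4); a[:2] = 1   # occupies one half of the grid
b = torch.zeros(4, 4, 4); b[1:3] = 1  # an overlapping half
print(float(voxel_iou(a, b)))  # 16 / 48 ≈ 0.333
```

In training, one typically uses a soft, differentiable variant of such measures (operating on predicted probabilities rather than hard booleans) so that gradients can flow.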
Differentiable Rendering in Kaolin
Differentiable rendering is one of the more complex 3D processes, in which 3D shapes and models are synthesized under supervision from 2D images. Kaolin provides a unified gateway to differentiable rendering through its render API, which performs operations such as lighting, shading, rasterization, and camera projection.
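One of these building blocks, camera projection, can be sketched in a few lines of plain PyTorch. This pinhole-camera illustration is our own simplification, not Kaolin's render API; its point is that the projection is built from differentiable tensor operations, so gradients can flow from a 2D loss back to 3D geometry.

```python
# Illustrative pinhole-camera (perspective) projection in plain PyTorch.
import torch

def perspective_project(points: torch.Tensor, focal: float = 1.0) -> torch.Tensor:
    """Project (N, 3) camera-space points (z > 0) onto the image plane.

    x' = f * x / z,  y' = f * y / z  -- the pinhole camera model.
    """
    z = points[:, 2:3].clamp(min=1e-8)
    return focal * points[:, :2] / z

pts = torch.tensor([[1.0, 2.0, 2.0],
                    [0.5, 0.5, 1.0]])
print(perspective_project(pts))
# tensor([[0.5000, 1.0000],
#         [0.5000, 0.5000]])
```

A full differentiable renderer such as DIB-R chains many such differentiable stages (projection, rasterization, shading) so that image-space losses can optimize mesh vertices and textures, as the implementation below demonstrates.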
Visualization in Kaolin
Visualization of 3D data in Kaolin is simplified despite the many parameters involved. A few external tools are available to visualize 3D data, but they are representation-specific; no single tool supports all representation types. Kaolin's visualize API is among the first tools to support all kinds of 3D representations. It gives simple options to control parameters such as colour, view, projection, lighting, shading, and texturing.
Code Implementation
Requirements are Python 3.6+, CUDA 10.0+ and NVCC.
Install dependencies using the following command.
!pip install torch scipy cython Pillow usd-core
Install Kaolin and the third-party dependencies from the official source code.
!git clone --recursive https://github.com/NVIDIAGameWorks/kaolin
%cd kaolin
Output:
Build and install Kaolin in development mode using the following command.
!python setup.py develop
A portion of the output:
Verify that the correct version is installed using the following command.
# test installation
!python -c "import kaolin; print(kaolin.__version__)"
Output:
Install the other requirements and perform a complete test of the installation.
!pip install -r tools/ci_requirements.txt
!pytest tests/python/
A portion of the output:
Let’s perform an image rendering task. Most of this code implementation is based on this official notebook.
Import necessary libraries and modules.
import json
import os
import glob
import time

from PIL import Image
import torch
import numpy as np
from matplotlib import pyplot as plt

import kaolin as kal
# path to the rendered image (using the data synthesizer)
rendered_path = "/content/kaolin/examples/samples/rendered_clock/"
# path to the output logs (readable with the training visualizer in the omniverse app)
logs_path = './logs/'

# We initialize the timelapse that will store USD for the visualization apps
timelapse = kal.visualize.Timelapse(logs_path)
Set hyperparameters necessary for image rendering.
# Hyperparameters
num_epoch = 40
batch_size = 2
laplacian_weight = 0.1
flat_weight = 0.001
image_weight = 0.1
mask_weight = 1.
lr = 5e-2
scheduler_step_size = 15
scheduler_gamma = 0.5
texture_res = 400

# select camera angle for best visualization
test_batch_ids = [2, 5, 10]
test_batch_size = len(test_batch_ids)
Sample data for image rendering is available with the source code in zipped format. Unzip it using the following command.
%cd /content/kaolin/examples/samples/
!unzip rendered_clock.zip
A portion of the output:
Check for proper extraction.
!ls /content/kaolin/examples/samples -p
Output:
Load the data using the DataLoader class.
num_views = len(glob.glob(os.path.join(rendered_path, '*_rgb.png')))
train_data = []
for i in range(num_views):
    data = kal.io.render.import_synthetic_view(
        rendered_path, i, rgb=True, semantic=True)
    train_data.append(data)

dataloader = torch.utils.data.DataLoader(train_data, batch_size=batch_size,
                                         shuffle=True, pin_memory=True)
Load the built-in sphere template for training.
mesh = kal.io.obj.import_mesh('/content/kaolin/examples/samples/sphere.obj',
                              with_materials=True)
# the sphere is usually too small (this is fine-tuned for the clock)
vertices = mesh.vertices.cuda().unsqueeze(0) * 75
vertices.requires_grad = True
faces = mesh.faces.cuda()
uvs = mesh.uvs.cuda().unsqueeze(0)
face_uvs_idx = mesh.face_uvs_idx.cuda()

face_uvs = kal.ops.mesh.index_vertices_by_faces(uvs, face_uvs_idx).detach()
face_uvs.requires_grad = False

texture_map = torch.ones((1, 3, texture_res, texture_res), dtype=torch.float,
                         device='cuda', requires_grad=True)

# The topology of the mesh and the uvs are constant,
# so we can initialize them on the first iteration only
timelapse.add_mesh_batch(
    iteration=0,
    category='optimized_mesh',
    faces_list=[mesh.faces.cpu()],
    uvs_list=[mesh.uvs.cpu()],
    face_uvs_idx_list=[mesh.face_uvs_idx.cpu()],
)
Set up the losses and regularizer for training and evaluation.
## Separate vertices center as a learnable parameter
vertices_init = vertices.detach()
vertices_init.requires_grad = False

# This is the center of the optimized mesh;
# separating it as a learnable parameter helps the optimization.
vertice_shift = torch.zeros((3,), dtype=torch.float, device='cuda',
                            requires_grad=True)

def recenter_vertices(vertices, vertice_shift):
    """Recenter vertices on vertice_shift for better optimization"""
    vertices_min = vertices.min(dim=1, keepdim=True)[0]
    vertices_max = vertices.max(dim=1, keepdim=True)[0]
    vertices_mid = (vertices_min + vertices_max) / 2
    vertices = vertices - vertices_mid + vertice_shift
    return vertices

nb_faces = faces.shape[0]
nb_vertices = vertices_init.shape[1]
face_size = 3

## Set up auxiliary connectivity matrix of edges to faces indexes for the flat loss
edges = torch.cat([faces[:, i:i + 2] for i in range(face_size - 1)] +
                  [faces[:, [-1, 0]]], dim=0)
edges = torch.sort(edges, dim=1)[0]
face_ids = torch.arange(nb_faces, device='cuda', dtype=torch.long).repeat(face_size)
edges, edges_ids = torch.unique(edges, sorted=True, return_inverse=True, dim=0)
nb_edges = edges.shape[0]

# edge to faces
sorted_edges_ids, order_edges_ids = torch.sort(edges_ids)
sorted_faces_ids = face_ids[order_edges_ids]

# indices of first occurrences of each key
idx_first = torch.where(
    torch.nn.functional.pad(sorted_edges_ids[1:] != sorted_edges_ids[:-1],
                            (1, 0), value=1))[0]
nb_faces_per_edge = idx_first[1:] - idx_first[:-1]

# compute sub_idx (2nd axis indices to store the faces)
offsets = torch.zeros(sorted_edges_ids.shape[0], device='cuda', dtype=torch.long)
offsets[idx_first[1:]] = nb_faces_per_edge
sub_idx = (torch.arange(sorted_edges_ids.shape[0], device='cuda', dtype=torch.long) -
           torch.cumsum(offsets, dim=0))
nb_faces_per_edge = torch.cat([nb_faces_per_edge,
                               sorted_edges_ids.shape[0] - idx_first[-1:]],
                              dim=0)
max_sub_idx = 2
edge2faces = torch.zeros((nb_edges, max_sub_idx), device='cuda', dtype=torch.long)
edge2faces[sorted_edges_ids, sub_idx] = sorted_faces_ids

## Set up auxiliary laplacian matrix for the laplacian loss
vertices_laplacian_matrix = kal.ops.mesh.uniform_laplacian(nb_vertices, faces)
Set up the Adam optimizer and learning rate scheduler.
optim = torch.optim.Adam(params=[vertices, texture_map, vertice_shift], lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optim, step_size=scheduler_step_size,
                                            gamma=scheduler_gamma)
Perform training with the above setup for 40 epochs.
for epoch in range(num_epoch):
    for idx, data in enumerate(dataloader):
        optim.zero_grad()
        gt_image = data['rgb'].cuda()
        gt_mask = data['semantic'].cuda()
        cam_transform = data['metadata']['cam_transform'].cuda()
        cam_proj = data['metadata']['cam_proj'].cuda()

        ### Prepare mesh data with projection regarding to camera ###
        vertices_batch = recenter_vertices(vertices, vertice_shift)

        face_vertices_camera, face_vertices_image, face_normals = \
            kal.render.mesh.prepare_vertices(
                vertices_batch.repeat(batch_size, 1, 1),
                faces, cam_proj, camera_transform=cam_transform
            )

        ### Perform Rasterization ###
        # Construct attributes that DIB-R rasterizer will interpolate:
        # the first is the UVs associated to each face,
        # the second will make a hard segmentation mask
        face_attributes = [
            face_uvs.repeat(batch_size, 1, 1, 1),
            torch.ones((batch_size, nb_faces, 3, 1), device='cuda')
        ]
        image_features, soft_mask, face_idx = kal.render.mesh.dibr_rasterization(
            gt_image.shape[1], gt_image.shape[2],
            face_vertices_camera[:, :, :, -1],
            face_vertices_image,
            face_attributes,
            face_normals[:, :, -1])

        # image_features is a tuple composed of the interpolated attributes of face_attributes
        texture_coords, mask = image_features
        image = kal.render.mesh.texture_mapping(texture_coords,
                                                texture_map.repeat(batch_size, 1, 1, 1),
                                                mode='bilinear')
        image = torch.clamp(image * mask, 0., 1.)

        ### Compute Losses ###
        image_loss = torch.mean(torch.abs(image - gt_image))
        mask_loss = kal.metrics.render.mask_iou(soft_mask, gt_mask.squeeze(-1))

        # laplacian loss
        vertices_mov = vertices - vertices_init
        vertices_mov_laplacian = torch.matmul(vertices_laplacian_matrix, vertices_mov)
        laplacian_loss = torch.mean(vertices_mov_laplacian ** 2) * nb_vertices * 3

        # flat loss
        mesh_normals_e1 = face_normals[:, edge2faces[:, 0]]
        mesh_normals_e2 = face_normals[:, edge2faces[:, 1]]
        faces_cos = torch.sum(mesh_normals_e1 * mesh_normals_e2, dim=2)
        flat_loss = torch.mean((faces_cos - 1) ** 2) * edge2faces.shape[0]

        loss = (
            image_loss * image_weight +
            mask_loss * mask_weight +
            laplacian_loss * laplacian_weight +
            flat_loss * flat_weight
        )

        ### Update the mesh ###
        loss.backward()
        optim.step()

    scheduler.step()
    print(f"Epoch {epoch} - loss: {float(loss)}")

    ### Write 3D Checkpoints ###
    pbr_material = [
        {'rgb': kal.io.materials.PBRMaterial(
            diffuse_texture=torch.clamp(texture_map[0], 0., 1.))}
    ]
    vertices_batch = recenter_vertices(vertices, vertice_shift)

    # We are now adding a new state of the mesh to the timelapse;
    # we only modify the texture and the vertices position
    timelapse.add_mesh_batch(
        iteration=epoch,
        category='optimized_mesh',
        vertices_list=[vertices_batch[0]],
        materials_list=pbr_material
    )
A portion of the output:
Visualize the rendered image.
with torch.no_grad():
    # This is similar to a training iteration (without the loss part)
    data_batch = [train_data[idx] for idx in test_batch_ids]
    cam_transform = torch.stack([data['metadata']['cam_transform']
                                 for data in data_batch], dim=0).cuda()
    cam_proj = torch.stack([data['metadata']['cam_proj']
                            for data in data_batch], dim=0).cuda()

    vertices_batch = recenter_vertices(vertices, vertice_shift)

    face_vertices_camera, face_vertices_image, face_normals = \
        kal.render.mesh.prepare_vertices(
            vertices_batch.repeat(test_batch_size, 1, 1),
            faces, cam_proj, camera_transform=cam_transform
        )
    face_attributes = [
        face_uvs.repeat(test_batch_size, 1, 1, 1),
        torch.ones((test_batch_size, nb_faces, 3, 1), device='cuda')
    ]
    image_features, soft_mask, face_idx = kal.render.mesh.dibr_rasterization(
        256, 256,
        face_vertices_camera[:, :, :, -1],
        face_vertices_image,
        face_attributes,
        face_normals[:, :, -1])
    texture_coords, mask = image_features
    image = kal.render.mesh.texture_mapping(texture_coords,
                                            texture_map.repeat(test_batch_size, 1, 1, 1),
                                            mode='bilinear')
    image = torch.clamp(image * mask, 0., 1.)

    ## Display the rendered images
    f, axarr = plt.subplots(1, test_batch_size, figsize=(7, 22))
    f.subplots_adjust(top=0.99, bottom=0.79, left=0., right=1.4)
    f.suptitle('DIB-R rendering', fontsize=30)
    for i in range(test_batch_size):
        axarr[i].imshow(image[i].cpu().detach())

    ## Display the texture
    plt.figure(figsize=(10, 10))
    plt.title('2D Texture Map', fontsize=30)
    plt.imshow(torch.clamp(texture_map[0], 0., 1.).cpu().detach().permute(1, 2, 0))
Output:
Wrapping Up
This article has discussed NVIDIA’s Kaolin library, a one-stop solution for 3D deep learning tasks. We covered the features of the Kaolin library along with code for installing the library and performing a sample image rendering task.
References:
- Official Website
- Documentation (version 0.9.0)
- Source Code Repo (version 0.9.0)
- Original Research Paper (version 0.1.0)