3D deep learning is gaining importance nowadays, with vital applications in self-driving vehicles, autonomous robots, augmented and virtual reality, 3D graphics, and 3D games. Unlike 2D data, 3D data is complex, with more parameters and features. Collecting 3D data and transforming it from one representation to another is a tedious process; thus, 3D deep learning is more time-consuming and error-prone than 2D computer vision. Though well-performing models, datasets, metrics, graphics tools, and visualization tools have been published in recent years, integrating these different approaches is a non-trivial job for researchers and practitioners.
In this scenario, NVIDIA introduced a PyTorch-based library named Kaolin and has recently released its latest optimized version. Kaolin was published as a one-stop solution for 3D deep learning needs, intended to speed up research and ease deployment. It efficiently supports the 3D deep learning workflow from data preprocessing through model building to model deployment.
Though Kaolin was initially released with a great collection of pre-trained neural architectures, the recent optimized release removed all those pre-trained models to keep the library simple and focused on future research. However, a separate repository with the pre-trained model collection is expected to be officially released soon.
3D Representations in Kaolin
3D data can be represented in different ways, and the choice of representation has a great impact on model performance. Many models are developed in a representation-specific manner: for instance, Point Transformer and PointNet are designed to handle point cloud data, while MeshCNN is designed to handle polygon mesh data. Benchmark datasets are also prepared in a representation-specific manner. Thus, the choices available when picking a pre-trained model and a public dataset are limited.
Kaolin supports all of these 3D representations and provides a module to convert one representation into another in a single line of code. Kaolin thus bridges the gaps between different 3D models and different 3D datasets: with Kaolin, one can process any dataset with any model without hassle.
The 3D representations that Kaolin supports are:
- Polygon meshes
- Point clouds
- Voxel grids
- Signed distance functions
- RGB-D images
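To give a feel for what converting between two of these representations involves, here is a minimal pure-PyTorch sketch that voxelizes a point cloud into a binary occupancy grid. This is for illustration only; Kaolin's ops API provides optimized, batched conversions of this kind, and the function name here is our own.

```python
# Illustrative sketch (not Kaolin's implementation): convert a point cloud
# into a binary voxel occupancy grid.
import torch

def pointcloud_to_voxelgrid(points: torch.Tensor, resolution: int) -> torch.Tensor:
    """Voxelize an (N, 3) point cloud into a binary (R, R, R) occupancy grid."""
    # Normalize points into the unit cube
    mins = points.min(dim=0).values
    maxs = points.max(dim=0).values
    normalized = (points - mins) / (maxs - mins).clamp(min=1e-8)
    # Map each point to a voxel index
    idx = (normalized * resolution).long().clamp(max=resolution - 1)
    grid = torch.zeros(resolution, resolution, resolution, dtype=torch.bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

# Example: the 8 corner points of a cube occupy 8 corner voxels
points = torch.tensor([[float(x), float(y), float(z)]
                       for x in (0, 1) for y in (0, 1) for z in (0, 1)])
grid = pointcloud_to_voxelgrid(points, resolution=4)
print(int(grid.sum()))  # 8 occupied voxels
```

The reverse directions (e.g. voxel grid to mesh) are considerably more involved, which is exactly the tedium that a single library call hides.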
3D Datasets in Kaolin
Kaolin's initial release included a great collection of public datasets optimized for efficient tensor processing, including ShapeNet, PartNet, ModelNet, ScanNet, SHREC, and HumanSeg. However, in its revamped latest release, Kaolin has moved many datasets out to keep the library light and efficient. Presently, Kaolin offers the ShapeNet and ModelNet datasets through its io API. Kaolin datasets can be loaded simply using PyTorch's Dataset and DataLoader classes. Further, custom dataset loading and preprocessing are made easier. To provide a common language for data formats, Kaolin supports data in the Universal Scene Description (USD) format.
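As a rough illustration of this interface, the sketch below wraps synthetic point clouds in a custom PyTorch Dataset and batches them with a DataLoader. The random data and the class name are our own stand-ins for real files such as ShapeNet meshes.

```python
# Minimal sketch of custom 3D dataset loading with PyTorch's Dataset and
# DataLoader. The random point clouds here stand in for real data files.
import torch
from torch.utils.data import Dataset, DataLoader

class RandomPointCloudDataset(Dataset):
    """Yields (point_cloud, label) pairs; each point cloud is (num_points, 3)."""
    def __init__(self, num_samples: int = 16, num_points: int = 1024):
        self.num_samples = num_samples
        self.num_points = num_points

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        torch.manual_seed(idx)          # deterministic per-sample data
        points = torch.randn(self.num_points, 3)
        label = idx % 2                 # dummy binary class
        return points, label

dataset = RandomPointCloudDataset()
loader = DataLoader(dataset, batch_size=4, shuffle=True)
points, labels = next(iter(loader))
print(points.shape)  # torch.Size([4, 1024, 3])
```

Because Kaolin's loaders plug into this same Dataset/DataLoader interface, a model's training loop does not need to change when the data source does.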
Operators and Metrics in Kaolin
Kaolin processes 3D data using its ops API. The modules of this API perform hassle-free batching operations, conversion between different 3D representations, graph convolution layers, and random tensor generation. Further, representation-specific operations for meshes, voxel grids, and point clouds are carried out through the ops API.
In 3D deep learning, metrics are differentiable operators used to compute the losses and evaluation measures needed for training and evaluating neural networks. Kaolin's metrics API provides loss functions, metrics such as IoU (Intersection over Union) and accuracy, and regularizers such as Laplacian smoothing.
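For intuition, here is a minimal plain-PyTorch sketch of IoU on binary voxel grids; Kaolin's metrics API exposes similar measures, so treat this as an illustration rather than the library's implementation.

```python
# Illustrative IoU on binary voxel occupancy grids (not Kaolin's code).
import torch

def voxel_iou(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """IoU between two binary occupancy grids of the same shape."""
    pred = pred.bool()
    target = target.bool()
    intersection = (pred & target).sum().float()
    union = (pred | target).sum().float()
    return intersection / union.clamp(min=1)

a = torch.zeros(4, 4, 4); a[:2] = 1   # occupies one half of the grid
b = torch.zeros(4, 4, 4); b[1:3] = 1  # an overlapping half
print(float(voxel_iou(a, b)))  # 16 / 48 ≈ 0.333
```

In training, one typically uses a soft, differentiable variant of such measures (operating on predicted probabilities rather than hard booleans) so that gradients can flow.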
Differentiable Rendering in Kaolin
Differentiable rendering is one of the more complex 3D processes, in which 3D shapes and models are synthesized under supervision from 2D images. Kaolin provides a unified gateway to differentiable rendering through its render API, which performs operations such as lighting, shading, rasterization, and camera projection.
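One of these building blocks, camera projection, can be sketched in a few lines of plain PyTorch. This pinhole-camera illustration is our own simplification, not Kaolin's render API; its point is that the projection is built from differentiable tensor operations, so gradients can flow from a 2D loss back to 3D geometry.

```python
# Illustrative pinhole-camera (perspective) projection in plain PyTorch.
import torch

def perspective_project(points: torch.Tensor, focal: float = 1.0) -> torch.Tensor:
    """Project (N, 3) camera-space points (z > 0) onto the image plane.

    x' = f * x / z,  y' = f * y / z  -- the pinhole camera model.
    """
    z = points[:, 2:3].clamp(min=1e-8)
    return focal * points[:, :2] / z

pts = torch.tensor([[1.0, 2.0, 2.0],
                    [0.5, 0.5, 1.0]])
print(perspective_project(pts))
# tensor([[0.5000, 1.0000],
#         [0.5000, 0.5000]])
```

A full differentiable renderer such as DIB-R chains many such differentiable stages (projection, rasterization, shading) so that image-space losses can optimize mesh vertices and textures, as the implementation below demonstrates.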
Visualization in Kaolin
Visualization of 3D data in Kaolin is simplified despite the many parameters involved. A few external tools are available to visualize 3D data, but they are representation-specific; no single tool supports all representation types. Kaolin's visualize API is among the first tools to support all kinds of 3D representations. It gives simple options to control parameters such as colour, view, projection, lighting, shading, and texturing.
Code Implementation
Requirements are Python 3.6+, CUDA 10.0+ and NVCC.
Install dependencies using the following command.
!pip install torch scipy cython Pillow usd-core
Install Kaolin and the third-party dependencies from the official source code.
!git clone --recursive https://github.com/NVIDIAGameWorks/kaolin
%cd kaolin
Output:
Build and install Kaolin in development mode using the following command.
!python setup.py develop
A portion of the output:
Verify that the correct version is installed using the following command.
# test installation
!python -c "import kaolin; print(kaolin.__version__)"
Output:
Install the other requirements and perform a complete test of the installation.
!pip install -r tools/ci_requirements.txt
!pytest tests/python/
A portion of the output:
Let’s perform an image rendering task. Most of this code implementation is based on this official notebook.
Import necessary libraries and modules.
import json
import os
import glob
import time

from PIL import Image
import torch
import numpy as np
from matplotlib import pyplot as plt

import kaolin as kal
# path to the rendered image (using the data synthesizer)
rendered_path = "/content/kaolin/examples/samples/rendered_clock/"
# path to the output logs (readable with the training visualizer in the omniverse app)
logs_path = './logs/'

# We initialize the timelapse that will store USD for the visualization apps
timelapse = kal.visualize.Timelapse(logs_path)
Set hyperparameters necessary for image rendering.
# Hyperparameters
num_epoch = 40
batch_size = 2
laplacian_weight = 0.1
flat_weight = 0.001
image_weight = 0.1
mask_weight = 1.
lr = 5e-2
scheduler_step_size = 15
scheduler_gamma = 0.5
texture_res = 400

# select camera angle for best visualization
test_batch_ids = [2, 5, 10]
test_batch_size = len(test_batch_ids)
Sample data for image rendering is available with the source code in zipped format. Unzip it using the following command.
%cd /content/kaolin/examples/samples/
!unzip rendered_clock.zip
A portion of the output:
Check for proper extraction.
!ls /content/kaolin/examples/samples -p
Output:
Load the data using the DataLoader class.
num_views = len(glob.glob(os.path.join(rendered_path, '*_rgb.png')))
train_data = []
for i in range(num_views):
    data = kal.io.render.import_synthetic_view(
        rendered_path, i, rgb=True, semantic=True)
    train_data.append(data)

dataloader = torch.utils.data.DataLoader(train_data, batch_size=batch_size,
                                         shuffle=True, pin_memory=True)
Load the built-in sphere template for training.
mesh = kal.io.obj.import_mesh('/content/kaolin/examples/samples/sphere.obj',
                              with_materials=True)
# the sphere is usually too small (this is fine-tuned for the clock)
vertices = mesh.vertices.cuda().unsqueeze(0) * 75
vertices.requires_grad = True
faces = mesh.faces.cuda()
uvs = mesh.uvs.cuda().unsqueeze(0)
face_uvs_idx = mesh.face_uvs_idx.cuda()

face_uvs = kal.ops.mesh.index_vertices_by_faces(uvs, face_uvs_idx).detach()
face_uvs.requires_grad = False

texture_map = torch.ones((1, 3, texture_res, texture_res), dtype=torch.float,
                         device='cuda', requires_grad=True)

# The topology of the mesh and the uvs are constant,
# so we can initialize them on the first iteration only
timelapse.add_mesh_batch(
    iteration=0,
    category='optimized_mesh',
    faces_list=[mesh.faces.cpu()],
    uvs_list=[mesh.uvs.cpu()],
    face_uvs_idx_list=[mesh.face_uvs_idx.cpu()],
)
Set up the losses and regularizer for training and evaluation.
## Separate vertices center as a learnable parameter
vertices_init = vertices.detach()
vertices_init.requires_grad = False

# This is the center of the optimized mesh;
# separating it as a learnable parameter helps the optimization.
vertice_shift = torch.zeros((3,), dtype=torch.float, device='cuda',
                            requires_grad=True)

def recenter_vertices(vertices, vertice_shift):
    """Recenter vertices on vertice_shift for better optimization"""
    vertices_min = vertices.min(dim=1, keepdim=True)[0]
    vertices_max = vertices.max(dim=1, keepdim=True)[0]
    vertices_mid = (vertices_min + vertices_max) / 2
    vertices = vertices - vertices_mid + vertice_shift
    return vertices

nb_faces = faces.shape[0]
nb_vertices = vertices_init.shape[1]
face_size = 3

## Set up auxiliary connectivity matrix of edges to faces indexes for the flat loss
edges = torch.cat([faces[:, i:i + 2] for i in range(face_size - 1)] +
                  [faces[:, [-1, 0]]], dim=0)
edges = torch.sort(edges, dim=1)[0]
face_ids = torch.arange(nb_faces, device='cuda', dtype=torch.long).repeat(face_size)
edges, edges_ids = torch.unique(edges, sorted=True, return_inverse=True, dim=0)
nb_edges = edges.shape[0]

# edge to faces
sorted_edges_ids, order_edges_ids = torch.sort(edges_ids)
sorted_faces_ids = face_ids[order_edges_ids]

# indices of first occurrences of each key
idx_first = torch.where(
    torch.nn.functional.pad(sorted_edges_ids[1:] != sorted_edges_ids[:-1],
                            (1, 0), value=1))[0]
nb_faces_per_edge = idx_first[1:] - idx_first[:-1]

# compute sub_idx (2nd axis indices to store the faces)
offsets = torch.zeros(sorted_edges_ids.shape[0], device='cuda', dtype=torch.long)
offsets[idx_first[1:]] = nb_faces_per_edge
sub_idx = (torch.arange(sorted_edges_ids.shape[0], device='cuda', dtype=torch.long) -
           torch.cumsum(offsets, dim=0))
nb_faces_per_edge = torch.cat([nb_faces_per_edge,
                               sorted_edges_ids.shape[0] - idx_first[-1:]],
                              dim=0)
max_sub_idx = 2
edge2faces = torch.zeros((nb_edges, max_sub_idx), device='cuda', dtype=torch.long)
edge2faces[sorted_edges_ids, sub_idx] = sorted_faces_ids

## Set up auxiliary laplacian matrix for the laplacian loss
vertices_laplacian_matrix = kal.ops.mesh.uniform_laplacian(nb_vertices, faces)
Set up the Adam optimizer and learning rate scheduler.
optim = torch.optim.Adam(params=[vertices, texture_map, vertice_shift], lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optim, step_size=scheduler_step_size,
                                            gamma=scheduler_gamma)
Perform training with the above setup for 40 epochs.
for epoch in range(num_epoch):
    for idx, data in enumerate(dataloader):
        optim.zero_grad()
        gt_image = data['rgb'].cuda()
        gt_mask = data['semantic'].cuda()
        cam_transform = data['metadata']['cam_transform'].cuda()
        cam_proj = data['metadata']['cam_proj'].cuda()

        ### Prepare mesh data with projection regarding to camera ###
        vertices_batch = recenter_vertices(vertices, vertice_shift)

        face_vertices_camera, face_vertices_image, face_normals = \
            kal.render.mesh.prepare_vertices(
                vertices_batch.repeat(batch_size, 1, 1),
                faces, cam_proj, camera_transform=cam_transform
            )

        ### Perform Rasterization ###
        # Construct attributes that DIB-R rasterizer will interpolate:
        # the first is the UVs associated to each face,
        # the second will make a hard segmentation mask
        face_attributes = [
            face_uvs.repeat(batch_size, 1, 1, 1),
            torch.ones((batch_size, nb_faces, 3, 1), device='cuda')
        ]
        image_features, soft_mask, face_idx = kal.render.mesh.dibr_rasterization(
            gt_image.shape[1], gt_image.shape[2],
            face_vertices_camera[:, :, :, -1],
            face_vertices_image,
            face_attributes,
            face_normals[:, :, -1])

        # image_features is a tuple composed of the interpolated attributes of face_attributes
        texture_coords, mask = image_features
        image = kal.render.mesh.texture_mapping(texture_coords,
                                                texture_map.repeat(batch_size, 1, 1, 1),
                                                mode='bilinear')
        image = torch.clamp(image * mask, 0., 1.)

        ### Compute Losses ###
        image_loss = torch.mean(torch.abs(image - gt_image))
        mask_loss = kal.metrics.render.mask_iou(soft_mask, gt_mask.squeeze(-1))

        # laplacian loss
        vertices_mov = vertices - vertices_init
        vertices_mov_laplacian = torch.matmul(vertices_laplacian_matrix, vertices_mov)
        laplacian_loss = torch.mean(vertices_mov_laplacian ** 2) * nb_vertices * 3

        # flat loss
        mesh_normals_e1 = face_normals[:, edge2faces[:, 0]]
        mesh_normals_e2 = face_normals[:, edge2faces[:, 1]]
        faces_cos = torch.sum(mesh_normals_e1 * mesh_normals_e2, dim=2)
        flat_loss = torch.mean((faces_cos - 1) ** 2) * edge2faces.shape[0]

        loss = (
            image_loss * image_weight +
            mask_loss * mask_weight +
            laplacian_loss * laplacian_weight +
            flat_loss * flat_weight
        )

        ### Update the mesh ###
        loss.backward()
        optim.step()

    scheduler.step()
    print(f"Epoch {epoch} - loss: {float(loss)}")

    ### Write 3D Checkpoints ###
    pbr_material = [
        {'rgb': kal.io.materials.PBRMaterial(
            diffuse_texture=torch.clamp(texture_map[0], 0., 1.))}
    ]
    vertices_batch = recenter_vertices(vertices, vertice_shift)

    # We are now adding a new state of the mesh to the timelapse;
    # we only modify the texture and the vertices position
    timelapse.add_mesh_batch(
        iteration=epoch,
        category='optimized_mesh',
        vertices_list=[vertices_batch[0]],
        materials_list=pbr_material
    )
A portion of the output:
Visualize the rendered image.
with torch.no_grad():
    # This is similar to a training iteration (without the loss part)
    data_batch = [train_data[idx] for idx in test_batch_ids]
    cam_transform = torch.stack([data['metadata']['cam_transform']
                                 for data in data_batch], dim=0).cuda()
    cam_proj = torch.stack([data['metadata']['cam_proj']
                            for data in data_batch], dim=0).cuda()

    vertices_batch = recenter_vertices(vertices, vertice_shift)

    face_vertices_camera, face_vertices_image, face_normals = \
        kal.render.mesh.prepare_vertices(
            vertices_batch.repeat(test_batch_size, 1, 1),
            faces, cam_proj, camera_transform=cam_transform
        )
    face_attributes = [
        face_uvs.repeat(test_batch_size, 1, 1, 1),
        torch.ones((test_batch_size, nb_faces, 3, 1), device='cuda')
    ]
    image_features, soft_mask, face_idx = kal.render.mesh.dibr_rasterization(
        256, 256,
        face_vertices_camera[:, :, :, -1],
        face_vertices_image,
        face_attributes,
        face_normals[:, :, -1])
    texture_coords, mask = image_features
    image = kal.render.mesh.texture_mapping(texture_coords,
                                            texture_map.repeat(test_batch_size, 1, 1, 1),
                                            mode='bilinear')
    image = torch.clamp(image * mask, 0., 1.)

    ## Display the rendered images
    f, axarr = plt.subplots(1, test_batch_size, figsize=(7, 22))
    f.subplots_adjust(top=0.99, bottom=0.79, left=0., right=1.4)
    f.suptitle('DIB-R rendering', fontsize=30)
    for i in range(test_batch_size):
        axarr[i].imshow(image[i].cpu().detach())

    ## Display the texture
    plt.figure(figsize=(10, 10))
    plt.title('2D Texture Map', fontsize=30)
    plt.imshow(torch.clamp(texture_map[0], 0., 1.).cpu().detach().permute(1, 2, 0))
Output:
Wrapping Up
This article has discussed NVIDIA’s Kaolin library, a one-stop solution for 3D deep learning tasks. We covered the features of the Kaolin library along with code for installing the library and performing a sample image rendering task.
References:
- Official Website
- Documentation (version 0.9.0)
- Source Code Repo (version 0.9.0)
- Original Research Paper (version 0.1.0)