
What Is NeX? Guide To Real-Time View Synthesis With Python Code

NeX is a new scene representation based on MPI that models view-dependent effects by performing basis expansion on the pixel representation.


NeX

NeX is a new scene representation based on the multiplane image (MPI) that models view-dependent effects by performing basis expansion on the pixel representation. Rather than simply storing static colour values as in a traditional MPI, NeX represents each colour as a function of the viewing angle and approximates this function using a linear combination of learnable spherical basis functions. Moreover, it uses a hybrid parameter modeling strategy that models high-frequency details in an explicit structure within an implicit MPI modeling framework. This helps improve fine details that are difficult to model by a neural network and produces sharper results in fewer training iterations. NeX also introduced a new dataset, Shiny, designed to test the limits of view-dependent modeling with significantly more challenging effects such as rainbow reflections on a CD and refraction through a test tube.

Approach & Architecture 

A multiplane image (MPI) is a 3D scene representation consisting of a collection of D planar images, each with dimensions H × W × 4, where the last dimension contains RGB values and an alpha transparency value. The planes are scaled and placed equidistantly either in depth space (for bounded close-up objects) or in inverse depth space (for scenes that extend out to infinity) along a reference viewing frustum.
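To make the plane placement concrete, here is a minimal NumPy sketch (the function name is illustrative, not from the NeX codebase) that spaces D planes uniformly either in depth or in inverse depth (disparity):

```python
import numpy as np

def plane_depths(near, far, num_planes, inverse=True):
    """Depths of num_planes MPI planes between near and far.

    inverse=True spaces the planes uniformly in disparity (1/depth),
    which allocates more planes close to the camera -- the choice used
    for scenes that extend out to infinity. inverse=False spaces them
    uniformly in depth, suitable for bounded close-up objects.
    """
    if inverse:
        return 1.0 / np.linspace(1.0 / near, 1.0 / far, num_planes)
    return np.linspace(near, far, num_planes)

depths = plane_depths(1.0, 100.0, 4)
# depths increase monotonically from near (1.0) to far (100.0),
# with most planes concentrated near the camera
```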

An RGBα MPI can be rendered from any target viewpoint by first warping all its planes to the target view via a homography that relates the reference and target views, and then applying the composite operator. Let c_i ∈ R^(H×W×3) and α_i ∈ R^(H×W×1) be the RGB and alpha “images” of the i-th plane respectively, ordered from back to front, and let A = {α_1, α_2, …, α_D} and C = {c_1, c_2, …, c_D} be the sets of these images. The MPI is rendered in a new view as Î = O(W(C), W(A)), where W is the homography warping function and the composite operator O is:

O(C, A) = Σ_{i=1}^{D} c_i α_i ∏_{j=i+1}^{D} (1 − α_j)
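The composite operator can be sketched in a few lines of NumPy; the homography warping step is omitted here and the helper name is illustrative:

```python
import numpy as np

def composite(colors, alphas):
    """Over-composite MPI planes ordered back to front.

    colors: (D, H, W, 3) RGB image per plane
    alphas: (D, H, W, 1) alpha image per plane
    Returns the rendered (H, W, 3) image:
        O(C, A) = sum_i c_i * a_i * prod_{j > i} (1 - a_j)
    """
    num_planes = colors.shape[0]
    out = np.zeros(colors.shape[1:], dtype=colors.dtype)
    for i in range(num_planes):
        # transmittance through all planes in front of plane i
        if i + 1 < num_planes:
            trans = np.prod(1.0 - alphas[i + 1:], axis=0)
        else:
            trans = 1.0
        out += colors[i] * alphas[i] * trans
    return out

# Two planes: an opaque red plane behind a half-transparent green plane.
back = np.tile([1.0, 0.0, 0.0], (1, 2, 2, 1))
front = np.tile([0.0, 1.0, 0.0], (1, 2, 2, 1))
colors = np.concatenate([back, front])
alphas = np.stack([np.ones((2, 2, 1)), np.full((2, 2, 1), 0.5)])
img = composite(colors, alphas)  # every pixel -> [0.5, 0.5, 0.0]
```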

One main limitation of multiplane images is that they can only model Lambertian surfaces, i.e., surfaces whose colours appear constant regardless of the viewing angle. In real-world scenarios, many objects are non-Lambertian, such as a CD, a glass table, or a metal spoon. These objects exhibit view-dependent effects such as reflection and refraction. Reconstructing them with an MPI makes them appear unrealistically dull, without reflections, or even break down completely due to the violation of the brightness constancy assumption used for invariant matching and 3D reconstruction.

NeX versus standard MPI performance on a scene with CD

To allow for view-dependent modeling in NeX, the pixel colour representation is modified by parameterizing each colour value as a function of the viewing direction v = (v_x, v_y, v_z). This results in a mapping function C(v): R^3 → R^3 for every pixel. However, storing this mapping explicitly is limiting and does not generalize to new, unobserved angles. Regressing the colour directly from v (and the pixel location) with a neural network is possible but too inefficient for real-time rendering. The key idea behind NeX is to approximate this function with a linear combination of learnable basis functions {H_n(v): R^3 → R} over the spherical domain described by vector v:

C_p(v) = k_0^p + Σ_{n=1}^{N} k_n^p H_n(v)

Here k_n^p ∈ R^3 are the RGB coefficients, or reflectance parameters, of the N global basis functions for pixel p. There are several ways to define a suitable set of basis functions; the spherical harmonics basis is one common choice used heavily in computer graphics, and the Fourier or Taylor basis can also be used.
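A minimal NumPy sketch of this per-pixel basis expansion, assuming the basis values H_n(v) have already been evaluated for a given viewing direction (names and shapes are illustrative):

```python
import numpy as np

def view_dependent_color(k0, k, basis_vals):
    """Evaluate C(v) = k0 + sum_n k_n * H_n(v) for every pixel at once.

    k0:         (H, W, 3)    explicit base colour per pixel
    k:          (N, H, W, 3) per-pixel RGB coefficients of N basis functions
    basis_vals: (N,)         the N global basis values H_n(v) at direction v
    Returns the (H, W, 3) view-dependent colour image.
    """
    # contract the basis axis: sum_n basis_vals[n] * k[n]
    return k0 + np.tensordot(basis_vals, k, axes=1)

H = W = 4
N = 2
k0 = np.full((H, W, 3), 0.2)          # uniform base colour
k = np.full((N, H, W, 3), 0.1)        # uniform coefficients
basis_vals = np.array([0.5, -0.5])    # H_1(v), H_2(v) for some v
color = view_dependent_color(k0, k, basis_vals)
# the two basis terms cancel here, leaving the base colour 0.2 everywhere
```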

PSNR scores versus the number of basis coefficients for NeX (learnable neural basis functions), Fourier series (FS), Jacobi spherical harmonics (JH), hemispherical harmonics (HSH), spherical harmonics (SH), and Taylor series (TS).

However, these “fixed” basis functions have one shortcoming: the number of basis functions required to capture high-frequency changes within a narrow viewing angle can be very high. This in turn requires more reflectance parameters, which makes both learning the parameters and rendering them more difficult. With learnable basis functions, the modified NeX MPI outperforms alternative versions that use fixed bases with the same number of coefficients.

NeX uses two separate MLPs: one for predicting per-pixel parameters given the pixel location, and another for predicting all global basis functions given the viewing angle. The motivation for the second network is to ensure that the prediction of the basis functions, which are global, is not a function of the pixel location. The first MLP is modeled as Fθ with parameters θ:

F_θ(x) = (α, k_1, k_2, …, k_N)

Here x = (x, y, d) contains the location information of pixel (x, y) at plane d. The second network is modeled as Gɸ with parameter ɸ:

G_ɸ(v) = (H_1(v), H_2(v), …, H_N(v))

Here v is the normalized viewing direction. 
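The two networks can be sketched with plain NumPy as below; the layer widths, the number of basis functions N, and the positional-encoding levels are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8        # number of global basis functions (assumed)
LEVELS = 4   # positional-encoding levels (assumed)

def positional_encoding(x, levels):
    """Map each coordinate to [x, sin(2^l x), cos(2^l x)] for l < levels."""
    feats = [x]
    for l in range(levels):
        feats += [np.sin(2.0 ** l * x), np.cos(2.0 ** l * x)]
    return np.concatenate(feats, axis=-1)

def mlp(x, weights):
    """Tiny MLP: alternating linear layers with ReLU in between."""
    for i, w in enumerate(weights):
        x = x @ w
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

in_dim = 3 * (2 * LEVELS + 1)  # encoded (x, y, d) or (vx, vy, vz)
# F_theta: encoded pixel location -> alpha plus N RGB coefficients k_1..k_N
f_theta = [rng.normal(size=(in_dim, 128)), rng.normal(size=(128, 1 + 3 * N))]
# G_phi: encoded viewing direction -> the N global basis values H_n(v)
g_phi = [rng.normal(size=(in_dim, 64)), rng.normal(size=(64, N))]

x = positional_encoding(rng.random((1, 3)), LEVELS)  # a pixel location
v = positional_encoding(rng.random((1, 3)), LEVELS)  # a viewing direction
params = mlp(x, f_theta)  # shape (1, 1 + 3N): alpha and coefficients
basis = mlp(v, g_phi)     # shape (1, N): H_1(v) .. H_N(v)
```

Because G_ɸ depends only on v, it is evaluated once per view at render time, while F_θ's outputs can be precomputed per pixel.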

Fine details are lost when a traditional MLP is used to model the k_n, or “coefficient images”. In view-synthesis problems, these fine details tend to come from the surface texture itself rather than from complex scene geometry. NeX uses positional encoding to regress these images, which helps to an extent but still produces blurry results. During experimentation, the authors found a simple fix: storing the first coefficient k_0, or “base color,” explicitly reduces the network’s burden of compressing and reproducing detail and leads to sharper results in fewer iterations. With this implicit-explicit modeling strategy, NeX predicts every parameter with MLPs except k_0, which is optimized explicitly as a learnable parameter with a total variation regularizer.
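A minimal sketch of the total variation term applied to the explicit base-colour image (the function name and shapes are illustrative):

```python
import numpy as np

def total_variation(k0):
    """Total variation of an explicit base-colour image k0, shape (H, W, 3).

    Sums absolute differences between vertically and horizontally adjacent
    pixels; adding this term to the loss keeps the directly optimized k0
    smooth while the MLPs handle the remaining parameters.
    """
    dh = np.abs(np.diff(k0, axis=0)).sum()  # vertical neighbours
    dw = np.abs(np.diff(k0, axis=1)).sum()  # horizontal neighbours
    return dh + dw

flat = np.full((4, 4, 3), 0.5)   # constant image
edge = np.zeros((2, 2, 3))
edge[1, :, :] = 1.0              # one vertical step of height 1
tv_flat = total_variation(flat)  # 0.0: no neighbour differences
tv_edge = total_variation(edge)  # 6.0: 2 pixels x 3 channels x step of 1
```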

Real-time View Synthesis using NeX

Requirements
  1. Install COLMAP and lpips. FFmpeg and the other Python dependencies are already installed in Colab.
 !pip install lpips
 !apt install colmap 
  2. Clone the NeX GitHub repository and navigate into the newly created nex-code directory.
 !git clone https://github.com/nex-mpi/nex-code
 %cd nex-code
  3. Select a scene, make running directories, and download the selected dataset from OneDrive.

You can also use your own images, but you’ll need at least 12 images for NeX to work. In addition, downscaling the images to a 400-pixel width is recommended for fast upload and training.

 scene_urls = {
         'cake': 'https://vistec-my.sharepoint.com/:u:/g/personal/pakkapon_p_s19_vistec_ac_th/ESg8LNsTqmtFmKO-9X4dUsUBVgfw_TbuAheVAEKnsiouug?download=1',
         'crest': 'https://vistec-my.sharepoint.com/:u:/g/personal/pakkapon_p_s19_vistec_ac_th/EYqAlbiZqO1GsiAg-HgEi34B3cBL3tuaFQxg5fyrV5Prew?download=1',
         'giants':  'https://vistec-my.sharepoint.com/:u:/g/personal/pakkapon_p_s19_vistec_ac_th/EUx6wPzSVRtMhpinHKF9ArcBE_4c98xxJLAGSCaM54MiJQ?download=1',
         'room': 'https://vistec-my.sharepoint.com/:u:/g/personal/pakkapon_p_s19_vistec_ac_th/ERVHMv2NeOtKgFLGRJ22jgMBdo3BqCQIfd27MFgLvNOW5w?download=1',
         'seasoning': 'https://vistec-my.sharepoint.com/:u:/g/personal/pakkapon_p_s19_vistec_ac_th/EedXEIqliIZGk-6fxd-cb9cBsUjidu9G5du1TIYOF5FOyQ?download=1',
         'sushi': 'https://vistec-my.sharepoint.com/:u:/g/personal/pakkapon_p_s19_vistec_ac_th/EZZA-3nyCBVLtIra5yMZzC0BFx3f4wqg1cm8rKzTAt2x0g?download=1',
     }


 scene = "room"
 onedrive_dataset = scene_urls[scene]

 # make directories for running
 !mkdir -p data/demo
 !mkdir -p runs 

 # download the dataset
 get_ipython().system_raw('wget -O data/demo/data.zip {}'.format(onedrive_dataset))
 get_ipython().system_raw('unzip -o -d data/demo/ data/demo/data.zip')
 get_ipython().system_raw('rm data/demo/data.zip') 
  4. Set parameters for training.
 epochs = 40
 image_width = 400
 import math
 pos_level = math.ceil(math.log(image_width) / math.log(2))
 num_offset = int(image_width / 5.0)
 web_width = 4096 if image_width <= 400 else 16000
  5. Train NeX on the downloaded images.
!python train.py -scene data/demo -model_dir demo -layers 12 -sublayers 6 -epochs $epochs -offset $num_offset -tb_toc 1 -hidden 128 -pos_level $pos_level -depth_level 7 -tb_saveimage 2 -num_workers 2 -llff_width $image_width -web_width=$web_width

Training takes around 10 minutes for the preset images and around 20 minutes for your own images.

  6. Display the generated video.
 from IPython.display import HTML
 from base64 import b64encode
 video_path = "runs/video_output/demo/video.mp4"
 mp4 = open(video_path, "rb").read()
 data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
 HTML(f"""
 <video width=400 controls playsinline autoplay muted loop>
       <source src="{data_url}" type="video/mp4">
 </video>
 """) 
Output video created by NeX

Last Epoch (Endnote)

Performance of NeX on different datasets

This article discussed NeX, a new approach to novel view synthesis that combines a multiplane image (MPI) representation with neural basis expansion. Although NeX is effective at capturing and reproducing complex view-dependent effects, it is based on the MPI and inherits its limitations. When viewed from an angle too far from the center, “stack of cards” artifacts expose the individual MPI planes. NeX also cannot fully reproduce the hardest scenes in the Shiny dataset, which include effects like light sparkles, extremely sharp highlights, and refraction through test tubes.

References

To learn more about NeX refer to the following resources:

Want to learn more about view-synthesis? Check out our guide to Intel’s Stable View Synthesis.


Aditya Singh

A machine learning enthusiast with a knack for finding patterns. In my free time, I like to delve into the world of non-fiction books and video essays.