Who should read this blog?
This blog aims to provide a basic, beginner-level understanding of how NeRF works through visual representations. While various blogs offer detailed explanations of NeRF, they are often geared toward readers with a strong technical background in volume rendering and 3D graphics. In contrast, this blog seeks to explain NeRF with minimal prerequisite knowledge, with optional technical snippets for curious readers. For those interested in the mathematical details behind NeRF, a list of further readings is provided at the end.
What’s NeRF and How Does It Work?
NeRF, short for Neural Radiance Fields, is a method introduced in a 2020 paper for rendering 2D images of 3D scenes. Traditional approaches rely on physics-based, computationally intensive techniques such as ray casting and ray tracing. These involve tracing a ray of light from each pixel of the 2D image back to the particles in the scene to estimate the pixel's color. While these methods offer high accuracy (e.g., images captured by phone cameras closely approximate what the human eye perceives from the same angle), they are often slow and require significant computational resources, such as GPUs, for parallel processing. As a result, implementing these methods on edge devices with limited computing capability is nearly impossible.
NeRF addresses this issue by functioning as a scene compression method. It uses an overfitted multi-layer perceptron (MLP) to encode the scene, which can then be queried from any viewing direction to generate a 2D rendered image. When properly trained, NeRF significantly reduces storage requirements; for instance, a simple 3D scene can typically be compressed into about 5 MB of data.
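To see why the figure lands in the single-digit-megabyte range, here is a rough back-of-the-envelope estimate. The layer sizes below follow the original paper's two 8-layer, 256-wide MLPs (a coarse and a fine network), but the count is deliberately approximate:

```python
# Rough storage estimate for NeRF's MLP weights (an approximation, assuming
# two 8-layer, 256-unit-wide MLPs stored as 32-bit floats; the exact count
# also depends on skip connections and the positional-encoding input width).
hidden = 256
layers = 8
params_per_mlp = layers * hidden * hidden   # weight matrices only, biases ignored
total_params = 2 * params_per_mlp           # coarse + fine network
size_mb = total_params * 4 / 1e6            # 4 bytes per float32 parameter
print(f"~{size_mb:.1f} MB")                 # ~4.2 MB, the same order as ~5 MB
```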
At its core, NeRF answers the following question using an MLP:
What will I see if I view the scene from this direction?
This question is answered by giving the MLP a 3D point in the scene and the viewing direction (in terms of two angles (θ, φ), or a unit vector) as input; the MLP outputs an RGB value (directionally emitted color) and a volume density, which are then processed through volumetric rendering to produce the final RGB value that the pixel sees. To create an image of a certain resolution (say HxW), the MLP is queried HxW times, once for each pixel's viewing direction, and the image is assembled. Since the release of the first NeRF paper, numerous updates have been made to enhance rendering quality and speed. However, this blog will focus on the original NeRF paper.
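For curious readers, here is a minimal, illustrative sketch of such a query in PyTorch. This is a toy stand-in, not the paper's exact architecture: the real NeRF uses an 8-layer MLP with a skip connection and applies positional encoding to its inputs first.

```python
import torch
import torch.nn as nn

class TinyRadianceField(nn.Module):
    """Toy radiance field: maps a 3D point and a viewing direction
    to an RGB color and a volume density (sigma)."""
    def __init__(self, hidden=256):
        super().__init__()
        # Input: 3D position (x, y, z) + 3D unit view direction = 6 values.
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),          # 3 values for RGB, 1 for density
        )

    def forward(self, xyz, view_dir):
        out = self.mlp(torch.cat([xyz, view_dir], dim=-1))
        rgb = torch.sigmoid(out[..., :3])  # colors constrained to [0, 1]
        sigma = torch.relu(out[..., 3:])   # density is non-negative
        return rgb, sigma

# "What will I see at these points, looking from these directions?"
model = TinyRadianceField()
xyz = torch.rand(1024, 3)                                        # sample points in the scene
view_dir = nn.functional.normalize(torch.rand(1024, 3), dim=-1)  # unit view directions
rgb, sigma = model(xyz, view_dir)                                # per-point color and density
```

These per-point colors and densities are what the volumetric rendering step (sketched further below) combines into the final pixel color.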
Step 1: Multi-view input images
NeRF needs multiple images from different viewing angles to compress a scene; the MLP learns to interpolate between these images for unseen viewing directions (novel views). The viewing-direction information for each image is provided using the camera's intrinsic and extrinsic matrices. The more images spanning a wide range of viewing directions, the better the NeRF reconstruction of the scene. In short, the basic NeRF takes as input camera images and their associated camera intrinsic and extrinsic matrices. (You can learn more about the camera matrices in the blog below.)
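For concreteness, here is a small sketch of what those matrices look like in code; the numbers are made up purely for illustration:

```python
import numpy as np

# Intrinsic matrix K: maps 3D points in the camera frame to pixel coordinates.
# fx, fy are focal lengths in pixels, (cx, cy) is the principal point
# (illustrative values for a 640x480 image).
fx = fy = 500.0
cx, cy = 320.0, 240.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# Extrinsic (camera-to-world) matrix: a rotation R and a translation t that
# place the camera in the world. Here: an unrotated camera sitting at (0, 0, 2).
R = np.eye(3)
t = np.array([0.0, 0.0, 2.0])
cam_to_world = np.eye(4)
cam_to_world[:3, :3] = R
cam_to_world[:3, 3] = t

# A basic NeRF dataset is essentially a list of (image, K, cam_to_world) triples.
```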
Steps 2 to 4: Sampling, Pixel iteration, and Ray casting
Each image in the input set is processed independently (for the sake of simplicity). From the input, an image and its associated camera matrices are sampled. For each pixel of the camera image, a ray is traced from the camera center through the pixel and extended outwards. If the camera center is defined as o, and the viewing direction as the unit direction vector d, then the ray r(t) can be defined as r(t) = o + td, where t is the distance of the point r(t) from the camera center.
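A small sketch of that ray construction is shown below, reusing the hypothetical camera values from the earlier snippet (sign conventions for the camera axes vary between datasets):

```python
import numpy as np

def pixel_to_ray(u, v, K, cam_to_world):
    """Return the ray origin o and unit direction d in world coordinates
    for pixel (u, v), so that points on the ray are r(t) = o + t * d."""
    # Pixel -> direction in the camera frame (conventions vary; some
    # datasets use a camera that looks down the -z axis instead).
    dir_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Rotate into world coordinates and normalize so t measures distance.
    d = cam_to_world[:3, :3] @ dir_cam
    d = d / np.linalg.norm(d)
    # The ray starts at the camera center: the translation part of the pose.
    o = cam_to_world[:3, 3]
    return o, d

# Hypothetical camera, matching the previous snippet.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
cam_to_world = np.eye(4)
cam_to_world[:3, 3] = [0.0, 0.0, 2.0]

o, d = pixel_to_ray(320, 240, K, cam_to_world)   # ray for the central pixel
```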
Ray casting is done to identify the parts of the scene that contribute to the color of the pixel.
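Once a ray is cast, points are sampled along it, the MLP is queried at each sample, and the per-point colors and densities are composited into one pixel color. Below is a minimal numpy sketch of that compositing step (the standard NeRF-style quadrature), with random values standing in for the MLP outputs:

```python
import numpy as np

# Sample N points along a ray r(t) = o + t*d between a near and a far bound.
o = np.array([0.0, 0.0, 2.0])            # ray origin (camera center)
d = np.array([0.0, 0.0, -1.0])           # unit viewing direction
t = np.linspace(2.0, 6.0, 64)            # distances along the ray
points = o + t[:, None] * d              # (64, 3) sample positions

# In a real NeRF these come from querying the MLP at each sample point;
# random values stand in here just to show the shapes.
rgb = np.random.rand(64, 3)              # per-point color
sigma = np.random.rand(64)               # per-point volume density

# Volume rendering: alpha-composite the samples from front to back.
delta = np.diff(t, append=t[-1] + 1e10)                        # spacing between samples
alpha = 1.0 - np.exp(-sigma * delta)                           # opacity of each segment
trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # light surviving so far
weights = trans * alpha
pixel_color = (weights[:, None] * rgb).sum(axis=0)             # final RGB for this pixel
```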