For the past 25 years, real-time rendering has been driven by continuous hardware improvements. The goal has always been to create the best-fidelity image possible within 16 milliseconds. This has fueled significant innovation in graphics hardware, pipelines, and renderers.
However, the slowing pace of Moore's Law demands the invention of new computational architectures to keep pace with the growing demands of real-time applications. Similarly, as traditional graphics methods approach their limits, novel techniques are required to achieve further improvements in visual fidelity and performance. This creates a fundamental challenge: How can we continue improving real-time rendering without relying solely on traditional hardware advancements?
Neural shading represents an exciting new approach: integrating trainable models directly into the graphics pipeline to achieve unprecedented quality and performance. This technique leverages dedicated AI hardware, such as NVIDIA's Tensor Cores, to run these neural networks efficiently in real time.
In this blog, we'll help you understand the fundamentals and get started with this transformative technology.
What’s neural shading?
At its core, neural shading simply means making a part of the graphics pipeline trainable. This could apply to anything with parameters that you can train using machine-learning techniques, but the most promising candidates are small neural networks that are executed inline in shaders and work in tandem with the rest of the renderer.
These small networks can be executed extremely efficiently in real time, especially with hardware acceleration available through technologies like cooperative vectors. On one hand, this squeezes more efficiency out of existing hardware and gets more complexity on screen without relying on transistors getting smaller. On the other hand, making shaders trainable is enormously useful in its own right: neural shaders can tackle problems that are quite difficult to solve with traditional workflows. This adds a practical new tool to your graphics toolbox that works today.
How does this change the approach?
Traditional engineering involves understanding a problem, devising a solution, coding it, and executing it. However, some problems lack known solutions, or the solutions are too expensive to compute in real time. This is where optimization helps. Instead of solving the problem directly, we use known inputs and outputs to train a tunable mathematical model, iteratively adjusting its parameters until we reach an approximate solution that is useful in practice.
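To make that loop concrete, here's a minimal, framework-free sketch in Python (all names are illustrative): we fit a single parameter so that a tiny model reproduces known input/output pairs, nudging the parameter against the gradient of a squared error on each iteration.
import numpy as np

# Known inputs and the outputs we want the model to reproduce
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 * xs  # the "ideal" outputs; the true scale factor is 2

w = 0.0              # the tunable parameter, deliberately wrong to start
learning_rate = 0.05

for step in range(200):
    pred = w * xs                     # forward: run the model
    error = pred - ys
    loss = np.mean(error ** 2)        # measure the difference
    grad = np.mean(2.0 * error * xs)  # backward: d(loss)/dw
    w -= learning_rate * grad         # adjust the parameter

print(w)  # converges toward 2.0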
Modern neural shading can leverage powerful tools like Slang, a shading language emerging as a key technology in game development. Hosted by Khronos, the standards body that develops and maintains APIs like OpenGL and Vulkan, Slang offers broad platform compatibility, targeting HLSL, SPIR-V, Metal, and more. It incorporates modern language constructs like generics and, crucially for neural shading, supports automatic differentiation (autodiff), which automates complex calculus.
SlangPy is a Python interface to Slang. It provides a comprehensive, relatively low-level graphics API, offering access to core graphics constructs like compute buffers and textures. Highly cross-platform (targeting D3D 12, Vulkan, CUDA, and Metal), SlangPy also includes a functional API that allows direct calls to Slang shader functions from Python.
For hands-on learning, check out the Slang introduction lab from SIGGRAPH and the downloadable lab materials. You can also watch the Slang Birds of a Feather session for community discussions and insights.
Where to start? A simple mipmap example


Let's start with a concrete example to illustrate the concepts: the problem of mipmap generation. Traditional mipmaps work well for color textures like albedo maps, which downsample nicely even with simple box filters. However, maps that represent geometry or topology typically downsample very poorly, because you can't apply the same simple filter to geometry: you can't say that a peak next to a trough becomes a flat surface.
The naive approach causes artifacts such as noisy specular highlights, inventing surfaces that don't exist. This long-studied problem has analytical solutions, such as Toksvig's method, which filters normal maps by adjusting roughness based on the normal variance in each mipmap level, accounting for geometric complexity at different scales.
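To ground that, here's a minimal numpy sketch of the Toksvig idea (illustrative, not production code): averaging disagreeing unit normals shortens the result, and that shortening can be converted into a broader specular lobe by reducing the Blinn-Phong specular power.
import numpy as np

def toksvig_power(normals: np.ndarray, specular_power: float) -> float:
    """Adjust a Blinn-Phong specular power for a footprint of unit normals."""
    n_avg = normals.mean(axis=0)
    length = np.linalg.norm(n_avg)  # < 1 wherever the normals disagree
    # Toksvig factor: 1 for a flat region, small for high normal variance
    ft = length / (length + specular_power * (1.0 - length))
    return ft * specular_power

# A peak next to a trough: two strongly opposed normals
bumpy = np.array([[0.7, 0.0, 0.714], [-0.7, 0.0, 0.714]])
flat = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])

print(toksvig_power(flat, 64.0))   # ~64: the sharp highlight is kept
print(toksvig_power(bumpy, 64.0))  # ~2.4: variance widens the highlight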
While these analytical approaches work well for specific cases, they often require domain-specific knowledge and careful parameter tuning. Neural optimization offers a more general solution: we can generate mipmaps that minimize the difference between the downsampled rendering and a reference "ideal" mipmap, learning optimal representations without requiring explicit analytical derivations.
How the optimization works
The optimization process involves two phases:
- Forward phase: Render the ideal output using traditional methods, generate an output, and then measure the difference between the ideal and generated outputs.
- Backward phase: Calculate how to adjust the inputs to make the error smaller, using automatic differentiation.
The key insight is that we can use Slang's autodiff capabilities to automatically generate the backward derivatives of our entire rendering procedure at compile time. This is much faster and more convenient than manual differentiation, and it always keeps the backward derivative in sync when we change the forward code.
Here's a simple example of how a developer might approach this in Slang. Note that this is an illustrative example; it would need to be adapted to a specific use case before running, by providing the input texture data and an implementation of the BRDF function.
// Define our trainable mipmap parameters
struct MaterialParameters
{
    // 2D tensors of float3 texels, with gradient outputs attached
    GradOutTensor<float3, 2> albedo;
    GradOutTensor<float3, 2> normal;
};

// Our differentiable render function
[Differentiable]
float3 render(int2 pixel, MaterialParameters material, no_diff float3 light_dir, no_diff float3 view_dir)
{
    // Bright white light
    float light_intensity = 5.0;

    // Sample very shiny BRDF (it rained today!)
    float3 brdf_sample = sample_brdf( // assume we've implemented our BRDF elsewhere
        material.get_albedo(pixel),   // albedo color
        normalize(light_dir),         // light direction
        normalize(view_dir),          // view direction
        material.get_normal(pixel),   // normal map sample
        0.05,                         // roughness
        0.0,                          // metallic (no metal)
        1.0                           // specular
    );

    // Combine light with BRDF sample to get pixel color
    return brdf_sample * light_intensity;
}

// Simple box filter downsampling function
float3 downsample(
    int2 pixel,
    Tensor<float3, 2> source)
{
    float3 res = 0;
    res += source.getv(pixel * 2 + int2(0, 0));
    res += source.getv(pixel * 2 + int2(1, 0));
    res += source.getv(pixel * 2 + int2(0, 1));
    res += source.getv(pixel * 2 + int2(1, 1));
    return res * 0.25;
}

// Loss function comparing our mipmap to reference
[Differentiable]
float3 loss(
    no_diff int2 pixel,
    no_diff float3 reference,
    MaterialParameters material,
    no_diff float3 light_dir,
    no_diff float3 view_dir)
{
    float3 color = render(pixel, material, light_dir, view_dir);
    float3 error = color - reference;
    return error * error; // Squared error
}
And here's how to call this code with Python/SlangPy:
import slangpy as spy
import pathlib

# Create a device and load the Slang module
device = spy.create_device(
    include_paths=[
        pathlib.Path(__file__).parent.absolute(),
    ]
)
module = spy.Module.load_from_file(device, "example.slang")

# Load some materials.
albedo_map = spy.Tensor.load_from_image(device, "PavingStones070_2K.diffuse.jpg", linearize=True)
normal_map = spy.Tensor.load_from_image(device, "PavingStones070_2K.normal.jpg", scale=2, offset=-1)

def downsample(source: spy.Tensor, steps: int) -> spy.Tensor:
    for i in range(steps):
        dest = spy.Tensor.empty(
            device=device,
            shape=(source.shape[0] // 2, source.shape[1] // 2),
            dtype=source.dtype)
        module.downsample(spy.call_id(), source, _result=dest)
        source = dest
    return source

# Allocate a tensor for output + call the render function
output = spy.Tensor.empty_like(albedo_map)
module.render(pixel = spy.call_id(),
    material = {
        "albedo": albedo_map,
        "normal": normal_map,
    },
    light_dir = spy.math.normalize(spy.float3(0.2, 0.2, 1.0)),
    view_dir = spy.float3(0, 0, 1),
    _result = output)

# Downsample the output tensor.
output = downsample(output, 2)

# Save it to a file
output_filename = "render_output.png"
output.save_to_image(output_filename)
What we've done so far is render with our original input textures, downsample the result, and save it to a texture. We then need to calculate a result from some trainable parameters, and determine the difference between that result and our original. We'd like to train smaller input textures to achieve the same result as our full-resolution reference, so we'll start by calculating the loss from those:
# Loss between downsampled full res output (the reference),
# and result from quarter res inputs.
loss_output = spy.Tensor.empty_like(output)
module.loss(pixel = spy.call_id(),
    material = {
        "albedo": downsample(albedo_map, 2),
        "normal": downsample(normal_map, 2),
    },
    reference = output,
    light_dir = spy.math.normalize(spy.float3(0.2, 0.2, 1.0)),
    view_dir = spy.float3(0, 0, 1),
    _result = loss_output)
This code tells us how different our result is from what we want, but not how to adjust the parameters to reduce that loss. For that, we need to calculate the gradients, and this is where Slang's autodiff helps us. Let's add a function to the Slang code to do that:
void calculate_grads(
    int2 pixel,
    MaterialParameters material,
    MaterialParameters ref_material)
{
    // Assume we've implemented a properly random direction generator
    float3 light_dir = random_direction();
    float3 view_dir = random_direction();

    // Render the high-quality reference using our standard render function
    float3 reference = render(pixel, ref_material, light_dir, view_dir);

    // Backpropagate
    bwd_diff(loss)(pixel, reference, material, light_dir, view_dir, 1.0);
}
This new function uses the loss function that we defined to calculate gradients for every pixel in our material textures, by taking the derivative of that loss function with respect to each of those pixel inputs. Those gradients tell us how we need to update our input textures to reduce the loss.
The last step in the process is to update the input textures using those gradients, and then repeat, iteratively getting closer to our ideal. To do this, we'll need one more Slang function to perform the update. For this example, we can use an extremely simple one, but real-world projects typically use more sophisticated optimizers, such as the Adam optimizer.
void optimizer_step(inout float3 parameter, float3 derivative, float learning_rate)
{
    parameter -= learning_rate * derivative;
}
We can now repeat these two steps, calculating gradients and then using them to update our texture parameters, until we have trained a new, efficient mipmap. We do so with a simple Python loop:
for iteration in range(num_iterations):
    # Step 1: Calculate gradients via automatic differentiation
    module.calculate_grads(
        pixel=spy.call_id(),
        material=trainable_material,
        ref_material=reference_material
    )

    # Step 2: Update parameters using the optimizer
    # (assumes the albedo tensor has gradients attached, so they are
    # available as .grad)
    module.optimizer_step(
        parameter=trainable_material["albedo"],
        derivative=trainable_material["albedo"].grad,
        learning_rate=learning_rate
    )
    # Repeat for normal map...
Note that the same rendering code that we use to calculate the color of our pixels is also used to train the mipmap parameters. The compiler automatically generates the gradients for the entire texture, making it easy to train complex mipmap generation models.
While this is an intentionally minimalistic "toy code" example, you could integrate this approach into a real-time rendering project as an offline bake to learn better mipmaps for particularly difficult, non-linear maps. You could even train a shared model per material family and optionally fine-tune it for each asset.
Learning the fundamentals of neural networks in shaders


Moving beyond simple parameter optimization, we can embed entire neural networks directly in shaders. A neural network is essentially a mathematical function that can approximate complex relationships between inputs and outputs. Instead of writing explicit code to compute these relationships, we train the network to learn them automatically.
Why use neural networks in shaders?
Neural networks excel at several key tasks in graphics:
- Compression: A small network can represent complex textures or materials with far fewer parameters than traditional approaches.
- Approximation: They can approximate expensive computations (like complex lighting models) with simple, fast operations.
- Generalization: Once trained, they can handle variations and edge cases that would be difficult to program explicitly.
- Optimization: They can learn optimal solutions to problems where analytical solutions are unknown or too expensive.
The building blocks
The building block of any neural network is simple: inputs (floating-point values), weights (tunable parameters), biases (additional tunable parameters), and a nonlinear activation function. The network learns by adjusting these weights and biases to minimize the difference between its predictions and the desired outputs.
Continuing with our focus on texture representation as an example, we can create a simple network that takes texture UV coordinates as input and generates RGB color output. With just nine parameters (six weights and three biases), we can represent what would otherwise require 200,000 floats in a traditional texture.
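To see how small this function really is, here's the nine-parameter layer written out in plain numpy (an illustrative sketch; the Slang version follows below):
import numpy as np

# Nine parameters total: six weights (3 outputs x 2 inputs) plus three biases
weights = np.random.uniform(-0.5, 0.5, (3, 2)).astype(np.float32)
biases = np.zeros(3, dtype=np.float32)

def eval_texture(uv: np.ndarray) -> np.ndarray:
    """Map a 2D UV coordinate to an RGB color with a single tiny layer."""
    return np.tanh(weights @ uv + biases)

print(eval_texture(np.array([0.25, 0.75], dtype=np.float32)))  # one RGB triple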
For this specific network, we'll use the hyperbolic tangent (tanh()) as our activation function, a simple and common choice for neural networks. To train it, we use an optimization step built into our framework. We'll see that our Python NetworkParameters class has an optimize() method; this method is a wrapper that calls the adamOptimize() function in our Slang module. This adamOptimize() function is where the actual optimization algorithm is implemented and executed on the GPU: in this case, a basic version of the popular Adam optimizer.
Here’s a basic implementation of a neural network in Slang:
import slangpy;

// Simple activation function (tanh)
[Differentiable]
float activation(float x)
{
    return tanh(x);
}

// Simple Adam optimizer for a single parameter
void adamOptimize(
    inout float primal,   // The parameter to optimize
    inout float grad,     // The gradient
    inout float m_prev,   // First moment (running average of gradient)
    inout float v_prev,   // Second moment (running average of squared gradient)
    float learning_rate,  // Learning rate
    int iteration)        // Current iteration number
{
    const float ADAM_BETA_1 = 0.9;
    const float ADAM_BETA_2 = 0.999;
    const float ADAM_EPSILON = 1e-8;

    // Update first and second moments
    float m = ADAM_BETA_1 * m_prev + (1.0 - ADAM_BETA_1) * grad;
    float v = ADAM_BETA_2 * v_prev + (1.0 - ADAM_BETA_2) * (grad * grad);
    m_prev = m;
    v_prev = v;

    // Bias correction
    float mHat = m / (1.0f - pow(ADAM_BETA_1, iteration));
    float vHat = v / (1.0f - pow(ADAM_BETA_2, iteration));

    // Update parameter
    primal -= learning_rate * (mHat / (sqrt(vHat) + ADAM_EPSILON));

    // Reset gradient
    grad = 0;
}

// Network parameters with automatic differentiation support
struct NetworkParameters<int Inputs, int Outputs>
{
    RWTensor<float, 1> biases;
    RWTensor<float, 2> weights;
    AtomicTensor<float, 1> biases_grad;
    AtomicTensor<float, 2> weights_grad;

    [Differentiable]
    float get_bias(int neuron)
    {
        return biases.get({neuron});
    }

    [Differentiable]
    float get_weight(int neuron, int input)
    {
        return weights.get({neuron, input});
    }

    // The forward() pass and the derivative accessors that accumulate into
    // biases_grad and weights_grad follow the same pattern and are omitted
    // here for brevity.
}
And here's how to set up and train the network in Python:
import slangpy as spy
import numpy as np
import pathlib

# Create device and load the Slang module
device = spy.create_device(
    include_paths=[
        pathlib.Path(__file__).parent.absolute(),
    ]
)
module = spy.Module.load_from_file(device, "example.slang")

# Python wrapper for the Slang NetworkParameters struct
class NetworkParameters(spy.InstanceList):
    def __init__(self, inputs: int, outputs: int):
        super().__init__(module[f"NetworkParameters<{inputs},{outputs}>"])
        self.inputs = inputs
        self.outputs = outputs

        # Biases and weights for the layer.
        self.biases = spy.Tensor.from_numpy(device,
            np.zeros(outputs).astype('float32'))
        self.weights = spy.Tensor.from_numpy(device,
            np.random.uniform(-0.5, 0.5, (outputs, inputs)).astype('float32'))

        # Gradients for the biases and weights.
        self.biases_grad = spy.Tensor.zeros_like(self.biases)
        self.weights_grad = spy.Tensor.zeros_like(self.weights)

        # Temp data for Adam optimizer.
        self.m_biases = spy.Tensor.zeros_like(self.biases)
        self.m_weights = spy.Tensor.zeros_like(self.weights)
        self.v_biases = spy.Tensor.zeros_like(self.biases)
        self.v_weights = spy.Tensor.zeros_like(self.weights)

    # Calls the Slang 'adamOptimize' function for biases and weights
    def optimize(self, learning_rate: float, optimize_counter: int):
        module.adamOptimize(self.biases, self.biases_grad, self.m_biases,
                            self.v_biases, learning_rate, optimize_counter)
        module.adamOptimize(self.weights, self.weights_grad, self.m_weights,
                            self.v_weights, learning_rate, optimize_counter)

# Create network parameters for a layer with 2 inputs and 3 outputs
params = NetworkParameters(2, 3)
print(f"Created NetworkParameters with {params.inputs} inputs and {params.outputs} outputs")
print(f"Biases shape: {params.biases.shape}")
print(f"Weights shape: {params.weights.shape}")
print(f"Initial weights:\n{params.weights.to_numpy()}")
For more complex networks, you can easily add multiple layers:
// Multi-layer network for more complex texture generation
struct Network
{
    NetworkParameters<2, 32> layer0;
    NetworkParameters<32, 32> layer1;
    NetworkParameters<32, 3> layer2;

    [Differentiable]
    float3 eval(no_diff float2 uv)
    {
        float inputs[2] = {uv.x, uv.y};

        float output0[32] = layer0.forward(inputs);
        [ForceUnroll]
        for (int i = 0; i < 32; ++i)
            output0[i] = activation(output0[i]);

        float output1[32] = layer1.forward(output0);
        [ForceUnroll]
        for (int i = 0; i < 32; ++i)
            output1[i] = activation(output1[i]);

        float output2[3] = layer2.forward(output1);
        [ForceUnroll]
        for (int i = 0; i < 3; ++i)
            output2[i] = activation(output2[i]);

        return float3(output2[0], output2[1], output2[2]);
    }
}
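Hosting that multi-layer network from Python then follows the same pattern as the single layer. Here's a sketch that assumes the NetworkParameters wrapper from above is in scope (the backward pass itself is elided):
# One Python wrapper per Slang layer; shapes mirror the Slang struct
layers = [
    NetworkParameters(2, 32),
    NetworkParameters(32, 32),
    NetworkParameters(32, 3),
]

num_iterations = 1000
learning_rate = 0.001

# Start counting at 1 so the Adam bias correction is well-defined
for counter in range(1, num_iterations + 1):
    # ... evaluate the network and backpropagate the loss here ...
    # Then step every layer's weights and biases with Adam
    for layer in layers:
        layer.optimize(learning_rate, counter)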
The beauty of this approach is that the same autodiff infrastructure that worked for simple parameter optimization also works for neural network training. The compiler automatically generates the gradients for the entire network, making it easy to train complex texture generation models.
Key techniques for better results
Small networks require careful engineering to work well. The techniques that work best depend on your specific application; what helps with texture generation may not be optimal for material evaluation or lighting calculations. Here are some key techniques that can dramatically improve results for the texture example we've been working with:
- Activation functions: One commonly used activation function in machine learning is the rectified linear unit (ReLU), which simply passes any positive input through unchanged and outputs zero for any negative input. While computationally efficient and effective for many neural shading tasks, it can create piecewise-linear outputs due to its thresholding at zero, which can lead to visible triangular patterns in 2D texture applications. Smoother activations, like exponential functions, often provide better visual quality for texture generation. The choice of activation function depends on the specific use case.
// Some alternative activation functions
[Differentiable]
float3 smoothActivation(float3 x) {
    return exp(x); // Exponential activation for smoother output
}

[Differentiable]
float3 leakyReLU(float3 x) {
    return max(0.1 * x, x); // Leaky ReLU prevents dead neurons
}
- Leaky ReLU: When ReLU outputs zero (for negative inputs), the gradient becomes zero as well. This means that during backpropagation, no updates are sent back to the weights that feed into that neuron. If a neuron consistently receives negative inputs during training, it can become permanently "dead": outputting zero and never learning. This is particularly problematic in small networks, where losing even a few neurons can hurt performance. Leaky ReLU instead outputs a small negative value (typically 0.01 times the input), ensuring the gradient is never exactly zero, so the neuron continues learning even when its input is negative. The small negative slope keeps the neuron "alive" and responsive to gradient updates, as the short check below illustrates.
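A quick numpy check makes the dead-neuron problem visible (illustrative only): for negative pre-activations, ReLU's gradient is exactly zero, while Leaky ReLU still passes a small learning signal back.
import numpy as np

def relu_grad(x):
    return np.where(x > 0.0, 1.0, 0.0)    # exactly zero for negative inputs

def leaky_relu_grad(x, slope=0.01):
    return np.where(x > 0.0, 1.0, slope)  # never exactly zero

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu_grad(x))        # [0.   0.   1.   1.  ] -> negative side learns nothing
print(leaky_relu_grad(x))  # [0.01 0.01 1.   1.  ] -> every input keeps learning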
- Frequency encoding: Instead of directly feeding raw UV coordinates to a neural network, we first pass them through sines and cosines of different frequencies. This improves quality at little additional computational cost. Neural networks struggle to learn high-frequency patterns (fine details, sharp transitions) from low-dimensional inputs. By encoding coordinates as [sin(2πu), cos(2πu), sin(2πv), cos(2πv)], we provide the network with multiple frequency components. This enables it to learn both low-frequency (smooth) and high-frequency patterns, which is especially helpful for spatial inputs like UV coordinates. It's less useful for non-spatial inputs, where frequency content isn't a primary concern.
// Frequency encoding for better neural texture representation
float4 encodeUV(float2 uv) {
    float4 encoded;
    encoded.x = sin(uv.x * 2.0 * 3.14159);
    encoded.y = cos(uv.x * 2.0 * 3.14159);
    encoded.z = sin(uv.y * 2.0 * 3.14159);
    encoded.w = cos(uv.y * 2.0 * 3.14159);
    return encoded;
}
// Enhanced network with frequency encoding
[Differentiable]
float3 evaluateNetworkWithEncoding(NeuralNetwork net, float2 uv) {
    float4 encoded = encodeUV(uv);

    // Now use the 4D encoding as input instead of the raw 2D UV
    float3 output = float3(0.0, 0.0, 0.0);
    for (int i = 0; i < 3; i++) {
        output[i] = net.biases[i];
        for (int j = 0; j < 4; j++) {
            output[i] += net.weights[i * 4 + j] * encoded[j];
        }
    }
    return smoothActivation(output);
}
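The encodeUV() snippet above uses a single frequency. In practice you often stack several octaves, as the bullet describes; here's a hedged numpy sketch of a multi-frequency encoder (the names and octave count are illustrative):
import numpy as np

def encode_uv(uv: np.ndarray, octaves: int = 4) -> np.ndarray:
    """Encode a 2D UV as sines and cosines at doubling frequencies."""
    features = []
    for k in range(octaves):
        freq = (2.0 ** k) * 2.0 * np.pi  # 1x, 2x, 4x, 8x the base frequency
        features.append(np.sin(freq * uv))
        features.append(np.cos(freq * uv))
    return np.concatenate(features)      # 4 * octaves features from a 2D input

print(encode_uv(np.array([0.25, 0.75])).shape)  # (16,) with 4 octaves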
Hardware acceleration with cooperative vectors
Modern GPUs have dedicated Tensor Cores that can compute matrix multiplications very efficiently. However, using Tensor Cores requires cooperative execution, where all threads operate together to compute a matrix multiplication.
Cooperative vectors provide a convenient way to access this hardware. They let you write shader code as a normal matrix-vector multiplication, and the compiler automatically maps it to Tensor Core hardware without requiring explicit packing or uniform control flow.
Here's how to use cooperative vectors for neural network acceleration:
struct FeedForwardLayer<int InputSize, int OutputSize>
{
    ByteAddressBuffer weights;
    uint weightsOffset;
    ByteAddressBuffer biases;
    uint biasesOffset;

    CoopVec<float, OutputSize> eval(CoopVec<float, InputSize> input)
    {
        let output = coopVecMatMulAdd(
            input, CoopVecComponentType.Float32,                  // input and format
            weights, weightsOffset, CoopVecComponentType.Float32, // weights and format
            biases, biasesOffset, CoopVecComponentType.Float32,   // biases and format
            CoopVecMatrixLayout.ColumnMajor,                      // matrix layout
            false,                                                // is matrix transposed
            sizeof(float) * InputSize);                           // matrix stride
        return max(CoopVec<float, OutputSize>(0.0f), output);     // ReLU activation
    }
}
Real-world applications: What can you build?
The techniques we've covered form the foundation for many exciting applications in neural shading. Here are some of the most promising areas:
Neural texture compression (NTC)
Neural texture compression represents one of the most practical applications of neural shading. Traditional block-compression formats like BC1 and BC7 have fundamental limitations, but NTC can deliver much higher quality at similar compression rates, or much better compression at similar quality levels.
The key insight is to use a small neural network as a decoder, fed with low-precision latent textures and a positional encoding. This approach offers several benefits (a simplified decoder sketch follows the list below):
- Variable bit rates: Using a variable number of latent textures with low bit depth gives a wide range of encoding bit rates (0.5 to 20 bits per pixel).
- Independent decoding: Each pixel can be decoded independently, enabling direct sampling in shaders.
- No hallucinations: Unlike large image-generation models, small networks trained from scratch for each texture don't invent content that was never there, such as extra fingers.
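To make the decoder idea concrete, here's a heavily simplified numpy sketch. This is not the NTC library's actual architecture; every shape and name here is an assumption for illustration. It samples the latent textures at a pixel, appends a positional encoding, and runs a tiny MLP to reconstruct the texel, so each pixel decodes independently:
import numpy as np

def decode_texel(latents, pixel, w1, b1, w2, b2):
    """Reconstruct one RGB texel from latent textures plus positional encoding."""
    x, y = pixel
    latent_vec = np.array([lat[y, x] for lat in latents])  # per-pixel latent code
    # Toy positional encoding of the pixel position
    pos = np.array([np.sin(0.1 * x), np.cos(0.1 * x),
                    np.sin(0.1 * y), np.cos(0.1 * y)])
    inputs = np.concatenate([latent_vec, pos])
    hidden = np.maximum(0.0, w1 @ inputs + b1)  # small MLP with ReLU
    return w2 @ hidden + b2                     # decoded RGB value

rng = np.random.default_rng(0)
latents = [rng.standard_normal((64, 64)) for _ in range(4)]  # 4 latent channels
w1, b1 = rng.standard_normal((16, 8)), np.zeros(16)          # 8 inputs -> 16 hidden
w2, b2 = rng.standard_normal((3, 16)), np.zeros(3)           # 16 hidden -> RGB
print(decode_texel(latents, (10, 20), w1, b1, w2, b2))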
For implementation details and examples, see the NVIDIA neural texture compression library.
Neural materials
Neural materials represent another powerful application: learning complex, layered materials and distilling them into small networks that run significantly faster than the original shader code.
The approach involves training a network to take the light direction, viewing direction, and latent codes as input, and to produce the material color as output. For spatial variation, we train a texture of latent codes that we feed as an additional input.
The key innovation is using an encoder network during training that translates the original material textures into latent textures, then baking the result for runtime use. This approach scales to very high texture resolutions (4K and beyond) without the convergence problems of per-texel optimization.
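As a rough Python sketch of that runtime evaluation (the layer sizes, latent width, and names are assumptions for illustration, not the production architecture): the baked latent texture supplies a per-texel code, which is fed to a small MLP together with the light and view directions.
import numpy as np

def eval_material(latent_texture, uv, light_dir, view_dir, w1, b1, w2, b2):
    """Evaluate a learned material: directions + latent code -> material color."""
    h, w = latent_texture.shape[:2]
    texel = latent_texture[int(uv[1] * (h - 1)), int(uv[0] * (w - 1))]
    inputs = np.concatenate([light_dir, view_dir, texel])
    hidden = np.maximum(0.0, w1 @ inputs + b1)  # small MLP with ReLU
    return w2 @ hidden + b2                     # learned material response

rng = np.random.default_rng(1)
latent_texture = rng.standard_normal((256, 256, 8))   # baked latent codes
w1, b1 = rng.standard_normal((32, 14)), np.zeros(32)  # 3 + 3 + 8 = 14 inputs
w2, b2 = rng.standard_normal((3, 32)), np.zeros(3)
print(eval_material(latent_texture, (0.5, 0.5),
                    np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 1.0]),
                    w1, b1, w2, b2))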
Beyond textures and materials
The principles of neural shading extend far beyond these examples. You can apply similar techniques to:
- Lighting calculations: Approximate complex lighting models with fast neural approximations.
- Post-processing effects: Learn optimal tone mapping, color grading, or stylization effects.
- Geometry processing: Generate or modify geometry procedurally with neural networks.
- Animation: Create smooth interpolations or procedural animations.
- Procedural generation: Generate content algorithmically with learned patterns.
- De-noising for ray tracing: Reduce noise in ray-traced images.
- Animation compression: Compress animation data efficiently with learned representations.
- Mesh simplification: Simplify 3D meshes while preserving detail in the final appearance.
The key is identifying where you have expensive computations or complex relationships that could benefit from neural approximation.
The future: Why neural shading matters
Neural shading represents more than just a new technique; it's a fundamental shift in how we think about real-time graphics. By making shaders trainable, we open up new possibilities:
- Quality-performance tradeoffs: Networks can be easily adjusted for different quality levels, enabling natural LOD systems.
- Extensible features: Additional learned components can be easily added to networks for tasks like importance sampling or filtering.
- Platform flexibility: The same neural assets can work across different hardware capabilities, using inference on sample on capable hardware and transcoding on less-capable platforms.
The key to success is building up a mental toolbox of optimization techniques and debugging skills specific to neural shading. As with any new technology, there's a learning curve, but the rewards are substantial.
Getting started: Your next steps
The neural shading ecosystem is maturing rapidly. Here are the key tools and libraries you need to get started:
To get started quickly with Slang and autodiff, you can try it out in your browser on the Slang Playground. For comprehensive resources and tools, explore the NVIDIA RTX Kit, which includes support for neural shading technologies. To dive deeper into the concepts covered in this guide, watch the neural shading course NVIDIA presented at SIGGRAPH.
At the Graphics Programming Conference (GPC) next week, be sure to check out our Neural Shading for Real-Time Graphics and Path Tracing in Doom the Dark Ages sessions.
The technology is ready for production use today. Whether you're working on texture compression, material systems, or entirely new applications, neural shading provides a powerful new tool for achieving higher quality and better performance in real-time graphics.
See our full list of game developer resources here, and follow us to stay up to date with the latest NVIDIA game development news.
