AI developer activity on PCs is exploding, driven by the rising quality of small language models (SLMs) and diffusion models such as FLUX.2, GPT-OSS-20B, and Nemotron 3 Nano. At the same time, AI PC frameworks, including ComfyUI, llama.cpp, Ollama, and Unsloth, are making functional advances, doubling in popularity over the past year as the number of developers using PC-class models has grown tenfold. Developers are no longer just experimenting with generative AI workflows; they’re building the next-generation software stack on NVIDIA GPUs, from the data center to NVIDIA RTX AI PCs.
At CES 2026, NVIDIA is announcing several new updates for the AI PC developer ecosystem, including:
- Acceleration for the top open source tools on PC: llama.cpp and Ollama for SLMs, along with ComfyUI for diffusion models.
- Optimizations to the top open source models for NVIDIA GPUs, including the new LTX-2 audio-video model.
- A collection of tools to accelerate agentic AI workflows on RTX PCs and NVIDIA DGX Spark.
Accelerated inference through open source AI frameworks
NVIDIA collaborated with the open-source community to boost inference performance across the AI PC stack.
Continued performance improvements on ComfyUI
On the diffusion front, ComfyUI optimized performance on NVIDIA GPUs through PyTorch-CUDA and enabled support for the NVFP4 and FP8 formats. These quantized formats enable memory savings of 60% and 40%, respectively, and accelerate performance. Developers will see an average of 3x performance with NVFP4 and 2x with FP8.
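As a rough illustration of what those savings mean for fitting a model into VRAM, the sketch below applies the 60% and 40% figures to a 16-bit model size. The 24 GB baseline is hypothetical, and the sketch assumes the quoted savings apply to the whole memory footprint relative to a 16-bit baseline; treat the output as an estimate, not a measurement.

```python
# Memory savings quoted in the text for each quantized format.
SAVINGS = {"NVFP4": 0.60, "FP8": 0.40}


def quantized_footprint_gb(baseline_gb: float, fmt: str) -> float:
    """Estimate the footprint after quantization from a 16-bit baseline size."""
    return baseline_gb * (1.0 - SAVINGS[fmt])


if __name__ == "__main__":
    baseline_gb = 24.0  # hypothetical 16-bit model size, for illustration only
    for fmt in SAVINGS:
        print(f"{fmt}: ~{quantized_footprint_gb(baseline_gb, fmt):.1f} GB")
```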


Updates to ComfyUI include:
- NVFP4 support: Linear layers can run using the NVFP4 format with optimized kernels, delivering 3–4x higher throughput compared with FP16 and BF16 linear layers.
- Fused FP8 quantization kernels: Boost model performance by eliminating memory-bandwidth-bound operations.
- Fused FP8 de-quantization kernels: Performance for FP8 workloads is further improved on NVIDIA RTX GPUs without fourth-generation Tensor Cores (pre-NVIDIA Ada).
- Weight streaming: Leveraging concurrent system memory and CPU compute streams, weight streaming hides memory latency and increases throughput, especially on GPUs with limited VRAM.
- Mixed precision support: Models can mix multiple numerical formats inside a single network, enabling fine-grained tuning for optimal accuracy and performance.
- RMS & RoPE Fusion: Common, memory-bandwidth-limited operators in diffusion transformers are fused to reduce memory usage and latency. This optimization benefits all DiT models across data types.
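To make the last item concrete, here is a plain-Python CPU reference of the two operators it names: RMS normalization and rotary position embedding (RoPE). This is not ComfyUI's kernel code. The function names, the interleaved-pair RoPE convention, and the `base` value are assumptions chosen for illustration. The fused variant computes the same result in one pass without materializing the normalized intermediate, which is the kind of memory traffic that fusion avoids on a GPU.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: scale x by the reciprocal of its root mean square, then by weight."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def rope(x, position, base=10000.0):
    """Rotate consecutive pairs (x[i], x[i+1]) by a position-dependent angle."""
    d = len(x)
    out = [0.0] * d
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out


def fused_rms_norm_rope(x, weight, position, eps=1e-6, base=10000.0):
    """Same math in a single loop over pairs, with no intermediate normalized list."""
    d = len(x)
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / d + eps)
    out = [0.0] * d
    for i in range(0, d, 2):
        a = x[i] * inv_rms * weight[i]
        b = x[i + 1] * inv_rms * weight[i + 1]
        angle = position * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```

Calling `rope(rms_norm(x, w), pos)` and `fused_rms_norm_rope(x, w, pos)` on the same even-length vector should agree to floating-point precision.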
The sample code for the optimizations is available in the ComfyUI kitchen repository. NVFP4 and FP8 checkpoints are also available on Hugging Face, including the new LTX-2, FLUX.2, FLUX.1-dev, FLUX.1-Kontext, Qwen-Image, and Z-Image.
Acceleration on RTX AI PCs for llama.cpp and Ollama
For SLMs, token generation throughput on mixture-of-experts (MoE) models has increased by 35% on llama.cpp on NVIDIA GPUs, and by 30% on Ollama on RTX PCs.


Jan’26 builds are run with the following environment variables and flags: GGML_CUDA_GRAPH_OPT=1, FA=ON, and --backend-sampling
Updates to llama.cpp include:
- GPU token sampling: Offloads several sampling algorithms (TopK, TopP, Temperature, minK, minP, and multi-sequence sampling) to the GPU, improving quality, consistency, and accuracy of responses, while also increasing performance.
- Concurrency for QKV projections: Support for running concurrent CUDA streams to speed up model inference. To use this feature, set GGML_CUDA_GRAPH_OPT=1.
- MMVQ kernel optimizations: Pre-loads data into registers and hides delays by increasing GPU utilization on other tasks, to speed up the kernel.
- Faster model loading time: Up to 65% model load time improvements on DGX Spark, and 15% on RTX GPUs.
- Native MXFP4 support on NVIDIA Blackwell GPUs: Up to 25% faster prompt processing on LLMs using the hardware-level NVFP4 capability of the fifth-generation Tensor Cores on Blackwell GPUs.
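For readers unfamiliar with the sampling steps named in the first item, the following CPU reference shows what temperature, top-k, top-p, and min-p each compute. It is a simplified sketch, not llama.cpp's implementation: the function names and default values are hypothetical, and it applies every filter to the same distribution and intersects the results, whereas a real sampler chain may apply the filters one after another.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Keep the ids of the k most likely tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:k])


def top_p(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in ranked:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept


def min_p(probs, ratio):
    """Keep tokens whose probability is at least ratio times the top probability."""
    threshold = ratio * max(probs)
    return {i for i, pr in enumerate(probs) if pr >= threshold}


def sample(logits, temperature=0.8, k=40, p=0.95, ratio=0.05, rng=random):
    """Intersect the three filters, then draw one token id weighted by probability."""
    probs = softmax(logits, temperature)
    allowed = sorted(top_k(probs, k) & top_p(probs, p) & min_p(probs, ratio))
    weights = [probs[i] for i in allowed]
    return rng.choices(allowed, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the draw repeatable, which is useful when comparing sampler settings.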
Updates to Ollama include:
- Flash attention by default: Now standard on many models. This method uses “tiling” to compute attention in smaller blocks, reducing the number of transfers between GPU VRAM and on-chip SRAM to boost inference speed and memory efficiency.
- Memory management scheme: A new scheme allocates additional memory to the GPU, increasing token generation and processing speeds.
- LogProbs added to the API: Unlocks additional developer capabilities for use cases like classification, perplexity calculations, and self-evaluation.
- The latest optimizations from the upstream GGML library.
Check out the llama.cpp repository and the Ollama repository to get started, and test them in apps like LM Studio or the Ollama App.
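As an example of the perplexity use case mentioned for LogProbs, the sketch below computes perplexity from a list of per-token log probabilities. It assumes the values are natural-log probabilities that have already been extracted from an API response; the response format itself is not shown here, and the function name is hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity is exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


if __name__ == "__main__":
    # Hypothetical log probabilities for a four-token completion.
    example = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
    print(f"perplexity: {perplexity(example):.3f}")
```

Lower values mean the model found the text more predictable, so the same function can be used to compare candidate completions or prompts.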
New advanced audio-video model on RTX AI PC
NVIDIA and Lightricks are releasing the LTX-2 model weights. LTX-2 is an advanced audio-video model that competes with cloud models and that you can run on your RTX AI PC or DGX Spark. It is an open, production-ready audio-video foundation model delivering up to 20 seconds of synchronized AV content at 4K resolution. It can offer frame rates of up to 50 fps and provides multi-modal control for high extensibility for developers, researchers, and studios.
The model weights are available in BF16 and FP8. The quantized checkpoint delivers 30% memory reduction, enabling the model to run efficiently on RTX GPUs and DGX Spark.
In the past weeks, we’ve also seen dozens of new models being released, each pushing the frontier of generative AI.


The use cases for private, local agents are countless. But building reliable, repeatable, and high-quality private agents remains a challenge. LLM quality deteriorates when you distill and quantize the model to fit within a limited VRAM budget on PC. The need for accuracy increases as agentic workflows require reliable and repeatable answers when interfacing with other tools or actions.
To address this, developers typically use two tools to increase accuracy: fine-tuning and retrieval-augmented generation (RAG). NVIDIA released updates to accelerate tools across this workflow for building agentic AI.
Nemotron 3 Nano is a 32B parameter MoE model optimized for agentic AI and fine-tuning. With 3.6B active parameters and a 1M context window, it tops several benchmarks across coding, instruction-following, long-context reasoning, and STEM tasks. The model is optimized for RTX PCs and DGX Spark via Ollama and llama.cpp, and can be fine-tuned using Unsloth.
This model stands out for being the most open, with weights, recipes, and datasets widely available. Open models and datasets make customizing the model easier for developers. They prevent redundant fine-tuning and eliminate data leakage, allowing objective benchmarking and more robust, efficient workflows. Get started with LoRA-based fine-tuning for it.
For RAG, NVIDIA partnered with Docling, a package to ingest, analyze, and process documents into a machine-understandable language for RAG pipelines. Docling is optimized for RTX PCs and DGX Spark and delivers 4x performance compared with CPUs.
There are two ways of using Docling:
- Traditional OCR pipeline: This is a pipeline of libraries and models that’s accelerated via PyTorch-CUDA on RTX.
- VLM-based pipeline: An advanced pipeline for complex multimodal documents, available for use via vLLM within WSL and Linux environments.
Docling is developed at IBM and contributed to the Linux Foundation. Get started now on RTX with this easy-to-use guide.
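Once documents have been converted to text, the retrieval step of a RAG pipeline can be sketched as follows. This example does not use Docling or an embedding model: bag-of-words cosine similarity stands in for learned embeddings, and the function names, sample chunks, and prompt layout are all hypothetical. It only illustrates the flow of ranking chunks against a query and placing the best matches into the prompt.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_n=2):
    """Return the top_n chunks ranked by similarity to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True)
    return ranked[:top_n]


def build_prompt(query, chunks, top_n=2):
    """Place the retrieved chunks ahead of the question as context for the model."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, top_n))
    return f"Context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    # Hypothetical chunks standing in for text extracted from documents.
    docs = [
        "The warranty covers battery replacement for two years.",
        "Shipping takes five business days.",
        "Returns are accepted within thirty days.",
    ]
    print(build_prompt("How long is the battery warranty?", docs, top_n=1))
```

In a real pipeline the similarity function would typically be replaced by an embedding model and a vector index, while the overall shape of retrieve-then-prompt stays the same.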
SDKs for audio and video effects
The NVIDIA Video and Audio Effects SDKs enable developers to apply AI effects to multimedia pipelines that enhance quality using features such as background noise removal, virtual background, or eye contact.
The latest updates at CES 2026 enhance the video relighting feature to produce more natural and stable results across diverse environments, while improving performance by 3x (reducing the minimum GPU required to run it to an NVIDIA GeForce RTX 3060 or above) and decreasing the model size by up to 6x. To see the Video Effects SDK with AI relighting in action, try the new release of the NVIDIA Broadcast app.
We’re excited to collaborate with the open-source community of AI PC tools to deliver models, optimizations, tools, and workflows for developers. Start developing for RTX PCs and DGX Spark today!
