Using Accelerated Computing to Live-Steer Scientific Experiments at Massive Research Facilities



Scientists and engineers who design and build unique scientific research facilities face similar challenges: managing massive data rates that exceed the capability of current computational infrastructure to extract scientific insights, and driving experiments in real time. These challenges are obstacles to maximizing the impact of scientific discoveries and significantly slow the pace of discovery.

Scientists and engineers at NVIDIA work with these facilities to develop new solutions, built on parallel and distributed computation, that remove these blockers. This post walks through two notable examples of formalizing complex physics problems into tractable mathematical puzzles that benefit greatly from GPU-accelerated scientific computing, at two U.S. Department of Energy facilities: the NSF-DOE Vera C. Rubin Observatory and SLAC's Linac Coherent Light Source II (LCLS-II).

These unique, massive-scale research facilities each took a decade to construct and enable unprecedented scientific discoveries that serve worldwide scientific communities. NVIDIA accelerated computing, along with the GPU-accelerated Python libraries CuPy and cuPyNumeric, is enabling live feedback for experiment steering, which was previously unattainable. The team built Accelerated Space and Time Image Analysis (ASTIA) to process real-time "movies" of the southern sky, and X-ray Analysis for Nanoscale Imaging (XANI), using cuPyNumeric and CuPy, to attain real-time steering of LCLS-II experiments.

Data analyses that previously took nine months were accomplished in four hours.

Astrophysics and ultrafast X-ray science

Breakthroughs in experimental instrumentation have enabled extremely high data-acquisition rates, capturing more objects than ever before on their intrinsic time and length scales.

At the Vera C. Rubin Observatory, for the first time, astrophysicists and astronomers can capture the entire southern sky and discover 2,000+ new asteroids per night using a 3.2-billion-pixel camera. Meanwhile, at LCLS-II, scientists and engineers drive electrons, which are converted to photons, along a 3-km tunnel to make movies of materials on the atomic scale using ultrafast X-ray bursts.

Astrophysics: The NSF-DOE Vera C. Rubin Observatory's LSST camera will produce 20 terabytes of images per night and operate continuously for 10 years to map the entire southern sky every three to four nights. Over the course of a month or more, the LSST camera accumulates petabytes of data that will be used to create a 10-year time-lapse movie of the universe.

X-ray science: The LCLS-II produces the most powerful X-ray pulses, up to 1 million bursts per second, increasing brightness by a factor of 10,000 compared with the original LCLS. This enables mapping the fastest and smallest motions of electrons and atoms inside matter. LCLS-II produces petabyte-scale X-ray data within days to make movies of quantum phenomena and provide unprecedented insights into how matter behaves.

The Linac Coherent Light Source long X‑ray particle accelerator tunnel. 
Figure 1. The Linac Coherent Light Source at SLAC has the world’s longest X-ray particle accelerator tunnel, making data available at unprecedented speed and volume. Image credit: SLAC National Accelerator Laboratory

Common challenge: The demand for real-time analysis of massive datasets requires both computational speed and memory beyond traditional systems. Accelerated computing provides the speed of computation, but distributed systems are still needed to process problems of this size. By using HPC systems with acceleration and specialized networking, scientists can meet these demands. With cuPyNumeric, programmers can use a single programming model that runs on traditional systems while exploiting modern hardware features.
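The single-programming-model point rests on one idea: code written against the NumPy API can move between CPU and accelerated backends by changing only the import. A minimal sketch of the pattern (the `background_subtract` helper and the toy data are illustrative, not taken from either facility's pipeline):

```python
import numpy as np  # swap for `import cupynumeric as np` to target GPUs; the rest is unchanged

def background_subtract(frames):
    """Subtract the per-pixel median background from a stack of frames."""
    background = np.median(frames, axis=0)
    return frames - background

# Four toy 2x3 frames; a real pipeline applies the same call to terabyte stacks.
frames = np.arange(24, dtype=np.float64).reshape(4, 2, 3)
residual = background_subtract(frames)
print(residual.shape)  # (4, 2, 3)
```

The same source file can then be developed on a laptop and redeployed on a multi-GPU system without a rewrite, which is the workflow both facilities rely on.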

Towards full workflow automation: Both facilities are moving beyond batch analysis, favoring modular, highly parallel pipelines that execute reliably regardless of experiment size. Data movement, transformation, and extraction are automated to the degree that human oversight is focused on hypothesis and interpretation, rather than manual intervention or IT tuning.

Solutions: NVIDIA accelerated computing coupled with the GPU-accelerated Python libraries CuPy and cuPyNumeric is enabling live feedback for experiment steering, which was previously unattainable due to excessively long computations. Now, by running these same scientific analysis pipelines on NVIDIA DGX Grace Hopper and NVIDIA Blackwell, NVIDIA DGX Spark, and NVIDIA RTX PRO, researchers are gaining powerful new benefits for both performance and collaboration.

Data analyses that previously took nine months are now possible in four hours through cleverly formulated equations solved with distributed computation on GPUs. Unified memory, available in the NVIDIA GH200 Grace Hopper Superchip and NVIDIA Blackwell architecture, unlocks massive problem sizes through GPU acceleration to extract physics parameters quickly. These parameters are used to train AI models for autonomous experiments and science analyses at unprecedented speed.

Vera C. Rubin Observatory accelerated workflow and prompt processing

The LSST traverses the sky in space and time with a 3.2-gigapixel camera to capture the southern sky, producing up to 20 TB of images per night. Every night, the camera will discover 2,000+ new asteroids that have never been seen before. The principal scientific goals include:

  • Tracking billions of celestial objects with precise time-resolved measurements.
  • Detecting and classifying transient phenomena that have never been observed before (for example, supernovae, near-Earth objects, and variable stars).
  • Searching for signatures of dark matter and dark energy in the ever-expanding universe.
  • Creating a year-round repository of the objects of the entire southern sky and their locations in space and time, and sending alerts to a worldwide network of broker platforms and astronomical telescopes to acquire more detailed follow-up observations of individual stars, galaxies, and black holes.

To date, the astrophysics and astronomy communities have jointly developed an open source, CPU-based data processing pipeline that processes data in up to 10 minutes. The acquisition timescale for each image is 40 seconds. Live data processing, to promptly send alerts to telescopes around the world and steer observation decisions, requires accelerated computing.

These steps require advanced image calibration, basis construction, convolutions, subpixel differencing, pattern extraction, and real-time statistical inference on data streams too large for the current CPU cluster processing workflow developed by scientists and engineers from the worldwide astrophysics and astronomy communities.

To achieve these goals on an accelerated timescale and enable greater complexity in data processing operations, scientists and engineers at NVIDIA, Princeton University, and SLAC are developing an accelerated GPU workflow called Accelerated Space and Time Image Analysis (ASTIA). This workflow includes:

  • Calibration and basis construction: Rapidly calibrate massive CCD data to remove artifacts and distortions, and construct basis functions for each acquired image to enable coordinate mapping and transformations.
  • Chained transformation: Warping, convolutions, background and image subtractions, object movement, and error calculations (through CuPy) are benchmarked on both NVIDIA Grace Hopper and NVIDIA Grace Blackwell.
  • Parallelization: Parallel prompt processing (mapping, object detection, fitting, and cataloging) running as batch or interactive sessions. Numerical computations occur in milliseconds instead of minutes.
  • Packaging and broker alert: Catalog new objects, orbit information, and coordinates, and issue global alerts within seconds to the worldwide LSST community.
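The chained-transformation step centers on difference imaging: PSF-match a reference template to the incoming science image, subtract, and search the residual for transients. A CPU-only toy version with SciPy can sketch the idea (the Gaussian PSF, image size, and injected source are invented for illustration; the accelerated path would use the CuPy equivalents in `cupyx.scipy.ndimage`):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Toy difference-imaging step: PSF-match the template to the science image,
# subtract, and locate the transient in the residual.
rng = np.random.default_rng(0)
template = rng.normal(size=(64, 64))            # reference image of the field
science = gaussian_filter(template, sigma=1.5)  # same field, different seeing
science[32, 32] += 50.0                         # a new transient appears

matched = gaussian_filter(template, sigma=1.5)  # PSF-match the template
difference = science - matched                  # static sky cancels; transients survive

peak = np.unravel_index(np.argmax(difference), difference.shape)
print(peak)  # (32, 32)
```

Each of these array operations (filtering, subtraction, argmax) maps directly onto GPU kernels, which is where the measured millisecond-scale runtimes in Figure 2 come from.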
A flow chart demonstrates the 86x to 180x acceleration of prompt processing for the Vera Rubin Observatory using NVIDIA H200 and GH200 GPUs, reducing the initial two steps (GetTemplate and Alard Lupton) from 8.4 and 12.5 seconds to 46 and 146 milliseconds, respectively, to enable rapid generation and distribution of astrophysical alerts.
Figure 2. Prompt processing workflow for astrophysical alert production and live steering of the LSST camera on CPU versus GPU

LCLS-II: Scaling with parallel and distributed computation

At LCLS-II, ultrafast X-ray pulses generate movies of atomic and electronic dynamics inside materials and molecules. Major science challenges include:

  • Capturing 3D X-ray movies across tens of terabytes in a single session
  • Characterizing defects, phonon dispersions, crystal structures, electron distributions, and quantum phenomena from scattered X-ray patterns at rapid cadence
  • Delivering live feedback for experiment steering, so scientists can adjust parameters in real time to catch rare dynamic states

This requires processing and analyzing data at the single-pixel, single-event level, with mathematical models that can detect and reconstruct complex atomic motions, all under stringent time constraints. In essence, it enables researchers to watch atoms move in real time.

Ultrafast X-ray Analysis for Nanoscale Imaging (XANI) workflow

At LCLS, NVIDIA and SLAC scientists and engineers developed a pipeline to concurrently process X-ray frames, fit physical models pixel-wise, and rapidly reconstruct 3D phonon dispersions to extract the thermal, optical, and electrical properties of materials. The analysis leverages pattern matching, nonlinear fitting, and large-scale reduction to summarize experiment outcomes in a form meaningful for real-time scientific inference and automated instrument steering.

A multipanel figure illustrating the scientific discovery cycle at the Linac Coherent Light Source (LCLS), which proceeds from experiment setup to data collection and analysis, and is supported by a performance chart demonstrating 1,100x computational acceleration, reducing processing time from 15 hours (CPU baseline) to 0.5 seconds using 128 GPUs.
Figure 3. LCLS nanoscale science discovery workflow, XANI acceleration of nanoscale imaging. Demonstrated accelerated computing runtime performance on CPUs versus GPUs 

How does XANI speed up the stack?

  • Data ingestion: High-throughput connections rapidly transfer images or experiment data to a local cluster, supercomputer, or local DGX Spark storage.
  • Parallelization: cuPyNumeric achieves efficient parallelization across available resources by strategically partitioning the global data arrays. It then distributes computations by mapping operations on these sub-partitions to separate processing units. The runtime also decomposes the scientific code into a dependency-driven task graph, which enables implicit parallelism and dynamic scheduling of work across all allocated resources.
  • Operator chains: XANI executes complex transformation graphs (sum, convolution, basis change) as a series of kernels, reducing latency and memory-movement overhead. Interoperability through Python tasks enables embedding of third-party single-GPU Python libraries (CuPy, for example) for data-parallel operations.
  • Distributed scaling: cuPyNumeric enables array and matrix computations to scale from desktop to thousand-GPU clusters, handling datasets that exceed a single node's memory, all natively in Python.
  • Collaboration and control: Researchers access their environment and computational results interactively, monitor GPU/CPU utilization, and profile performance with built-in tools.
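The partitioning strategy above is easiest to see on a pixel-wise model fit expressed as one vectorized linear-algebra call over the whole (pixels x frames) array, the kind of operation a distributed runtime can split across devices. A noise-free toy sketch (the exponential-decay model and array sizes are invented for illustration; whether every routine below is distributed by cuPyNumeric is an assumption to verify against its documentation):

```python
import numpy as np  # the code is written so `import cupynumeric as np` is the only change

# Each "pixel" records an exponentially decaying intensity over 8 frames.
t = np.linspace(0.0, 1.0, 8)                  # frame timestamps
true_rates = np.array([1.0, 2.0, 3.0])        # one decay rate per pixel
frames = np.exp(-np.outer(true_rates, t))     # shape (pixels, frames)

# log I = -rate * t, so a linear least-squares fit of log-intensity vs. time
# recovers every pixel's rate in one batched call, with no per-pixel Python loop.
A = np.stack([t, np.ones_like(t)], axis=1)    # design matrix, shape (frames, 2)
coeffs, *_ = np.linalg.lstsq(A, np.log(frames).T, rcond=None)
rates = -coeffs[0]
print(rates)  # approximately [1. 2. 3.]
```

Because the fit is a single array expression rather than a loop over pixels, the runtime is free to partition the pixel axis across GPUs and schedule the pieces independently.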

Accelerated computation enables physics-informed AI training

The CUDA Python stack provides an integrated solution for:

  • Developing accelerated mathematical kernels and functions that are widely compatible with the Python ecosystem when no existing solution is available.
  • CuPy offers GPU-compatible NumPy and SciPy interfaces to enable parallelism on a single GPU to speed up numerical computations.
  • cuPyNumeric delivers a familiar NumPy/SciPy interface, which enables distribution of computation across multiple GPUs and nodes using advanced runtime management.
  • XANI uses high-performance array operations and transformation chains, optimized for tasks like matrix math, subpixel warping, and polynomial projection. This package accelerates ultrafast X-ray characterization with GPU kernels and advanced workflow integration.
  • All of the above code is optimized to run on servers based on Grace Hopper and Grace Blackwell. For individual testing and development, running it on DGX Spark or RTX PRO delivers accelerated results compared with CPU systems.

Tips for using GPUs and CUDA Python for science

To use GPUs and CUDA Python to solve scientific problems, follow these strategies:

  • Identify the key scientific questions, followed by the relevant mathematical operations and models that can be solved. Develop a workflow to process the raw data and solve for the models using NumPy, then port to CuPy locally for parallelization. For thousands to billions of computations that require multinode systems, introduce cuPyNumeric to distribute the same code across multiple GPUs and nodes, leveraging the patterns discussed in this post.
  • For ultrafast X-ray and other pixel-wise, model-fitting workloads, XANI provides an open, Python-based pipeline that wraps high-performance GPU kernels and uses cuPyNumeric to distribute vectorized tasks over available resources and schedule them across many GPUs. Interested teams can clone XANI, treat it as a reference design, and adapt their own domain-specific steps, such as data ingestion, operator graphs, fitting, and reductions, to run with cuPyNumeric distributed execution for cluster-scale acceleration.
  • The same software stacks (CuPy, cuPyNumeric, and XANI) run on a spectrum of NVIDIA hardware, from workstation- and desktop-class systems such as NVIDIA DGX Spark and NVIDIA RTX PRO, through 8-way servers, to NVIDIA DGX SuperPODs equipped with NVIDIA Grace Hopper and NVIDIA Grace Blackwell platforms, with unified memory simplifying the handling of datasets larger than a single device. This means developers and researchers can begin by reproducing similar scaled-down workflows on a laptop, workstation, single DGX Spark, or small lab cluster, then move unchanged code to cloud or larger on-premises DGX systems, using open repos as templates and focusing effort on domain logic rather than rewriting for new hardware.
  • Adopt CUDA Python to achieve fast processing and live-steering of scientific instruments and extract scientific insights in seconds.
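One way to follow the first tip, porting NumPy code to CuPy without maintaining two versions, is to parameterize the analysis on the array module; CuPy mirrors the NumPy namespace, and `cupy.get_array_module` can pick the module from an input array automatically. A sketch of the pattern (`radial_profile` is a made-up example function, not part of ASTIA or XANI):

```python
import numpy as np

def radial_profile(xp, image):
    """Mean intensity as a function of integer radius from the image center.

    `xp` is the array module: pass numpy for CPU or cupy for GPU execution.
    """
    ny, nx = image.shape
    y, x = xp.indices((ny, nx))
    r = xp.hypot(x - nx // 2, y - ny // 2).astype(int)
    totals = xp.bincount(r.ravel(), weights=image.ravel())
    counts = xp.bincount(r.ravel())
    return totals / counts

profile = radial_profile(np, np.ones((8, 8)))  # CPU run; pass cupy to move to a GPU
print(float(profile[0]))  # 1.0
```

The same function body then serves laptop prototyping and accelerated production, which is the porting path the strategy above describes.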

Advantages of adopting accelerated computing to enable live-steering experiments

Adopting accelerated computing to enable live-steering of scientific experiments offers numerous advantages, including:

  • Elastic scalability: The same Python code, powered by cuPyNumeric and CuPy, can be run unmodified on modest local clusters and then scaled out to exascale resources or supercomputer nodes when needed.
  • Shorter time to insight: Accelerated networking and device-level parallelism mean data is processed as it arrives, enabling discoveries, experiment steering, or event detection on timescales aligned with the instrumentation.
  • Resource optimization: High-density, energy-efficient DGX Spark nodes deliver performance comparable to large-scale cluster racks in a compact office footprint.
  • Unified memory: Unlocks higher performance and flexibility to accelerate CPU-GPU workflows. With NVLink-C2C, CPU and GPU share a single virtual address space for large data structures, up to 128 GB, with very high bandwidth, low latency, and concurrency. For physics-informed AI, this means simpler code and higher sustained throughput that is not constrained by a slower, higher-latency PCIe link.
  • Collaborative science: Teams benefit from shared data, distributed compute jobs, and rapid workflow iteration, which is crucial for multi-institutional research, experiment repeatability, and open science.

Start with accelerated computing for science

XANI, cuPyNumeric, CuPy, and the broader NVIDIA accelerated computing stack are already powering production-scale astrophysics and ultrafast X-ray science. The same open source Python libraries and NVIDIA platforms are available for any researcher or developer to adopt in their own workflows.

XANI, CUDA Python, cuPyNumeric, and CuPy demonstrate a generational leap in scientific computing capability for exascale-era facilities such as the Rubin Observatory and LCLS-II. By merging local desktop-class hardware, scalable server infrastructure, scalable software, and high-performance networking, researchers can develop, test, and deploy massive data workflows faster and more flexibly than ever before. Whether analyzing a single sky survey or orchestrating a global experiment, NVIDIA accelerated computing empowers science teams to achieve real-time insight and discovery.

Start with CUDA Python, cuPyNumeric, and CuPy.

Learn more on the NVIDIA GTC AI Conference with the session, Accelerated HPC+AI Workflow Enables Live-Steering of Vera C. Rubin Observatory and X-ray Free Electron Laser [S81766].

Acknowledgments

Thanks to Yusra AlSayyad and Nate Lust (Princeton University); Adam Bolton, Seshu Yamajala, and Jana Thayer (SLAC National Accelerator Laboratory); and Lucas Erlandson, Emilio Castillo Villar, Malte Foerster, and Irina Demeshko (NVIDIA) for their contributions.


