Accelerating Pharma R&D with AI-Powered Structural Intelligence




This summer, SandboxAQ released the Structurally Augmented IC50 Repository (SAIR), the largest dataset of co-folded 3D protein-ligand structures paired with experimentally measured IC₅₀ labels, directly linking molecular structure to drug potency and overcoming a longstanding scarcity of training data. The dataset is now available on Hugging Face, and for the first time, researchers have open access to more than 5 million AI‑generated, high‑accuracy protein-ligand 3D structures, each paired with validated empirical binding potency data.


SAIR is an open dataset, publicly available at no cost under a permissive CC BY 4.0 license, making it immediately actionable for commercial and non-commercial R&D pipelines. More than just a dataset, SAIR is a strategic asset that bridges the long-standing data gap in AI-powered drug design. It empowers pharmaceutical, biotech, and tech‑bio leaders to speed up R&D, expand target horizons, and supercharge AI models, moving more of the costly, lengthy drug design and optimization process from the wet lab to in silico. This means shorter hit‑to‑lead timelines, more efficient lead optimization, fewer dead‑end projects, and a more predictable path from initial idea to clinical candidate.

AI and computer-aided design have great potential to dramatically accelerate the development of new drugs. For years, scientists have dreamed of AI that could discover or design a potent, non-toxic, and efficacious compound from a prompt describing the disease pathway, practically compressing years of drug R&D into a few minutes on a computer. However, this vision is bottlenecked by AI’s ability to predict critical drug properties such as potency and toxicity based solely on a compound’s molecular structure.

Moreover, traditional structure‑based discovery is often slowed early on by the need to determine reliable 3D structures. Three‑dimensional molecular structure dictates a molecule’s functionality, dynamics, and interactions, which is especially important when a potential drug candidate is expected to bind to a human protein target.

Experimental methods, such as X-ray crystallography and cryo-EM, require extensive time and investment, and many promising disease targets still lack experimentally validated structural information. Computer simulations have helped lower the barrier to obtaining 3D structures and predicting binding affinity. However, earlier generations of algorithms for protein folding and docking (like AlphaFold and Vina, respectively) only predict static snapshots of molecules and proteins, which in reality are inherently dynamic and shape-changing.

SAIR addresses that constraint by compiling over 1 million unique computationally co-folded protein–ligand pairs, ultimately yielding 5.24 million distinct 3D complexes (five different co-folded structures per pair). Each structure is paired with a curated IC₅₀ measurement from ChEMBL or BindingDB, providing for the first time a scalable link between high-quality 3D structures and drug potency, and bridging the historic data gap that has hindered AI-driven discovery. Deep-learned affinity models such as Boltz-2, trained on similar data, have been shown to yield up to a 1,000x speed-up over the standard first-principles approach.

Creating SAIR was a major feat of high-performance AI computing. It took more than 130,000 GPU hours to compute the SAIR dataset using Boltz-1, a co-folding AI model, on a cluster of 760 NVIDIA H100 GPUs, leveraging NVIDIA DGX Cloud through Google Cloud Platform.

Capturing highly granular node, operator, scheduler, and GPU metrics, together with a close collaboration on both infrastructure and workload optimization, helped the NVIDIA AI Accelerator and SandboxAQ engineering teams identify bottlenecks and tune configurations to achieve the highest workload throughput.

As a result, the two teams were able to achieve more than 95% GPU compute utilization while generating the SAIR dataset. This enabled us to create SAIR in three weeks, versus the original estimate of three months (more than a 4x speed-up), and resulted in a highly optimized, GPU-native computational workflow that integrates seamlessly with today’s cutting-edge enterprise compute environments.
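
The compute figures quoted above can be turned into a rough per-structure cost. The sketch below is only back-of-the-envelope arithmetic on the article's rounded numbers (130,000 GPU hours, 5.24 million complexes, three weeks versus three months); the conversion of one month to 13/3 weeks is an assumption made here, not a figure from SandboxAQ.

```python
# Rounded figures quoted in the article (both are stated as "more than").
GPU_HOURS = 130_000
NUM_COMPLEXES = 5_240_000

# Average GPU time spent per generated 3D complex, in seconds.
seconds_per_complex = GPU_HOURS * 3600 / NUM_COMPLEXES

# Assumption: one month is treated as 13/3 weeks (a quarter has ~13 weeks).
WEEKS_PER_MONTH = 13 / 3
estimated_weeks = 3 * WEEKS_PER_MONTH   # original estimate: three months
actual_weeks = 3                        # reported: three weeks
speedup = estimated_weeks / actual_weeks

print(f"GPU seconds per complex: {seconds_per_complex:.1f}")
print(f"Schedule speed-up: {speedup:.2f}x")
```

Under these assumptions the average cost comes out at roughly a minute and a half of GPU time per complex, and the schedule ratio is consistent with the "more than 4x" claim.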

Generating such an enormous volume of data is only half the story. Equally important is confidence in its quality, which is why every predicted complex underwent rigorous validation with PoseBusters, an industry-standard, open-source tool for benchmarking structure-related AI in drug discovery. This tool checks chemical sanity and physical plausibility.

The end result was that 97 percent of SAIR’s structures passed all checks. In addition to PoseBusters validation, we benchmarked leading affinity prediction methods, such as empirical scoring functions, 3D CNNs, and graph neural networks, across SAIR’s synthetic structures and experimental IC₅₀ values. The detailed results of those studies are available in our scientific manuscript on bioRxiv.
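
To make the "passed all checks" criterion concrete: a structure counts as valid only if every individual check succeeds. The sketch below shows how such a pass rate could be computed from per-check boolean results. It does not call PoseBusters itself, and the structure IDs and check names are made up for illustration; they are not taken from SAIR or from the PoseBusters report format.

```python
def pass_all_rate(results):
    """Return the fraction of structures for which every check is True.

    `results` maps a structure ID to a dict of {check_name: bool}.
    """
    if not results:
        raise ValueError("no structures to evaluate")
    passed = sum(1 for checks in results.values() if all(checks.values()))
    return passed / len(results)


# Hypothetical example data (illustrative names only).
example = {
    "complex_a": {"bond_lengths_ok": True, "no_steric_clash": True},
    "complex_b": {"bond_lengths_ok": True, "no_steric_clash": False},
    "complex_c": {"bond_lengths_ok": True, "no_steric_clash": True},
    "complex_d": {"bond_lengths_ok": True, "no_steric_clash": True},
}

rate = pass_all_rate(example)
print(f"{rate:.0%} of structures passed all checks")
```

A single failed check is enough to exclude a structure, which is why an all-checks pass rate is a stricter figure than any per-check pass rate.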

SAIR data is a reliable foundation for benchmarking new models as well as for downstream modeling, screening, and design.

A persistent challenge in drug discovery is the “dark proteome,” the disease‑relevant proteins for which experimental structures simply don’t exist. SAIR illuminates these uncharted regions by providing credible, AI‑predicted complexes wherever experimental data is scarce. For example, more than 40 percent of the proteins in the SAIR dataset have no available structures in the Protein Data Bank (PDB) at all, with or without a ligand. SAIR addresses one of the biggest challenges with existing AI models: low generalizability due to data scarcity. With SAIR, scientists can now explore targets that were previously deemed undruggable, armed with structural hypotheses to guide virtual screening and lead optimization using trustworthy model predictions.

Furthermore, SAIR’s cross‑target breadth uncovers polypharmacology patterns and elucidates how a single molecule might interact with multiple proteins. Leveraging this rich tapestry of interactions, you can train AI models to predict off‑target effects or identify new repurposing opportunities, equipping your organization with a deeper understanding of compound profiles before any lab work begins.
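
As a starting point for exploring such multi-target patterns, one could group protein–ligand pairs by ligand and keep the ligands that appear with more than one protein. The following is a minimal sketch on made-up identifiers; it assumes each record can be reduced to a (ligand, protein) pair and says nothing about how SAIR actually names or stores these fields.

```python
from collections import defaultdict


def multi_target_ligands(pairs, min_targets=2):
    """Map each ligand seen with at least `min_targets` distinct proteins
    to the sorted list of those proteins."""
    by_ligand = defaultdict(set)
    for ligand_id, protein_id in pairs:
        by_ligand[ligand_id].add(protein_id)
    return {
        ligand: sorted(proteins)
        for ligand, proteins in by_ligand.items()
        if len(proteins) >= min_targets
    }


# Hypothetical (ligand, protein) pairs, for illustration only.
pairs = [
    ("ligand_1", "protein_A"),
    ("ligand_1", "protein_B"),
    ("ligand_2", "protein_A"),
    ("ligand_1", "protein_A"),  # repeated measurement of the same pair
]

print(multi_target_ligands(pairs))
```

Using a set per ligand means repeated measurements of the same pair do not inflate the target count.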




Accessing SAIR

SAIR is freely available on Hugging Face. Here’s a quick guide to pull SAIR from Hugging Face, peek at the main table, and (optionally) download a few structure archives.



1. Install essentials

We use huggingface_hub to fetch files from the Hub, and pandas with pyarrow to read the Parquet file.

pip install huggingface_hub pandas pyarrow



2. Authenticate

Authenticate to Hugging Face:

import huggingface_hub
huggingface_hub.login(token="your_auth_token")



3. Load the main table (sair.parquet)

This grabs the file from the Hub and loads it into a DataFrame.

from huggingface_hub import hf_hub_download
import pandas as pd

parquet_path = hf_hub_download(
    repo_id="SandboxAQ/SAIR",
    filename="sair.parquet",
    repo_type="dataset"
)

df = pd.read_parquet(parquet_path)
df.head()
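
Potency values such as IC₅₀ span several orders of magnitude, so they are commonly analyzed on a negative log scale (pIC₅₀). The helper below is a small sketch of that standard conversion, assuming the value is expressed in nanomolar. The unit and the column name `ic50_nm` in the commented usage line are assumptions for illustration, so check the dataset card for the actual column names and units before applying it.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive number")
    # 1 nM = 1e-9 mol/L, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm)
    return 9.0 - math.log10(ic50_nm)


# Hypothetical usage on the table loaded above ("ic50_nm" is a placeholder
# column name, not necessarily one that exists in sair.parquet):
# df["pic50"] = df["ic50_nm"].map(ic50_nm_to_pic50)
```

On this scale, higher values mean a more potent compound, and a tenfold drop in IC₅₀ corresponds to an increase of one pIC₅₀ unit.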



4. (Optional) List available structure archives

Structure files are shipped as many .tar.gz archives under structures_compressed/. List them and pick the ones you need.

from huggingface_hub import list_repo_files

files = [f.split("/")[-1] for f in list_repo_files("SandboxAQ/SAIR", repo_type="dataset")
         if f.startswith("structures_compressed/") and f.endswith(".tar.gz")]
files[:5]
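
Archive names such as sair_structures_100623_to_111511.tar.gz appear to encode a numeric range. Assuming those two numbers are the inclusive bounds of the entry indices stored in each archive (an inference from the file names, not something confirmed here), a small helper can narrow down which archives to download for a given entry. The function names below are hypothetical.

```python
import re

# Assumed naming pattern, inferred from the archive file names.
ARCHIVE_RE = re.compile(r"sair_structures_(\d+)_to_(\d+)\.tar\.gz$")


def archive_range(name):
    """Parse the (start, end) numbers out of an archive file name."""
    match = ARCHIVE_RE.search(name)
    if match is None:
        raise ValueError(f"unexpected archive name: {name}")
    return int(match.group(1)), int(match.group(2))


def archives_covering(names, entry_id):
    """Return the archive names whose assumed range contains entry_id."""
    hits = []
    for name in names:
        start, end = archive_range(name)
        if start <= entry_id <= end:
            hits.append(name)
    return hits


example_names = [
    "sair_structures_1006049_to_1016517.tar.gz",
    "sair_structures_100623_to_111511.tar.gz",
]
print(archives_covering(example_names, 1010000))
```

If the ranges turn out to mean something else, only `archives_covering` needs to change; the parsing step stays the same.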



5. (Optional) Download and extract structures

Each archive can be large (≈10 GB). Download only the ones you need and extract them locally.

import os
import tarfile
from huggingface_hub import hf_hub_download

# Local folder that will hold the extracted structures
dest = "sair_structures"
os.makedirs(dest, exist_ok=True)

# Archive names to fetch (basenames, as listed in step 4)
to_get = [
    "sair_structures_1006049_to_1016517.tar.gz",
    "sair_structures_100623_to_111511.tar.gz",
]

for name in to_get:
    # Download one archive into dest
    tar_path = hf_hub_download(
        repo_id="SandboxAQ/SAIR",
        filename=f"structures_compressed/{name}",
        repo_type="dataset",
        local_dir=dest,
        local_dir_use_symlinks=False,
    )
    # Unpack it, then delete the archive to free disk space
    with tarfile.open(tar_path, "r:gz") as tar:
        tar.extractall(dest)
    os.remove(tar_path)

A full version of this script, including more robust logging and validation, is available in the README file for your convenience. For more details, visit the SAIR homepage, read our manuscript on bioRxiv, or watch our 25-minute joint webinar with NVIDIA, where we demonstrate SAIR and explain how its data is structured. Extensive documentation, tutorials, and example benchmarks are available to facilitate its use and speed up internal adoption.

The future of drug discovery is data-driven, AI-accelerated, and grounded in scalable, high-quality structural insights. While we don’t yet have AI that can design effective drug therapies from just a prompt, SAIR brings researchers ever closer to that goal with new data and insights that could potentially shave years from even AI-accelerated R&D pipelines.

We can’t wait to see what researchers will build using SAIR, and SandboxAQ experts are here to support them throughout the discovery process.




Questions?

Contact the authors or post on the SAIR dataset discussion page.

Authors: Arman Zaribafiyan, Georgia Channing, Zane Beckwith, and Rudi Plesch


