Biotech Platform Publishes Open Dataset to Accelerate Drug Discovery

PALO ALTO, Calif., June 18, 2025

SandboxAQ, an enterprise AI company delivering solutions at the nexus of AI and quantum technologies, today announced the public release of the Structurally Augmented IC50 Repository (SAIR), the largest open-source dataset of co-folded 3D protein-ligand structures paired with experimentally measured binding potency data. Available under a permissive CC BY 4.0 license, SAIR provides the global research community with unprecedented access to more than 5 million high-accuracy, AI-generated structural models to train advanced machine learning models for drug discovery applications. The release addresses a critical bottleneck in pharmaceutical R&D, where data scarcity and high costs have historically limited AI applications in early-stage drug development.

The SAIR dataset tackles a fundamental challenge that has constrained AI model development for decades: the scarcity of high-quality, experimentally validated protein-ligand binding data linked to three-dimensional structural information. Traditional drug discovery relies on expensive and time-consuming experimental assays to measure compound potency, with pharmaceutical companies spending an estimated $2.6 billion per successful drug, according to the Tufts Center for the Study of Drug Development.

By providing experimentally validated IC50 labels directly linked to AI-generated 3D molecular structures, SAIR enables researchers to develop predictive models that deliver accurate potency assessments essential for early-stage drug screening and lead optimization. Each entry represents a synthetically generated, high-fidelity protein-ligand complex structure validated against empirical binding data, creating a robust bridge between structural biology and pharmacological activity.

Research from open science initiatives suggests that comprehensive public datasets can compress research timelines from months to days by eliminating the data generation bottlenecks that typically slow early discovery. SandboxAQ’s internal benchmarking shows that AI models trained on SAIR generate predictions at least 1,000 times faster than conventional physics-based simulation methods while maintaining accuracy comparable to experimental assays. This acceleration enables pharmaceutical and biotech companies to evaluate vastly larger chemical spaces during hit identification and to prioritize high-potential candidates before committing to costly wet-lab validation. The dataset spans multiple target classes, including kinases, GPCRs, and proteases relevant to oncology, immunology, and neurodegeneration, with particular strength in historically challenging protein families once considered “undruggable.”

“SAIR represents a paradigm shift in computational drug design,” said Dr. Nadia Harhen, General Manager of Life Sciences at SandboxAQ. “For the first time, researchers worldwide have free access to a production-scale dataset that links 3D structural information directly to binding potency. By democratizing this resource, we’re empowering smaller biotechs and academic labs to compete on equal footing with large pharmaceutical companies, ultimately accelerating delivery of novel therapeutics to patients.” The release reflects a broader industry movement toward open science collaboration, with emerging frameworks for responsible data sharing demonstrating that shared infrastructure can expand the total addressable market for AI-driven drug discovery while accelerating development timelines.

The launch comes as the AI-enabled drug discovery market is projected to reach $20.3 billion by 2030, according to Grand View Research, driven by escalating demand for technologies that improve research efficiency and reduce development costs. SAIR is already integrated into R&D pipelines at leading institutions including UCSF’s Institute of Neurodegenerative Diseases, Riboscience, and the Michael J. Fox Foundation. Early adopters report that models trained on SAIR achieved hit rates three to five times higher than conventional virtual screening approaches in pilot studies, validating the dataset’s utility for real-world applications. The dataset is publicly available on Hugging Face and Google Cloud Platform, with comprehensive documentation and pre-trained model weights to facilitate rapid adoption across commercial and non-commercial research environments.

SandboxAQ developed SAIR using its proprietary Large Quantitative Models, which combine quantum-inspired algorithms with deep learning to generate structurally accurate protein-ligand complexes at scale. The generation process incorporates physical constraints from quantum mechanics and experimental binding affinity data from multiple sources, ensuring both structural realism and biological relevance. Researchers can access SAIR at sandboxaq.com/sair and collaborate with SandboxAQ’s scientific team to apply advanced AI models to challenging therapeutic targets. The company plans quarterly updates, with coverage expanding to protein-protein interactions and allosteric binding sites by Q4 2025.

“We’re witnessing unprecedented demand for AI-driven solutions that deliver measurable impact on pharmaceutical development success rates,” said Dr. Harhen. “SAIR directly addresses the data scarcity problem that has prevented many AI drug discovery initiatives from reaching their full potential. Our partners are already seeing superior hit rates and shorter hit-to-lead timelines by combining SAIR with our quantitative AI platform.”

About SandboxAQ:

SandboxAQ is an enterprise AI company delivering solutions at the intersection of artificial intelligence and quantum techniques. Originally incubated at Alphabet Inc., the company operates as an independent, growth-backed organization funded by leading institutional investors including funds advised by T. Rowe Price Associates, Alger, IQT, US Innovative Technology Fund, S32, Paladin Capital, BNP Paribas, and strategic individuals including Eric Schmidt, Ray Dalio, Marc Benioff, Thomas Tull, and Yann LeCun. The company’s Large Quantitative Models provide critical advances in life sciences, financial services, navigation, and other sectors requiring complex quantitative modeling. For more information, visit sandboxaq.com.

Media Contact

Sarha Al-Mansoori
Director of Corporate Communications
G42
Email: media@g42.ai
Phone: +971 2555 0100
Website: www.g42.ai