OpenBind releases initial structure and affinity dataset for structure-based machine learning
Key Takeaways
- OpenBind is generating dense, high-quality protein-ligand datasets that link structural data with binding measurements at a large scale.
- This release provides a realistic, structurally novel testbed linking high-resolution crystallographic data directly with biophysical binding affinity.
- The initial release targets the Enterovirus A71 (EV-A71) 2A protease (modeled via a closely related Coxsackievirus A16 surrogate system), an essential viral replication enzyme and key target for pandemic preparedness.
- By providing a coherent, experimentally determined target dataset that is not already part of public training sets, this release provides a benchmark for testing docking, cofolding, and affinity-prediction methods.
OpenBind is the UK’s open science initiative to create the world’s largest dataset of drug-protein interactions led by collaboration among a robust team of researchers, including Frank DiMaio, Mohammed AlQuraishi, John Chodera, and Karmen Čondić-Jurkić. It has released its first public dataset linking protein-ligand structures with experimental binding affinities to improve structure-based machine learning models.
Training machine learning models requires high-quality experimental data
The next generation of structure-based machine learning methods requires better experimental data rather than just new computational architectures. While artificial intelligence has improved predictive accuracy for protein structures, its impact on drug discovery is constrained by a global shortage of reliable experimental data. Current structure-based tools need datasets that accurately tie atomic-level protein structures directly to physical binding measurements.
To resolve this limitation, OpenBind is generating dense, high-quality protein-ligand datasets that link structural data with binding measurements at a large scale. These open, standardized datasets are designed to support model training, fine-tuning, benchmarking, and error analysis. By providing data where the learning signal is high, the initiative targets compound series and viral targets that are under-represented or difficult for existing computational methods to model.
Building an operational pipeline for structural and affinity data
The first phase of the OpenBind initiative focused on establishing the practical infrastructure needed to generate high-quality data repeatedly and rapidly. This pipeline coordinates target preparation, crystallographic workflows, compound progression, data processing, and affinity measurement.
The structural data in this initial release builds upon earlier research from the AI-driven Structure-enabled Antiviral Platform (ASAP) Discovery Consortium. To generate the data, high-throughput experimentation capabilities at Diamond Light Source were combined with state-of-the-art computational modeling, allowing OpenBind to efficiently optimize crystallographic workflows and compound progression to link a dense fragment screen directly with affinity measurements.
A unified dataset for viral protease targets
The first public dataset from OpenBind focuses on a target from a major public health threat: Enterovirus A71 (EV-A71) 2A protease. All experimental data were generated using Coxsackievirus A16 (CVA16) 2A protease as a surrogate system for the EV-A71 2A protease. These two proteins differ in only five positions in the amino acid sequence, none of which are close to the active site.
The completed dataset contains 925 crystallographic binding events derived from 699 compounds, which were discovered through a crystallographic fragment screen and follow-on chemistry. Alongside these structures, the release includes associated binding affinity measurements for 601 of the compounds. These affinities are reported as KD values measured using the Creoptix WAVEsystem, explicitly pairing the physical structure of each protein-ligand complex with its exact binding strength.
Benchmarking and evaluating structure-based AI tools
By providing a coherent, experimentally determined target dataset that is not already part of public training sets, this release provides a benchmark for testing docking, cofolding, and affinity-prediction methods. Researchers can use the dataset to evaluate how well models learn within a single chemical series, how pose and affinity predictions respond to local chemical changes, and where current computational methods fail. OpenBind has released reference benchmarks on GitHub alongside the raw data on Zenodo and Fragalysis to allow the community to directly evaluate and troubleshoot their machine learning models.
Data: Zenodo / Fragalysis
Benchmarks: GitHub
Experimental protocols: OpenBind protocols.io Workspace
Further information about OpenBind: https://openbind.uk/
