a comprehensive public dataset for drug-target interaction modeling

Elaine McVey Houskeeper

Georgia Channing

Therapeutic drugs are ubiquitous, with billions of prescriptions written worldwide every year. Within the last 30 days, half of Americans have taken at least one prescription medication, and a quarter have taken at least three (CDC). Despite this, we know relatively little about the full scope of effects these drugs have on the human body. Pharmaceutical companies focus their efforts on making new drugs that are safe and effective, typically with a single “target” in mind. The drug development process doesn’t include comprehensively understanding all the effects a drug can have on other proteins it might encounter in the body.

Imagine a matrix of every approved drug against every potential protein a drug can act on (the “human druggable genome”). Until now, our knowledge of drug activities has been an incredibly sparse version of this matrix. We often (but not always!) know the primary target of a drug and what effect it has on that targeted protein. Sometimes we know about a handful of other “off-target” effects. But most of the drug-target matrix is a mystery.

Imagine instead that we have a measurement for every single one of these drug-target combinations that either confirms the drug is inactive at a particular target, or quantitatively characterizes its activity. This is the pharmome map.

What could we do if we had this map? Understanding all the effects of a drug means we can address questions such as:

  • What patterns of activity are associated with particular adverse events (AE modeling)?
  • Is this drug’s mechanism of action simply via the known “on-target” activity, or are its effects driven by interactions across multiple targets (polypharmacology)?
  • What do patterns of activity suggest about the effects of drug combinations (polypharmacy)?
  • What drugs act on targets that might make them suitable for treating new indications (drug repurposing)?
  • Can we predict activity patterns across targets for novel compounds? (structure-activity relationship modeling)

While the pharmome map has been only sparsely known so far, the breadth and depth of data available to join to the pharmome map is vast: clinical trials (ClinicalTrials.gov), adverse event surveillance (FAERS), individual health records (UK Biobank), and many -omics databases. A complete pharmome map unlocks new ways to understand these outcomes and relate them back to drug activity.




Mapping is underway!

EvE Bio is a non-profit (a Focused Research Organization under Convergent Research) that is generating the pharmome map and putting it in the public domain. EvE develops assays in a single format for the members of each target class, then carries out a quantitative high-throughput screening and profiling process that provides the final measurements. By approaching dataset creation as the primary goal, EvE is able to provide the kind of comprehensive and consistently generated dataset that is ideal for machine learning. This public dataset is already the largest of its kind, and is actively expanding with new data added every other month.

EvE is currently focused on the portion of the pharmome map representing a 1,397-member compound library, primarily composed of FDA-approved small molecule drugs, measured against key classes of drug targets. These target classes were chosen because they are therapeutically relevant, druggable by small molecules, and addressable at scale by in vitro assays. The three target classes included are nuclear receptors (NRs), 7-transmembrane receptors (7TMs, aka GPCRs), and protein kinases (PKs). Collectively these cover the intended targets for more than half of FDA-approved small molecule drugs. Small molecule drugs are those typically available in traditional pill form, such as statins, tamoxifen, and metformin. They are well suited to high-throughput screening and profiling approaches.



Target Classes

Each target class plays a critical role in physiology and pharmacology, and each is addressed with a distinct assay format that is considered best suited to high-throughput screening for physiologically relevant activity. While the same types of response measures are collected across classes, understanding what is unique about each target class will inform how the data is used. (Details beyond what is included here can be found on EvE Bio’s methods site.)

Nuclear receptors directly regulate gene expression, controlling which proteins get made and influencing the future behavior of a cell. This is a small (<50 members) but highly impactful receptor class, representing the targets for more than 10% of approved small molecule drugs. NRs are activated by ligand binding, with ligand binding domains that have collectively evolved to bind a diverse set of small molecules. This is advantageous for drug targeting, because it enables selective design not only for specific NRs but also for the type and degree of activation desired. Drugs can be full or partial agonists (increasing the receptor’s activity over the basal level), antagonists (blocking agonism of the receptor), or inverse agonists (reducing the basal activity level). NR activity is measured with biochemical co-factor recruitment assays that reflect the conformational changes induced by ligand binding. These assays are individually configured for agonist and antagonist modes.

7TMs, also known as G-protein coupled receptors (GPCRs), sense a wide range of extracellular signals and translate them into intracellular responses, effectively telling the cell what is happening around it. More than a third of FDA-approved drugs target 7TMs, and these address a wide range of therapeutic areas. 7TMs are a large target class that has evolved to sense a diversity of molecules, making them exceptionally druggable. They have multiple binding sites with diverse ligand possibilities, with the potential for selective activation (via biased agonists that preferentially activate one pathway over another), and the potential for ligands to regulate agonism and antagonism. Since 7TMs are on the cell surface, drugs need not cross the cell membrane to access them. 7TM activity is measured with cell-based assays that are configured for agonist and antagonist modes.

Protein kinases (PKs) are enzymes that catalyze phosphorylation, effectively controlling many molecular “switches” inside cells. Since these switches precisely regulate a wide range of critical processes, kinases enable computational complexity inside cells via feedback loops, cascades, and signal integration. PKs are a more recent and rapidly growing set of targets for FDA-approved drugs and are particularly relevant to cancer, where mutations can lead to dysregulated activation. These disease-relevant mutations are included in the pharmome map wherever possible. PK activity is measured with biochemical competition-based ligand binding assays in a single mode (inhibition).

In 2026, the number of 7TM and PK targets in the EvE pharmome mapping dataset will increase ~3x. This will include the addition of G-protein as well as β-arrestin data for 7TMs, which will allow for modeling of biased signaling via these two pathways. (Modern pharmacology experts consider the quantification of biased signaling a key opportunity for improved drug design.)

In addition to NRs, 7TMs, and PKs, the data includes measurements of cell viability for each compound (labeled as target class “Viability”), based on an assay that measures ATP production. These results reflect cytotoxic effects, which are meaningful endpoints in themselves with regard to compound activity. Additionally, it is critical to interpret 7TM antagonism data in the context of viability results, as cell death can masquerade as antagonism in cell-based assays.



Data Structure

The key response variables are compound activity and potency. Binary activity and maximum observed activity are captured for every compound-assay combination (outcome_is_active, outcome_max_activity). Activity is expressed as a % of maximum activity, in reference to known standard compounds for each assay. For active compounds that have sufficient potency to be measurable within the concentration range tested (is_quantified), four-parameter logistic curve fits result in quantified potency, measured as pXC50 (outcome_potency_pxc50). pXC50 is the negative log of the IC50/EC50, the concentration at which half of the maximum activity is reached. Higher pXC50s mean higher potency, and 5 is the lowest quantifiable pXC50 within the concentration range used.
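
To make the pXC50 scale concrete, here is a minimal sketch of the conversion and of a textbook four-parameter logistic curve. It assumes a base-10 log of a molar concentration, which is consistent with a pXC50 of 5 corresponding to 10 μM, and the standard Hill-style parameterization; the exact form EvE fits is not stated here, and all function names are illustrative.

```python
import math


def pxc50_from_molar(xc50_molar: float) -> float:
    """Return pXC50, the negative base-10 log of a molar IC50/EC50."""
    if xc50_molar <= 0:
        raise ValueError("XC50 must be a positive concentration in mol/L")
    return -math.log10(xc50_molar)


def molar_from_pxc50(pxc50: float) -> float:
    """Invert a pXC50 back to a molar concentration."""
    return 10.0 ** (-pxc50)


def four_pl(conc_molar: float, bottom: float, top: float,
            pxc50: float, hill: float) -> float:
    """Textbook four-parameter logistic response (% of maximum activity).

    Assumed parameterization for illustration, not necessarily EvE's.
    """
    xc50 = molar_from_pxc50(pxc50)
    return bottom + (top - bottom) / (1.0 + (xc50 / conc_molar) ** hill)


# A hypothetical compound with an EC50 of 100 nM has a pXC50 of about 7.
potency = pxc50_from_molar(100e-9)

# At a concentration equal to the XC50, the response sits halfway
# between the bottom and top plateaus.
half_response = four_pl(molar_from_pxc50(potency), bottom=0.0, top=100.0,
                        pxc50=potency, hill=1.0)
```

Because the scale is logarithmic, each unit of pXC50 is a tenfold change in potency: a pXC50 of 8 means the half-maximal concentration is ten times lower than at a pXC50 of 7.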


To collect these measurements, EvE uses a two-phase quantitative screening process. All combinations of compounds and assays are included in the screening phase, which includes two replicates of three concentrations. A rules-based progression algorithm determines which compounds advance to the profiling phase, where the concentration range is 10 μM to 10 pM. The full concentration response is effectively censored by the concentration range tested. For low potency compounds, this leads to results that are reported as active, but not quantified.
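
In practice this gives each row one of three potency states. The sketch below labels them from the outcome_is_active and is_quantified columns; it assumes both are stored as booleans and that unquantified rows carry no pXC50 value, neither of which is confirmed here. The example rows are made up.

```python
def potency_status(row: dict) -> str:
    """Label a row by how much is known about the compound's potency."""
    if not row["outcome_is_active"]:
        return "inactive"
    if row["is_quantified"]:
        return "active, quantified"
    # Active, but the XC50 lies outside the tested concentration range.
    return "active, not quantified"


# Synthetic rows for illustration only (not real measurements).
rows = [
    {"outcome_is_active": False, "is_quantified": False, "outcome_potency_pxc50": None},
    {"outcome_is_active": True, "is_quantified": False, "outcome_potency_pxc50": None},
    {"outcome_is_active": True, "is_quantified": True, "outcome_potency_pxc50": 6.8},
]
statuses = [potency_status(r) for r in rows]
```

One modeling option is to treat the "active, not quantified" rows as censored observations (pXC50 below 5) rather than as missing values, since they still carry information about potency.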

In addition to cytotoxicity, compounds can interfere with assays in various ways, leading to potentially spurious results. Compounds that appear with suspicious frequency for any given target class and mode are flagged as “high frequency”. They could be removed from the data before model development, but in some cases true activity would be lost in the process. Alternatively, this frequency flag could be treated as a response in itself, in order to develop models that link compound and concentration response characteristics with particular types of interference. Columns that flag combinations where either cell viability or hit frequency merit consideration are included in the dataset (viability_flag, frequency_flag).
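
Both strategies can be sketched in a few lines. The example assumes viability_flag and frequency_flag hold values that are truthy when a row is flagged; the actual encoding should be checked in the dataset. The "compound" key and the rows are hypothetical.

```python
def split_by_flags(rows):
    """Separate rows with no flags from rows carrying either flag."""
    clean, flagged = [], []
    for row in rows:
        is_flagged = bool(row.get("viability_flag")) or bool(row.get("frequency_flag"))
        (flagged if is_flagged else clean).append(row)
    return clean, flagged


# Synthetic rows for illustration only.
rows = [
    {"compound": "compound_A", "viability_flag": False, "frequency_flag": False},
    {"compound": "compound_B", "viability_flag": True, "frequency_flag": False},
    {"compound": "compound_C", "viability_flag": False, "frequency_flag": True},
]

# Strategy 1: drop flagged rows before model development.
clean, flagged = split_by_flags(rows)

# Strategy 2: keep every row and use the frequency flag as a label.
interference_labels = [int(bool(r["frequency_flag"])) for r in rows]
```

Keeping the flagged rows in a separate list, rather than discarding them, makes it easy to check later how much true activity the stricter filter would have removed.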

The dataset contains one row per combination of target, compound, mode, and mechanism (currently there is only one mechanism per target class, but this will change when data for both signaling pathways is added for 7TMs in 2026). NRs and 7TMs have two modes each, while PKs and cell viability have one. Multiple identifiers are included for both compounds and targets. For compounds: SMILES (a text-based chemical representation), InChIKey, CAS #, UNII, and DrugBank ID. For targets: gene, UniProt ID, and mutant/wildtype indicators.
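
This long format can be reshaped into the drug-target matrix described at the start. A minimal sketch follows, in which the "compound", "target", and "mode" keys and the "agonist" mode value are hypothetical stand-ins for the real identifier columns, and the rows are made up. It ignores mechanism, which is only safe while there is one mechanism per target class.

```python
from collections import defaultdict


def build_activity_matrix(rows, mode):
    """Nest rows as matrix[compound][target] = outcome_max_activity for one mode."""
    matrix = defaultdict(dict)
    for row in rows:
        if row["mode"] != mode:
            continue
        matrix[row["compound"]][row["target"]] = row["outcome_max_activity"]
    # Convert to plain dicts so missing keys raise instead of silently appearing.
    return {compound: dict(targets) for compound, targets in matrix.items()}


# Synthetic rows for illustration only.
rows = [
    {"compound": "compound_A", "target": "target_X", "mode": "agonist", "outcome_max_activity": 85.0},
    {"compound": "compound_A", "target": "target_Y", "mode": "agonist", "outcome_max_activity": 2.0},
    {"compound": "compound_B", "target": "target_X", "mode": "agonist", "outcome_max_activity": 0.0},
    {"compound": "compound_B", "target": "target_X", "mode": "antagonist", "outcome_max_activity": 40.0},
]
agonist_matrix = build_activity_matrix(rows, mode="agonist")
```

Building one matrix per mode keeps agonist and antagonist measurements for the same compound-target pair from overwriting each other.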



Ready to get started?

from datasets import load_dataset


ds = load_dataset("eve-bio/drug-target-activity")

Or, view the dataset on Hugging Face here.


