Thanks for the literature overview!
As for the comparison to other methods: in my mind, a transfer learning scenario would be more applicable than using a QSAR model. At least by default, REINVENT uses iterative optimization (reinforcement learning), in contrast to pre-training (Enki), so generating a mere 100 compounds will (almost) sample from the initial prior (or did you perform some sort of "warm-up" phase?), and it is no surprise that you don't get any enrichment. Lastly, the choice of 100 generated compounds seems extremely low: in a real-world application, when you optimize 10+ scoring components at the same time rather than a single objective, it's quite likely that only a few of your 100 molecules will strike a balance that is appealing to a medicinal chemist.
Thanks for the comment, Christian.
We ran REINVENT for 10k iterations of reinforcement learning, with the reward function defined in terms of QSAR models for the potencies of the on- and off-targets. The QSAR models were trained on the same dataset as Enki; QED was computed exactly for the REINVENT reward function, whereas Enki was required to predict QED. Each REINVENT run of 10k iterations had a batch size of 128, generating 1.28M molecules and requiring 14 hours on a single GPU. The best 100 molecules from a REINVENT run, as predicted by the QSAR models, were selected as the output and subject to evaluation of their true properties. REINVENT achieves significant enrichment compared to the initial prior or an untargeted high-throughput screening library, but is inferior to Enki.
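For concreteness, here is a minimal sketch of how the reward and the final top-100 selection fit together. The QSAR model objects, the `qed_fn` helper, and the equal weighting are hypothetical stand-ins rather than our exact configuration:

```python
import numpy as np

def reinvent_reward(smiles_batch, qsar_on, qsar_off, qed_fn):
    """Combine QSAR potency predictions and QED into a single reward.

    qsar_on / qsar_off are hypothetical QSAR regressors returning predicted
    pIC50 values; qed_fn returns a drug-likeness score in [0, 1]. The equal
    weighting below is illustrative, not the exact objective from the post.
    """
    p_on = np.asarray(qsar_on.predict(smiles_batch))
    p_off = np.asarray(qsar_off.predict(smiles_batch))
    qed = np.array([qed_fn(s) for s in smiles_batch])
    # Reward potency, selectivity (on-target minus off-target), and QED.
    return p_on + (p_on - p_off) + qed

def select_best(all_smiles, all_scores, budget=100):
    """Keep the `budget` highest-scoring molecules generated over a run."""
    order = np.argsort(all_scores)[::-1][:budget]
    return [all_smiles[i] for i in order]
```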
In the selectivity experiments reported in this post, we optimized an objective that is a combination of potency to an on-target, selectivity to a structurally related off-target, and QED. QED is itself the geometric mean of components defined in terms of the molecular weight, octanol-water partition coefficient, number of hydrogen bond donors, number of hydrogen bond acceptors, molecular polar surface area, number of rotatable bonds, number of aromatic rings, and number of structural alerts. While this is simpler than a typical preclinical target-product profile, it combines either 3 or 10 scoring components, depending upon how you count, and is thus representative of the sort of multi-property optimization required for drug discovery.
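The QED components and their aggregation can be inspected directly with RDKit; this is simply the standard RDKit QED implementation, shown for reference:

```python
from rdkit import Chem
from rdkit.Chem import QED

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an arbitrary example

# The eight underlying descriptors: MW, ALOGP, HBA, HBD, PSA, ROTB, AROM, ALERTS.
print(QED.properties(mol))

# QED itself: a weighted geometric mean of desirability functions of those descriptors.
print(QED.qed(mol))
```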
The budget of 100 compounds is intended to reflect the first iteration of hit discovery/lead optimization. Each of the compounds will generally represent a distinct chemotype and cost on the order of $1000 to synthesize, so significantly larger synthesis budgets would quickly become onerous. Enki is designed to perform multi-parameter optimization, and we have successfully satisfied over 10 potency, selectivity, and physicochemical constraints with synthesis budgets much smaller than 100 novel compounds. Additional rounds of lead optimization would certainly be required, but these rounds are generally elaborations of a single (or at most a few) scaffolds. This constitutes a different, constrained optimization problem, but also significantly reduces the cost of synthesis per molecule.
Certainly interesting results, but it is rather unusual to run 10k epochs with REINVENT; it is more typical to run < 1000 epochs and then run multiple replicates. As pointed out by Christian, real-world examples run with multiple scoring components, including biologically relevant endpoints (ADMET, PK). You have really only shown that you can improve on affinity, which I understand really just means docking scores. That's just ligand design and very far from candidate selection, not to mention actual clinical trials.
Synthesizability as estimated from a software tool may not be that informative. Have you actually synthesized any of your best 100 and assayed them?
I wonder if you would be willing to share your inputs for others to reproduce your findings.
Thanks for your comment, Hannes. We followed the practice of other prominent benchmarking efforts in our REINVENT protocol. In particular, "Practical molecular optimization" (Gao et al., 2022), "Tartarus" (Nigam et al., 2023), and "Generative models should at least be able to design molecules that dock well" (Cieplinski et al., 2023) do not appear to use multiple restarts, based upon an analysis of both the papers and the released code. However, multiple restarts is certainly a sensible strategy to combine with REINVENT, and we will explore it in our future work.
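Pooling replicates would be straightforward to layer on top of the protocol above; as a rough sketch (`run_fn` is a hypothetical wrapper around a single run, not an actual REINVENT API):

```python
import numpy as np

def pooled_top_k(run_fn, n_restarts=5, iterations=1000, budget=100):
    """Pool several shorter runs and keep the overall top-scoring molecules.

    run_fn(seed, iterations) -> (smiles_list, scores) is a caller-supplied
    wrapper around a single REINVENT run; its signature is an assumption,
    not part of the REINVENT interface.
    """
    pooled_smiles, pooled_scores = [], []
    for seed in range(n_restarts):
        smiles, scores = run_fn(seed=seed, iterations=iterations)
        pooled_smiles.extend(smiles)
        pooled_scores.extend(scores)
    order = np.argsort(pooled_scores)[::-1][:budget]
    return [pooled_smiles[i] for i in order]
```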
In the selectivity experiments reported in this post, we optimized a multi-property objective including 10 scoring components: potency to an on-target, selectivity to a structurally related off-target, and 8 physicochemical properties. This is representative of the sort of multi-property optimization required for drug discovery. For the purpose of comparing multiple generative algorithms and high-throughput screening, it is essential that these 10 properties be efficiently evaluable, so we cannot use experimental potency and ADMET measurements for this sort of benchmarking.
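One way to quantify enrichment over a screening baseline is as a ratio of success rates under the same criteria; a minimal sketch, with illustrative thresholds rather than the exact cutoffs used here:

```python
import numpy as np

def success_rate(p_on, p_off, qed, pot_min=7.0, sel_min=2.0, qed_min=0.6):
    """Fraction of molecules satisfying all criteria simultaneously.

    p_on / p_off are arrays of predicted (or measured) pIC50 values; the
    threshold values are illustrative placeholders, not the cutoffs used
    in the post.
    """
    hits = (p_on >= pot_min) & ((p_on - p_off) >= sel_min) & (qed >= qed_min)
    return float(np.mean(hits))

# Enrichment of a generated set over an untargeted screening library is then
# the ratio of the two success rates:
# enrichment = success_rate(gen_on, gen_off, gen_qed) / success_rate(lib_on, lib_off, lib_qed)
```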
When we use Enki for drug discovery, we train on experimental potency and ADMET measurements rather than docking scores, and evaluate by synthesizing and experimentally testing the generated compounds. We generally find that ~90% of our generated compounds are synthesizable, with a large fraction satisfying the potency, selectivity, and other criteria in the TPP. We look forward to sharing a case study in the near future.
Unfortunately, our dataset includes commercially licensed data, which we cannot share publicly.
Many thanks for your feedback. This is very valuable to us.
Doing proper benchmarking is a bit of an art, and it seems a no-brainer to me to do replicate runs when simulating a stochastic process. When you have 10 scoring components simultaneously, you need to take care that you have components which push in different directions. We have staged learning for this (basically successive RL, aka curriculum learning). What would be interesting from your long runs is when (in which epoch) the interesting compounds are generated, to see how efficient the process is. Also, what is the chemical space covered in the long tail of the run?
I understand the problem with sharing data but it would be good to see a public case study.
Many thanks.