100 AI-generated molecules are worth a 1,000,000 molecule high-throughput screen
The agony and the ecstasy of generative AI for small molecule drug discovery
Research conducted by the Variational AI team: Marawan Ahmed, Marshall Drew-Brook, Peter Guzzo, Ahmad Issa, Mehran Khodabandeh, Jason Rolfe, and Ali Saberali.
Drug discovery requires the identification of novel molecules that efficiently reach their site of action in the body, potently modulate a biological target that mediates disease, avoid interfering with other critical processes in the body, and are safely eliminated after an appropriate interval. The staggering difficulty of this task is reflected in the approximately 10 years and $2 billion required to develop a new drug (DiMasi, et al., 2016).
Artificial intelligence has been touted as a panacea for every step of this process, from synthesizing the biomedical literature into hypothetical drug targets (Savage, 2023), to recruiting patients for clinical trials (Hutson, 2024). Over $59 billion has been invested in companies purporting to use AI to discover new drugs. The successes claimed by these companies are often hyperbolic, but the cost of rigorous testing makes it difficult to separate fact from fiction.
In this post, we summarize the current state of small molecule hit discovery, propose an unbiased benchmark for generative AI algorithms, and show that 100 molecules created by Variational AI’s algorithm, Enki, are as effective as a high-throughput screen over 1,000,000 molecules.
Conventional approaches to hit discovery
A few drugs, including penicillin, quinine, and benzodiazepines, were stumbled upon by luck and vigilance. Others were the result of small modifications of existing drugs, designed to escape patents (Brown, 2023). When serendipity fails and incremental improvements are insufficient, rational drug discovery offers an alternative.
Rational drug discovery depends upon the identification of a biological target, generally a protein, for which inhibition, activation, or some other modulation is believed to have a beneficial medical effect. Once the target is chosen, rational discovery generally commences by evaluating a large, fixed set of molecules for potent hits against the target. This evaluation can be conducted experimentally, using biochemical or cellular assays, or virtually, using molecular docking or a QSAR (quantitative structure-activity relationship) model.
In an experimental high-throughput screen, tens of thousands to a few million drug-like but target-independent compounds, predeposited on 96-, 384-, or 1536-well plates, are tested for activity against the target at a single concentration. These measurements are subject to noise due to the formation of colloidal aggregates, direct assay interference, compound degradation or contamination, reactivity, and the like (Aldrich, et al., 2017). Consequently, most of the apparent hits in primary screens are revealed to be false positives by more accurate confirmatory and counter-screens (Shun, et al., 2011), and the median hit rate is significantly less than 1% (Jacoby, et al., 2005; Lloyd, 2020; Schuffenhauer, et al., 2005). The size of such screens can be scaled to a few billion compounds using DNA-encoded libraries (DELs), at the cost of restrictive chemistries and assays, and potential interference from the DNA labeling process (Peterson & Liu, 2023).
Virtual high-throughput screening can also expand the screening libraries to a few billion compounds by relying upon molecular docking or QSAR models in place of experimental potency assays (Gorgulla, et al., 2020). In a few such extremely large virtual screens, hit rates of 10-40% have been reported (Bender, et al., 2021), but hit rates around 10% are more typical (Damm-Ganamet, et al., 2019; Slater & Kontoyianni, 2019; Zhu, et al., 2013). The success of virtual screening can depend upon the virtuosity of the docking protocol design and manual hit picking, and the tractability of the target (Zhu, et al., 2022). Recent experiments have shown that as the size of the virtual screening library increases to one billion molecules, almost all top-ranked molecules are artifacts of the scoring function used by molecular docking (Lyu, et al., 2023).
As a conspicuous example of these difficulties, tens of thousands of papers applied virtual screening to SARS-CoV-2 over the course of the COVID-19 pandemic, but few developable leads resulted from this extensive effort (Gentile, et al., 2023; Macip, et al., 2021). Similarly, in the recent CACHE challenge to find molecules that bind to the WD40 domain of LRRK2 (a previously undrugged target), none of the 1,955 tested compounds achieved Kd < 10 μM (Ackloo, et al., 2022; results of CACHE challenge #1). Across 318 virtual high-throughput screening campaigns against challenging targets, Atomwise observed reliable experimental activity on only ~3% of their virtual hits, with some of these hits as weak as 865 μM, and no hits at all identified for 26% of the targets (Wallach, et al., 2024).
Machine learning enables many of the largest virtual screens. An ML regressor is trained on the docking scores of a small subset of the virtual molecule library, and then used as a low-cost approximation to triage the remaining compounds (Gentile, et al., 2020). The inaccuracy of this triage can be reduced by applying multiple active learning cycles, in which the machine learning regressor is fine-tuned on the true docking scores of the molecules selected in the previous round of triaging. However, accuracy is still bounded by that of molecular docking.
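The triage loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the protocol of any cited paper: `expensive_dock` is a hypothetical stand-in for a real docking run, molecules are toy 2-D feature vectors, and a k-nearest-neighbor average replaces the ML regressor (it needs no explicit refit, so "fine-tuning" reduces to adding newly docked molecules to the labeled set).

```python
import random

def expensive_dock(x):
    """Hypothetical stand-in for a docking run; lower scores are better."""
    return (x[0] - 0.3) ** 2 + (x[1] + 0.5) ** 2

def knn_predict(labeled, x, k=3):
    """Surrogate regressor: mean score of the k nearest labeled molecules."""
    def sq_dist(item):
        return sum((a - b) ** 2 for a, b in zip(item[0], x))
    nearest = sorted(labeled, key=sq_dist)[:k]
    return sum(score for _, score in nearest) / len(nearest)

def triage(library, n_seed=50, n_rounds=3, batch=50, seed=0):
    """Dock a random seed set, then repeatedly dock the surrogate's top picks."""
    rng = random.Random(seed)
    pool = list(library)
    rng.shuffle(pool)
    # Initial training set: true docking scores for a small random subset.
    labeled = [(x, expensive_dock(x)) for x in pool[:n_seed]]
    pool = pool[n_seed:]
    for _ in range(n_rounds):
        # Rank the remaining molecules with the cheap surrogate.
        pool.sort(key=lambda x: knn_predict(labeled, x))
        picked, pool = pool[:batch], pool[batch:]
        # Spend the docking budget only on the top-ranked molecules.
        labeled += [(x, expensive_dock(x)) for x in picked]
    return labeled

rng = random.Random(1)
library = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(1000)]
labeled = triage(library)
print(f"docked {len(labeled)} of {len(library)} molecules")
print(f"best score found: {min(s for _, s in labeled):.4f}")
```

Only 200 of the 1,000 toy molecules are ever docked. The surrogate decides where the remaining budget goes, which is also why the final ranking can be no more accurate than the docking scores it was trained on.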
Even the largest libraries used in DELs and ML-augmented virtual screens cover approximately 0% of the 10^23 to 10^60 synthesizable, drug-like molecules that are believed to exist (Ertl, 2003; Polishchuk, et al., 2013; Bohacek, et al., 1996). Perhaps more importantly, these screens only search for potency against a single primary target. Discovering potent hits is actually the easiest step in drug discovery, historically taking around one year and $1M in a large pharma environment (Paul, et al., 2010). Engineering selectivity and ADMET (absorption, distribution, metabolism, excretion, and toxicity) constraints into such hits is a challenge left for hit-to-lead and lead optimization, a significantly more difficult task requiring 3.5 years and $12.5M. Even then, this process may overlook the best drug candidates, with exceptional selectivity and ADMET, which may lie in regions of chemical space that are distant from the compounds with the highest potency to the primary target.
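To make "approximately 0%" concrete, the following sketch computes the fraction of chemical space covered by a large library. The library size of 10^10 is an assumption, chosen as a generous upper bound on "a few billion" compounds. The space sizes are the 10^23 and 10^60 estimates quoted above.

```python
from fractions import Fraction

def coverage_percent(library_size, space_size):
    """Percentage of a chemical space covered by a screening library."""
    return float(Fraction(library_size, space_size) * 100)

library_size = 10**10  # assumed upper bound on "a few billion" compounds

for exponent in (23, 60):
    pct = coverage_percent(library_size, 10**exponent)
    print(f"space of 10^{exponent} molecules: {pct:.1e} % covered")
# space of 10^23 molecules: 1.0e-11 % covered
# space of 10^60 molecules: 1.0e-48 % covered
```

Even against the smallest estimate of drug-like chemical space, a ten-billion-compound library covers about one hundred-billionth of a percent.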
Generative AI promises to revolutionize hit discovery by efficiently searching over a significant fraction of the 10^60 synthesizable, drug-like molecules, and jointly optimizing for potency, selectivity, ADME, and toxicity. However, promise is not the same thing as performance, and the field is rife with unsubstantiated hype.
The agony of benchmarking generative AI for hit discovery
The synthesis and experimental testing of an AI-generated molecule with a novel chemotype requires thousands of dollars and months of effort. As a result, many groups enlist computational and medicinal chemists to choose only a handful of molecules for synthesis and experimental evaluation, out of the tens of thousands of candidates constructed by their generative AI. Zhavoronkov, et al. (2019) selected 6 compounds for synthesis out of 30,000 produced by their generative AI algorithm; Tan, et al. (2021) selected 2 out of 19,929; Yoshimori, et al. (2021) selected 9 out of 570,542; Jang, et al. (2022) selected 1 out of 10,416; Li, et al. (2022) selected 8 out of 79,323; Ren, et al. (2023) selected 7 out of 8,918; and Chenthamarakshan, et al. (2023) selected 4 out of 875,000.
It is unclear what proportion of the real work is being done by human experts in the process of selecting 0.01% of the AI-generated molecules. Indeed, the AI algorithm itself may almost exclusively produce inactive, non-selective, unsynthesizable, or non-drug-like molecules (Gao & Coley, 2020). And since only one algorithm is tested in each of these efforts, it is impossible to compare the AI algorithms to each other, or to conventional techniques like high-throughput screening.
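The 0.01% figure can be checked directly from the selection counts quoted above. The short script below computes the percentage of generated molecules that each group synthesized, plus the median and the pooled percentage. The counts come from the text, and the choice of summary statistics is ours.

```python
from statistics import median

# (molecules synthesized, molecules generated), as quoted in the text above
selections = {
    "Zhavoronkov et al. (2019)": (6, 30_000),
    "Tan et al. (2021)": (2, 19_929),
    "Yoshimori et al. (2021)": (9, 570_542),
    "Jang et al. (2022)": (1, 10_416),
    "Li et al. (2022)": (8, 79_323),
    "Ren et al. (2023)": (7, 8_918),
    "Chenthamarakshan et al. (2023)": (4, 875_000),
}

percent_selected = {
    study: 100 * chosen / generated
    for study, (chosen, generated) in selections.items()
}

for study, pct in percent_selected.items():
    print(f"{study:32s} {pct:.4f} %")

total_chosen = sum(chosen for chosen, _ in selections.values())
total_generated = sum(generated for _, generated in selections.values())
print(f"median: {median(percent_selected.values()):.4f} %")
print(f"pooled: {100 * total_chosen / total_generated:.4f} %")
```

The median study synthesized about 0.01% of its generated molecules. Pooling all seven studies gives 37 molecules out of roughly 1.6 million, or about 0.002%.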
To conduct an unbiased, statistically meaningful evaluation of generative AI methods, hundreds of molecules must be selected by each AI algorithm for a common task, without human assistance, and then subject to experimental testing. This effort would cost millions of dollars in a wet lab, and so is unlikely to ever be undertaken. If we want to probe the utility of generative AI for small molecule drug discovery, we need to construct a proxy molecular property that is analogous to experimental potency, but fast and cheap to evaluate.
The proxy property should have the same kind of relationship to molecule structure as experimental potency, so that if we can optimize the proxy property, we can have justifiable confidence in our ability to optimize experimental potency. At the same time, the proxy property need not perfectly approximate the potency for any particular protein target. Rather, it can correspond to the potency for some novel, hypothetical protein. A generative AI algorithm that can consistently optimize potency for many such hypothetical proteins should also be able to maximize experimental potency for a real protein.
Docking scores are an effective proxy for experimental potency
Molecular docking scores are a natural surrogate for experimental potencies. Docking is based upon the 3D geometry and pharmacophoric interactions between a flexible ligand and its target binding pocket in an optimized pose, the same interactions that mediate experimental potency. It is computationally non-trivial, requiring the minimization of a highly nonlinear function, and taking up to 100 seconds to compute a single score (e.g., Glide XP). The strong connection between docking scores and experimental potency is evident from their significant correlation, as shown in Figure 1.
Docking scores are almost as difficult to predict as experimental potency. Across a set of 26 kinase targets and using a temporal train/test split, the average correlation coefficient between standard QSAR models (random forests on extended connectivity fingerprints) and experimental log IC50 is 0.38, whereas the average correlation coefficient for docking scores is 0.52 when the same molecules are labeled. In contrast, physicochemical properties that are often used to benchmark molecular optimization, such as QED (the quantitative estimate of drug-likeness; Bickerton, et al., 2012), are much simpler. The same QSAR architecture has a correlation coefficient of 0.70 on QED when the same molecules are labeled.
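For readers unfamiliar with the evaluation protocol, here is a minimal sketch of a temporal train/test split followed by a correlation measurement. It rests on several assumptions: the text does not say which correlation coefficient was used, so Pearson's r is shown; the record fields (`year`, `label`, `prediction`) and the toy values are hypothetical; and the QSAR model itself is omitted, with its predictions supplied directly.

```python
from math import sqrt

def temporal_split(records, cutoff_year):
    """Train on measurements published before the cutoff, test on the rest."""
    train = [r for r in records if r["year"] < cutoff_year]
    test = [r for r in records if r["year"] >= cutoff_year]
    return train, test

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy records: a label (e.g. log IC50 or a docking score) and a model prediction.
records = [
    {"year": 2015, "label": 6.1, "prediction": 6.0},
    {"year": 2017, "label": 7.4, "prediction": 6.8},
    {"year": 2020, "label": 5.2, "prediction": 6.1},
    {"year": 2021, "label": 8.0, "prediction": 6.9},
    {"year": 2022, "label": 6.6, "prediction": 6.2},
]

train, test = temporal_split(records, cutoff_year=2020)
r = pearson([t["label"] for t in test], [t["prediction"] for t in test])
print(f"{len(train)} training records, {len(test)} test records, r = {r:.2f}")
```

A temporal split is stricter than a random split, because the test molecules tend to come from newer chemical series that the model has not seen.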
To match the sparsity pattern of experimental potency data, we replace each log IC50/Ki/Kd measurement in our experimental dataset with the corresponding docking score. Our dataset aggregates high-quality potency measurements from over 9,000 papers and 13,000 patents, and includes between 744 and 25,626 labeled compounds per target, as shown in Figure 2. This dataset recapitulates the statistical structure of experimental potency as closely as possible, while allowing the true properties of novel molecules to be evaluated quickly and inexpensively.
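The label replacement can be sketched as follows. The idea is that the proxy dataset has exactly the same (molecule, target) pairs as the experimental one, with only the values swapped. The function `toy_dock` is a hypothetical placeholder for a real docking call, and the SMILES strings and values are illustrative only.

```python
def toy_dock(smiles, target):
    """Hypothetical placeholder for a docking run (not a real scoring function)."""
    return round(0.1 * len(smiles) + 0.01 * len(target), 3)

def to_proxy_dataset(experimental, dock):
    """Replace each experimental label with a docking score for the same pair.

    Pairs that were never measured experimentally stay unlabeled, so the
    sparsity pattern of the original dataset is preserved.
    """
    return {(smiles, target): dock(smiles, target) for smiles, target in experimental}

# Sparse experimental data: not every molecule is measured on every target.
experimental = {
    ("CCO", "CDK5"): 5.3,
    ("c1ccccc1", "CDK5"): 6.8,
    ("c1ccccc1", "EGFR"): 7.1,
}

proxy = to_proxy_dataset(experimental, toy_dock)
print(proxy)
```

Because the proxy labels come from a function that can be evaluated on any molecule, the true property of a newly generated compound can be computed later rather than synthesized and assayed.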
While docking scores do not accurately account for induced fit, networks of discrete water molecules, entropy, or the nuances of quantum mechanics (Pantsar & Poso, 2018), a generative AI algorithm that cannot successfully optimize docking scores will certainly fail on experimental activity. Somewhat higher fidelity might be realized by using absolute binding free energy calculations in place of molecular docking. However, this would require hundreds of thousands of GPU-hours for each optimization task, costing hundreds of thousands of dollars on AWS or other cloud compute environments, where GPUs cost at least $2/hr.
The ecstasy: 100 AI-generated molecules are worth 1,000,000 random molecules
Using docking scores as a proxy task, we evaluate optimization on two potency and two selectivity objectives defined over three kinase targets of significant pharmacological interest. Specifically, we maximize the following objectives:
where QED is the quantitative estimate of drug-likeness (Bickerton, et al., 2012), which ensures that the optimized molecules satisfy Lipinski’s Rule of 5 and are free of structural alerts. The docking scores are computed using Gnina’s CNNaffinity, a machine learning scoring function that is calibrated to -log IC50 (McNutt, et al., 2021).
We train our generative AI algorithm, Enki, on the proxy docking score dataset, where each log IC50/Ki/Kd label in our experimental dataset is replaced with the corresponding docking score. Enki then generates 100 optimized molecules without human intervention for each objective. We evaluate the true (proxy) properties for these 100 molecules, and compute the true value of the objective. We also perform a high-throughput screen by evaluating the true objective value for ~1.3M molecules that have previously been experimentally tested for kinase activity, ~0.4M molecules that have been tested for activity for other target classes, and ~0.5M molecules from the Enamine, WuXi, Otava, and Mcule make-on-demand sets. Half of the make-on-demand molecules were constrained to have a hinge binding scaffold, which is typical of kinase inhibitors. The reported HTS library size varies across the objectives, since some molecules fail to dock for each target. The results are depicted in Figures 3 and 4.
For three of the four objectives, the best of the 100 Enki-optimized molecules is superior to any of the ~2M high-throughput screening molecules. For the CDK5 vs. EGFR objective, a high-throughput screen of ~150k molecules would be required to find a compound as good as the best Enki-optimized molecule.
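One way to turn "as good as a screen of N molecules" into a number is sketched below. This is an assumed definition, not necessarily the calculation behind the figures above. If k molecules in a library of n score at least as well as the best generated molecule, then screening the library in random order is expected to reach the first such molecule after (n + 1) / (k + 1) compounds. If k is zero, no screen of that library suffices.

```python
def equivalent_screen_size(library_scores, target_score):
    """Expected number of randomly ordered library molecules that must be
    screened to find one scoring at least `target_score` (higher is better).

    Returns None when no library molecule reaches the target score.
    """
    n = len(library_scores)
    k = sum(1 for s in library_scores if s >= target_score)
    if k == 0:
        return None
    # Expected position of the first of k "hits" in a random permutation of n items.
    return (n + 1) / (k + 1)

# Toy objective values for a screening library and for generated molecules.
library_scores = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85, 0.90]
generated_scores = [0.60, 0.88, 0.30]

best_generated = max(generated_scores)
print(equivalent_screen_size(library_scores, best_generated))  # 4.0
print(equivalent_screen_size(library_scores, 0.95))            # None
```

In the toy data, one library molecule (0.90) beats the best generated molecule (0.88), so the expected screen size is (7 + 1) / (1 + 1) = 4.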
The Enki-optimized molecules are novel and diverse, as demonstrated in Figures 5, 6, and 7. We also evaluated synthesizability by performing retrosynthetic pathway prediction using Molecule.one. The distribution of the predicted number of synthetic steps is shown in Figure 8. For all four tasks, over 90% of the Enki-optimized molecules were predicted to be synthesizable in fewer than ten steps.
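Novelty and diversity are commonly quantified with Tanimoto similarity between molecular fingerprints. The sketch below shows one such formulation, and it is an assumption that the figures use anything similar. Fingerprints are represented as sets of integer feature identifiers. In practice they would come from a cheminformatics toolkit, whereas the sets here are toy values.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprint feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def novelty(generated, training):
    """Per-molecule novelty: 1 minus the similarity to the nearest training molecule."""
    return [1.0 - max(tanimoto(g, t) for t in training) for g in generated]

def diversity(generated):
    """Mean pairwise Tanimoto distance within the generated set."""
    pairs = list(combinations(generated, 2))
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)

# Toy fingerprints (sets of hypothetical feature ids).
training = [{1, 2, 3, 4}, {2, 3, 5, 8}]
generated = [{1, 2, 3, 9}, {10, 11, 12}, {2, 5, 8, 13}]

print([round(v, 2) for v in novelty(generated, training)])
print(round(diversity(generated), 2))
```

A novelty near 1 means a generated molecule shares few features with anything in the training data, and a diversity near 1 means the generated molecules share few features with each other.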
Finally, we compare Enki-optimized molecules to those produced by state-of-the-art molecular optimization algorithms. Recent benchmarking efforts have found that REINVENT (Olivecrona, et al., 2017; Loeffler, et al., 2024) and graph genetic algorithms (Graph GA; Jensen, 2019) remain the most powerful algorithms for optimizing pharmacological properties over chemical space (Gao, et al., 2022; Nigam, et al., 2024). To adapt REINVENT and Graph GA to the real-world hit discovery setting, where data is available on previously investigated compounds but only a single round of novel molecules can be tested experimentally, we equipped them with a QSAR model consisting of a random forest regressor operating on extended connectivity fingerprints. This architecture continues to achieve state-of-the-art performance for small molecule potency prediction (Cichońska, et al., 2021; Huang, et al., 2021; Luukkonen, et al., 2023; Stanley, et al., 2021; van Tilborg, et al., 2022). As Figure 9 shows, when each algorithm was used to generate 100 optimized molecules for each task, Enki produced superior molecules, as measured both by the mean over all 100 molecules and by the best molecules alone.
Conclusion
Generative AI has been extolled as a solution to small-molecule hit discovery and lead optimization, but unbiased evaluation is impractically expensive. To facilitate a fair assessment, we define a benchmark task that uses molecular docking as a proxy for, rather than an approximation to, experimental potency. Data is provided for only those ligand-target pairs for which experimental potencies are available, and only a single round of molecule generation is allowed, as in conventional wet lab hit discovery. We show that 100 molecules designed by Enki, our generative AI algorithm, are superior to a high-throughput screen of 1,000,000 molecules, and outperform the previous state-of-the-art molecular optimization algorithms. In addition, Enki-optimized molecules are novel, diverse, and synthesizable.
References
Ackloo, S., Al-Awar, R., Amaro, R. E., Arrowsmith, C. H., Azevedo, H., Batey, R. A., ... & Willson, T. M. (2022). CACHE (Critical Assessment of Computational Hit-finding Experiments): a public–private partnership benchmarking initiative to enable the development of computational methods for hit-finding. Nature Reviews Chemistry, 6(4), 287-295.
Aldrich, C., Bertozzi, C., Georg, G. I., Kiessling, L., Lindsley, C., Liotta, D., ... & Wang, S. (2017). The ecstasy and agony of assay interference compounds. ACS Chemical Neuroscience, 8(3), 420-423.
Bender, B. J., Gahbauer, S., Luttens, A., Lyu, J., Webb, C. M., Stein, R. M., ... & Shoichet, B. K. (2021). A practical guide to large-scale docking. Nature protocols, 16(10), 4799-4832.
Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S., & Hopkins, A. L. (2012). Quantifying the chemical beauty of drugs. Nature chemistry, 4(2), 90-98.
Bohacek, R. S., McMartin, C., & Guida, W. C. (1996). The art and practice of structure‐based drug design: a molecular modeling perspective. Medicinal research reviews, 16(1), 3-50.
Brown, D. G. (2023). An analysis of successful hit-to-clinical candidate pairs. Journal of medicinal chemistry, 66(11), 7101-7139.
Chenthamarakshan, V., Hoffman, S. C., Owen, C. D., Lukacik, P., Strain-Damerell, C., Fearon, D., ... & Das, P. (2023). Accelerating drug target inhibitor discovery with a deep generative foundation model. Science Advances, 9(25), eadg7865.
Cichońska, A., Ravikumar, B., Allaway, R. J., Wan, F., Park, S., Isayev, O., ... & Challenge organizers. (2021). Crowdsourced mapping of unexplored target space of kinase inhibitors. Nature communications, 12(1), 3307.
Damm-Ganamet, K. L., Arora, N., Becart, S., Edwards, J. P., Lebsack, A. D., McAllister, H. M., ... & Mirzadegan, T. (2019). Accelerating lead identification by high Throughput virtual screening: prospective case studies from the pharmaceutical industry. Journal of Chemical Information and Modeling, 59(5), 2046-2062.
DiMasi, J. A., Grabowski, H. G., & Hansen, R. W. (2016). Innovation in the pharmaceutical industry: new estimates of R&D costs. Journal of health economics, 47, 20-33.
Ertl, P. (2003). Cheminformatics analysis of organic substituents: identification of the most common substituents, calculation of substituent properties, and automatic identification of drug-like bioisosteric groups. Journal of chemical information and computer sciences, 43(2), 374-380.
Gao, W., & Coley, C. W. (2020). The synthesizability of molecules proposed by generative models. Journal of chemical information and modeling, 60(12), 5714-5723.
Gao, W., Fu, T., Sun, J., & Coley, C. (2022). Sample efficiency matters: a benchmark for practical molecular optimization. Advances in neural information processing systems, 35, 21342-21357.
Gentile, F., Agrawal, V., Hsing, M., Ton, A. T., Ban, F., Norinder, U., ... & Cherkasov, A. (2020). Deep docking: a deep learning platform for augmentation of structure based drug discovery. ACS central science, 6(6), 939-949.
Gentile, F., Oprea, T. I., Tropsha, A., & Cherkasov, A. (2023). Surely you are joking, Mr Docking!. Chemical Society Reviews, 52(3), 872-878.
Gorgulla, C., Boeszoermenyi, A., Wang, Z. F., Fischer, P. D., Coote, P. W., Padmanabha Das, K. M., ... & Arthanari, H. (2020). An open-source drug discovery platform enables ultra-large virtual screens. Nature, 580(7805), 663-668.
Huang, K., Fu, T., Gao, W., Zhao, Y., Roohani, Y., Leskovec, J., ... & Zitnik, M. (2021). Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development. arXiv preprint arXiv:2102.09548.
Hutson, M. (2024). How AI is being used to accelerate clinical trials. Nature, 627(8003), S2-S5.
Jacoby, E., Schuffenhauer, A., Popov, M., Azzaoui, K., Havill, B., Schopfer, U., ... & Roth, H. J. (2005). Key aspects of the Novartis compound collection enhancement project for the compilation of a comprehensive chemogenomics drug discovery screening collection. Current topics in medicinal chemistry, 5(4), 397-411.
Jang, S. H., Sivakumar, D., Mudedla, S. K., Choi, J., Lee, S., Jeon, M., ... & Wu, S. (2022). PCW-A1001, AI-assisted de novo design approach to design a selective inhibitor for FLT-3 (D835Y) in acute myeloid leukemia. Frontiers in Molecular Biosciences, 9, 1072028.
Jensen, J. H. (2019). A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space. Chemical science, 10(12), 3567-3572.
Li, Y., Zhang, L., Wang, Y., Zou, J., Yang, R., Luo, X., ... & Yang, S. (2022). Generative deep learning enables the discovery of a potent and selective RIPK1 inhibitor. Nature Communications, 13(1), 6891.
Lloyd, M. D. (2020). High-throughput screening for the discovery of enzyme inhibitors. Journal of Medicinal Chemistry, 63(19), 10742-10772.
Loeffler, H. H., He, J., Tibo, A., Janet, J. P., Voronov, A., Mervin, L. H., & Engkvist, O. (2024). Reinvent 4: Modern AI–driven generative molecule design. Journal of Cheminformatics, 16(1), 20.
Luukkonen, S., Meijer, E., Tricarico, G. A., Hofmans, J., Stouten, P. F., van Westen, G. J., & Lenselink, E. B. (2023). Large-scale modeling of sparse protein kinase activity data. Journal of Chemical Information and Modeling, 63(12), 3688-3696.
Lyu, J., Irwin, J. J., & Shoichet, B. K. (2023). Modeling the expansion of virtual screening libraries. Nature Chemical Biology, 19(6), 712-718.
Macip, G., Garcia-Segura, P., Mestres-Truyol, J., Saldivar-Espinoza, B., Pujadas, G., & Garcia-Vallvé, S. (2021). A review of the current landscape of SARS-CoV-2 main protease inhibitors: Have we hit the bullseye yet?. International journal of molecular sciences, 23(1), 259.
McNutt, A. T., Francoeur, P., Aggarwal, R., Masuda, T., Meli, R., Ragoza, M., ... & Koes, D. R. (2021). GNINA 1.0: molecular docking with deep learning. Journal of cheminformatics, 13(1), 43.
Nigam, A., Pollice, R., Tom, G., Jorner, K., Willes, J., Thiede, L., ... & Aspuru-Guzik, A. (2024). Tartarus: A benchmarking platform for realistic and practical inverse molecular design. Advances in Neural Information Processing Systems, 36.
Olivecrona, M., Blaschke, T., Engkvist, O., & Chen, H. (2017). Molecular de-novo design through deep reinforcement learning. Journal of cheminformatics, 9, 1-14.
Pantsar, T., & Poso, A. (2018). Binding affinity via docking: fact and fiction. Molecules, 23(8), 1899.
Paul, S. M., Mytelka, D. S., Dunwiddie, C. T., Persinger, C. C., Munos, B. H., Lindborg, S. R., & Schacht, A. L. (2010). How to improve R&D productivity: the pharmaceutical industry's grand challenge. Nature reviews Drug discovery, 9(3), 203-214.
Peterson, A. A., & Liu, D. R. (2023). Small-molecule discovery through DNA-encoded libraries. Nature Reviews Drug Discovery, 22(9), 699-722.
Polishchuk, P. G., Madzhidov, T. I., & Varnek, A. (2013). Estimation of the size of drug-like chemical space based on GDB-17 data. Journal of computer-aided molecular design, 27, 675-679.
Ren, F., Ding, X., Zheng, M., Korzinkin, M., Cai, X., Zhu, W., ... & Zhavoronkov, A. (2023). AlphaFold accelerates artificial intelligence powered drug discovery: efficient discovery of a novel CDK20 small molecule inhibitor. Chemical Science, 14(6), 1443-1452.
Savage, N. (2023). Drug discovery companies are customizing ChatGPT: here’s how. Nature Biotechnology, 41(5), 585-586.
Schuffenhauer, A., Ruedisser, S., Marzinzik, A., Jahnke, W., Selzer, P., & Jacoby, E. (2005). Library design for fragment based screening. Current topics in medicinal chemistry, 5(8), 751-762.
Shun, T. Y., Lazo, J. S., Sharlow, E. R., & Johnston, P. A. (2011). Identifying actives from HTS data sets: practical approaches for the selection of an appropriate HTS data-processing method and quality control review. Journal of Biomolecular Screening, 16(1), 1-14.
Slater, O., & Kontoyianni, M. (2019). The compromise of virtual screening and its impact on drug discovery. Expert opinion on drug discovery, 14(7), 619-637.
Stanley, M., Bronskill, J. F., Maziarz, K., Misztela, H., Lanini, J., Segler, M., ... & Brockschmidt, M. (2021, August). Fs-mol: A few-shot learning dataset of molecules. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
Tan, X., Li, C., Yang, R., Zhao, S., Li, F., Li, X., ... & Zheng, M. (2021). Discovery of pyrazolo [3, 4-d] pyridazinone derivatives as selective DDR1 inhibitors via deep learning based design, synthesis, and biological evaluation. Journal of Medicinal Chemistry, 65(1), 103-119.
van Tilborg, D., Alenicheva, A., & Grisoni, F. (2022). Exposing the limitations of molecular machine learning with activity cliffs. Journal of Chemical Information and Modeling, 62(23), 5938-5951.
Wallach, I. & The Atomwise AIMS Program. (2024). AI is a viable alternative to high throughput screening: a 318-target study. Scientific Reports, 14, 7526.
Yoshimori, A., Asawa, Y., Kawasaki, E., Tasaka, T., Matsuda, S., Sekikawa, T., ... & Kanai, C. (2021). Design and synthesis of DDR1 inhibitors with a desired pharmacophore using deep generative models. ChemMedChem, 16(6), 955-958.
Zhavoronkov, A., Ivanenkov, Y. A., Aliper, A., Veselov, M. S., Aladinskiy, V. A., Aladinskaya, A. V., ... & Aspuru-Guzik, A. (2019). Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nature biotechnology, 37(9), 1038-1040.
Zhu, T., Cao, S., Su, P. C., Patel, R., Shah, D., Chokshi, H. B., ... & Hevener, K. E. (2013). Hit identification and optimization in virtual screening: Practical recommendations based on a critical literature analysis: Miniperspective. Journal of medicinal chemistry, 56(17), 6560-6572.
Zhu, H., Zhang, Y., Li, W., & Huang, N. (2022). A comprehensive survey of prospective structure-based virtual screening for early drug discovery in the past fifteen years. International Journal of Molecular Sciences, 23(24), 15961.
Thanks for the literature overview!
As for the comparison to other methods: in my mind, a transfer learning scenario would be more applicable than using a QSAR model. At least REINVENT uses iterative optimization (reinforcement learning) by default, in contrast to pre-training (Enki), so generating a mere 100 compounds will (almost) sample from the initial prior (or did you perform some sort of "warm-up" phase?). It is therefore no surprise that you don't get any enrichment. Lastly, the choice of 100 generated compounds seems extremely low: in a real-world application, when you optimize 10+ scoring components at the same time rather than a single objective, it's quite likely that only a few of your 100 molecules will strike a balance that is appealing to a MedChemist.