Foundation Models for Discovery and Exploration in Chemical Space
Alexius Wadell, Anoushka Bhutani, Victor Azumah, Austin R. Ellis-Mohr, Andrew J. Stier, Kareem Hegazy, Alexander Brace, Hancheng Zhao, Celia Kelly, Anuj K. Nayak, Yuhan Chen, Dimitrios Simatos, Hongyi Lin, Murali Emani, Venkatram Vishwanath, Kevin Gering, Melisa Alkan, Tom Gibbs, Jack Wells, Wesley W. Qian, Richard C. Gerkin, Benjamin Amorelli, Alexander B. Wiltschko, Lav R. Varshney, Bharath Ramsundar, Karthik Duraisamy, Michael W. Mahoney, Arvind Ramanathan, Venkatasubramanian Viswanathan
Why It Matters
What makes this one worth your time
This work is relevant for AI researchers and engineers interested in materials discovery and optimization, offering scalable models that can predict complex chemical properties efficiently.
MIST models match or exceed state-of-the-art performance across diverse chemistry benchmarks, and the accompanying scaling laws make compute-optimal training feasible on a limited budget.
Summary
The paper introduces MIST, a family of molecular foundation models trained on large unlabelled datasets using a novel tokenizer, Smirk. The models have been fine-tuned to predict more than 400 structure-property relationships and match or exceed state-of-the-art performance across benchmarks ranging from physiology to electrochemistry. The authors apply them to real-world problems such as multiobjective electrolyte solvent screening, stereochemical reasoning for organometallics, mixture property prediction, and olfactory perception mapping. The paper also formulates hyperparameter-aware Bayesian neural scaling laws, which remove the need for hyperparameter sweeps at every scale and make training large compute-optimal models feasible on a limited compute budget.
Key contributions
- Development of MIST, a family of molecular foundation models with up to an order of magnitude more parameters and data than prior work.
- Introduction of the Smirk tokenizer, which captures nuclear, electronic, and geometric information.
- Formulation of hyperparameter-aware Bayesian neural scaling laws for compute-optimal training on a limited budget.
Notable insights
- Smirk, the novel tokenizer, captures nuclear, electronic, and geometric information about each molecule.
- The hyperparameter-aware Bayesian neural scaling laws eliminate the need for hyperparameter sweeps at every scale.
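To make the idea of a neural scaling law concrete, here is a minimal sketch of the simplest version: fitting a power law to loss measured at a few model sizes, then extrapolating to a larger one. This is not the paper's method. The abstract does not give the functional form, and this sketch is neither Bayesian nor hyperparameter-aware. The form `loss = a * size**(-b)`, the function names, and the data points are all assumptions made up for illustration.

```python
import numpy as np


def fit_power_law(sizes, losses):
    """Fit loss ~ a * size**(-b) by least squares in log-log space.

    Assumed functional form for illustration only; not taken from the paper.
    """
    # log(loss) = log(a) - b * log(size), so a straight-line fit recovers both.
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    return float(np.exp(intercept)), float(-slope)


def predict_loss(size, a, b):
    """Evaluate the fitted power law at a new model size."""
    return a * size ** (-b)


# Synthetic, noise-free points generated from a known law (a=5.0, b=0.1).
# These are made-up numbers, not results from MIST.
sizes = np.array([1e6, 1e7, 1e8, 1e9])
losses = 5.0 * sizes ** (-0.1)

a_fit, b_fit = fit_power_law(sizes, losses)
extrapolated = predict_loss(1e10, a_fit, b_fit)
print(f"a={a_fit:.3f}, b={b_fit:.3f}, predicted loss at 1e10: {extrapolated:.4f}")
```

A fit like this treats the best hyperparameters at each size as already known, which is what normally forces a sweep at every scale. As the abstract describes it, the paper's formulation is aimed at removing that requirement.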
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2510.18900v2
Accurate prediction of atomistic, thermodynamic, and kinetic properties from molecular structures underpins materials innovation. Existing computational and experimental approaches lack the scalability required to navigate chemical space efficiently. Scientific foundation models trained on large unlabelled datasets offer a path towards navigating chemical space across application domains. Here, we develop MIST, a family of molecular foundation models with up to an order of magnitude more parameters and data than prior works. Trained using a novel tokenizer, Smirk, which comprehensively captures nuclear, electronic, and geometric information, MIST learns a diverse range of molecules. MIST models have been fine-tuned to predict more than 400 structure-property relationships and have been shown to match or exceed state-of-the-art performance across diverse benchmarks, from physiology to electrochemistry. We demonstrate the ability of these models to solve real-world problems across chemical space from multiobjective electrolyte solvent screening to stereochemical reasoning for organometallics and mixture property prediction. The clearest demonstration of a foundation model is its ability to solve problems that were neither explicit targets of training nor central to the intentions of its developers. We identify olfactory perception mapping as such a problem, and show that MIST accurately predicted scent profiles and learned a hierarchical representation of olfactory space consistent with hyperbolic geometry. We formulated hyperparameter aware Bayesian neural scaling laws which eliminate the need for hyperparameter sweeps at every scale, making training large compute-optimal models feasible on a limited compute budget. The methods and findings presented here represent a significant step towards accelerating materials discovery, design, and optimization using foundation models.