Let’s train a single large language model on as many different chemistry tasks as possible.
Let’s imagine a few tasks:
Property prediction
For example, tasks could be:
- “The logP of C1=CC=NC=C1 is “ -> “0.65”
- “The boiling point of CC(C)Cc1ccc(C(C)C(=O)O)cc1 at 1 atm is” -> “157 °C”
- Any/all other physical properties (molecular weight, melting point, pKa, NMR shifts, refractive index, …)
There are many sources of data for these tasks, for example Reaxys and PubChem.
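A minimal sketch of building such (prompt, completion) pairs, assuming the format above. RDKit’s computed descriptors (Crippen logP, molecular weight) stand in for real measured values from Reaxys or PubChem, and the SMILES list is illustrative:

# A sketch only: computed descriptors stand in for measured data, and the
# prompt/completion format mirrors the examples above.
from rdkit import Chem
from rdkit.Chem import Descriptors

smiles_list = ["C1=CC=NC=C1", "CC(C)Cc1ccc(C(C)C(=O)O)cc1"]  # illustrative molecules

def property_pairs(smiles_list):
    pairs = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparsable SMILES
        # Placeholder labels; a real corpus would use experimental values.
        pairs.append((f"The logP of {smi} is ", f"{Descriptors.MolLogP(mol):.2f}"))
        pairs.append((f"The molecular weight of {smi} is ", f"{Descriptors.MolWt(mol):.1f} g/mol"))
    return pairs

for prompt, completion in property_pairs(smiles_list):
    print(prompt + completion)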
(Graph) structure elucidation
We could prompt the LLM with:
- Prompt: “Molecular formula: C4H6O4. Observed 13C NMR peaks: 28.9, 173.9. Elucidated structure: __________“
- Completion: “C(CC(=O)O)C(=O)O”
Or:
- “Molecular formula: C21H22N2O2. Observed 13C NMR peaks: 1.22, 1.41, 1.83, 1.84, … Elucidated structure: __________“
- Completion: “O=C1CC2OCC=C3C4C2C2N1c1ccccc1C12CCN(C1C4)C3”
We could use data from NMRShiftDB (Kuhn & Schlörer, 2015) for this task.
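As a minimal sketch, one such example could be serialised into the prompt/completion format above like this; the record layout is a hypothetical stand-in for whatever a parsed NMRShiftDB export provides:

# Hypothetical record: a real pipeline would parse NMRShiftDB's exported
# structures and their assigned 13C shifts.
record = {
    "formula": "C4H6O4",
    "c13_shifts": [28.9, 173.9],
    "smiles": "C(CC(=O)O)C(=O)O",  # succinic acid
}

def elucidation_pair(record):
    peaks = ", ".join(f"{shift:.1f}" for shift in sorted(record["c13_shifts"]))
    prompt = (f"Molecular formula: {record['formula']}. "
              f"Observed 13C NMR peaks: {peaks}. Elucidated structure: ")
    return prompt, record["smiles"]

prompt, completion = elucidation_pair(record)
print(prompt + completion)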
Translation from natural language to synthetic ‘program’
The input prompt could be:
- Prompt: “Text: A vial was charged with LiCl (0.32 g) in anhydrous THF (15 mL). Program: __________“
- Completion: “<AddSolid vessel="reactor" reagent="LiCl" mass="0.32 g" /><Add vessel="reactor" reagent="THF" volume="15 mL" />”
A dataset for this could be sourced from OrgSyn or Pistachio_2017.
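A minimal sketch of one such training pair; the tag and attribute names simply mirror the example above (they are assumptions, not an established schema), and wrapping the target in a root element lets us sanity-check that it is well-formed XML:

import xml.etree.ElementTree as ET

text = "A vial was charged with LiCl (0.32 g) in anhydrous THF (15 mL)."
program = ('<AddSolid vessel="reactor" reagent="LiCl" mass="0.32 g"/>'
           '<Add vessel="reactor" reagent="THF" volume="15 mL"/>')

prompt = f"Text: {text} Program: "
completion = program

# Sanity check: the target program should parse as well-formed XML.
root = ET.fromstring(f"<Program>{program}</Program>")
print(prompt + completion)
print([action.tag for action in root])  # ['AddSolid', 'Add']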
What are other tasks we could include?
- reactivity?
- drug design?
- retrosynthesis?
All of chemistry in an LLM!
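As a minimal sketch of what “all of chemistry in an LLM” could mean in practice, the task-specific (prompt, completion) pairs above could be pooled into one multi-task fine-tuning corpus; the JSONL file name and the shuffle-to-interleave strategy are illustrative assumptions:

import json
import random

# Example pairs taken from this document; real lists would hold many more examples.
property_task = [("The logP of C1=CC=NC=C1 is ", "0.65")]
elucidation_task = [("Molecular formula: C4H6O4. Observed 13C NMR peaks: 28.9, 173.9. "
                     "Elucidated structure: ", "C(CC(=O)O)C(=O)O")]
program_task = [("Text: A vial was charged with LiCl (0.32 g) in anhydrous THF (15 mL). "
                 "Program: ",
                 '<AddSolid vessel="reactor" reagent="LiCl" mass="0.32 g"/>'
                 '<Add vessel="reactor" reagent="THF" volume="15 mL"/>')]

examples = property_task + elucidation_task + program_task
random.shuffle(examples)  # interleave tasks so every training batch mixes them

with open("chemistry_multitask.jsonl", "w") as f:
    for prompt, completion in examples:
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")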
Future work could investigate how to include other types of chemical data. For example:
- QM9 (3D positions, electron densities)
- Open Catalyst Project (trajectories, forces and energies)
- Theoretical knowledge, like the dataset used here, or past exams?
Motivation / background
Recently there have been two important observations in AI:
- Deep learning works when done at scale. The bigger the better.
- The ability of strings to represent many different kinds of information allows great flexibility in which tasks an LLM can be trained to do.
Deep learning at scale
- More parameters
- Gopher (Rae et al., 2021). 280B parameters
- PaLM (Chowdhery et al., 2022). 540B parameters
- Switch Transformers (Fedus et al., 2021). 1T parameters
- More data
- MassiveText (Rae et al., 2021). 2 trillion tokens
- LAION-400M (Schuhmann et al., 2021). 400 million image-text pairs
- More tasks
- A generalist agent, GATO (Reed et al., 2022). Trained on:
- Simulated control tasks (596 tasks) (DM Control, DM lab, Procgen, Atari ALE, playroom, … and more)
- Vision and language (>204 tasks) (MassiveText, MultiModal MassiveWeb, LTIP, OKVQA, … and more)
The flexibility of strings and power of LLMs
A language model is a ‘model’ of language. It models language as a prediction problem: given some context, predict a distribution over the likely next token (a small sketch follows the examples below).
- “The cat in the _” -> “hat” (probably)
- “Monday, Tuesday, Wednesday, _” -> “Thursday” (probably)
- “The most populous city in India is _” -> “Mumbai” (probably)
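A small sketch of that prediction problem, assuming Hugging Face transformers and GPT-2 (chosen purely for convenience): it prints the model’s top five candidates for the token that follows one of the contexts above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = "Monday, Tuesday, Wednesday,"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence, vocab)

# Distribution over the next token, given the context.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for p, i in zip(top_probs, top_ids):
    print(f"{tokenizer.decode([int(i)])!r}  {float(p):.3f}")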
Given a language model, we can frame many NLP tasks as predicting the next token, given a prompt. This led some to claim “Language Model is All You Need” (for NLP tasks) (Namazifar et al., 2020).
For example, narrative understanding, textual entailment, entity resolution, question answering, POS tagging, grammatical parsing… can all be framed as predicting the next token, and thus can be done with an LLM (see the sketch after the examples below).
- textual entailment: “text: If you help the needy, God will reward you. hypothesis: Giving money to a poor man has good consequences.” -> “positive” (text entails hypothesis)
- POS tagging: “text: Bob made a book collector happy.” -> “subject verb object(article adjective noun) verb-modifier”
- sentiment analysis: “text: I love this movie!” -> “positive”
- question answering: “text: The capital of France is _” -> “Paris”
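A small sketch of this framing, again assuming transformers and GPT-2 as a stand-in (a far larger model would be needed for reliable answers): the sentiment-analysis example above is written as a prompt and the model’s continuation is read off as the answer.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "text: I love this movie! sentiment:"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=2, do_sample=False,
                            pad_token_id=tokenizer.eos_token_id)

# Decode only the newly generated tokens (the answer).
answer = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:])
print(answer.strip())  # ideally "positive"; a small model may well get this wrong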
Aside: more fun tasks LMs can do
Big-Bench (Srivastava et al., 2023)
Analyses an LM’s ability to do 204 different tasks, including:
- auto_debugging
- “‘\nfor i in range(10):\n\ti’ What is the value of i the third time line 2 is executed?” -> “2”
- color matching
- “What is the color most closely matching this RGB representation: rgb(128, 2, 198)?” -> “purple”
- chess_state_tracking_legal_moves
- “e2e4 g7g6 d2d4 f8g7 c1e3 g8f6 f2f3 d7d6 d1” -> “c1, d2, d3, e2”
Others
- ASCII MNIST, solve riddles, play Sudoku, translate Hindi proverbs, identify math theorems, vitamin C fact verification, etc.
Conclusion
Molecules and analytical data can be represented as strings. This allows us to use LLMs to do chemistry tasks. But how can we cram enough knowledge into an LLM to make it useful for chemistry? Let’s train a single large language model on as many different chemistry tasks as possible.
Bibliography
- Kuhn, S., & Schlörer, N. E. (2015). Facilitating quality control for spectra assignments of small organic molecules: nmrshiftdb2 – a free in‐house NMR database with integrated LIMS for academic service laboratories. Magnetic Resonance in Chemistry, 53, 582–589. https://api.semanticscholar.org/CorpusID:2571037
- Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., et al. (2021). Scaling Language Models: Methods, Analysis & Insights from Training Gopher. ArXiv, abs/2112.11446. https://api.semanticscholar.org/CorpusID:245353475
- Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N. M., Prabhakaran, V., … Fiedel, N. (2022). PaLM: Scaling Language Modeling with Pathways. ArXiv, abs/2204.02311. https://api.semanticscholar.org/CorpusID:247951931
- Fedus, W., Zoph, B., & Shazeer, N. M. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. ArXiv, abs/2101.03961. https://api.semanticscholar.org/CorpusID:231573431
- Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., & Komatsuzaki, A. (2021). LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs. https://arxiv.org/abs/2111.02114
- Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., Gimenez, M., Sulsky, Y., Kay, J., Springenberg, J. T., Eccles, T., Bruce, J., Razavi, A., Edwards, A., Heess, N., Chen, Y., Hadsell, R., Vinyals, O., Bordbar, M., & de Freitas, N. (2022). A Generalist Agent. https://arxiv.org/abs/2205.06175
- Namazifar, M., Papangelis, A., Tur, G., & Hakkani-Tür, D. (2020). Language Model is All You Need: Natural Language Understanding as Question Answering. https://arxiv.org/abs/2011.03023
- Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., Kluska, A., Lewkowycz, A., Agarwal, A., Power, A., Ray, A., Warstadt, A., Kocurek, A. W., Safaya, A., Tazarv, A., … Wu, Z. (2023). Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. https://arxiv.org/abs/2206.04615