Let’s train a single large language model on as many different chemistry tasks as possible.

This post was adapted from a presentation.

Let’s imagine a few tasks:

Property prediction

For example, tasks could be:

  • “The logP of C1=CC=NC=C1 is “ -> “0.65”
  • “The boiling point of CC(C)Cc1ccc(C(C)C(=O)O)cc1 at 1atm is” -> “157 C”
  • any/all physical properties (molecular weight, melting point, pKa, NMR shifts, refractive index, …)

There are many sources of data for these tasks, for example Reaxys and PubChem.
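
As a concrete sketch of how such prompt/completion pairs might be assembled (a minimal example assuming a list of SMILES strings; RDKit-computed logP and molecular weight stand in for the experimental values that would come from Reaxys or PubChem):

```python
# Sketch: generate property-prediction training pairs from SMILES strings.
# RDKit-computed values stand in for experimental data from Reaxys/PubChem.
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors

smiles_list = ["C1=CC=NC=C1", "CC(C)Cc1ccc(C(C)C(=O)O)cc1"]

examples = []
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        continue  # skip unparsable SMILES
    examples.append((f"The logP of {smi} is ", f"{Crippen.MolLogP(mol):.2f}"))
    examples.append((f"The molecular weight of {smi} is ",
                     f"{Descriptors.MolWt(mol):.1f} g/mol"))

for prompt, completion in examples:
    print(repr(prompt), "->", repr(completion))
```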

(Graph) structure elucidation

We could prompt the LLM with:

  • Prompt: “Molecular formula: C4H6O4. Observed 13C NMR peaks: 28.9, 173.9. Elucidated structure: __________“
  • Completion: “C(CC(=O)O)C(=O)O”

Or:

  • “Molecular formula: C21H22N2O2. Observed 13C NMR peaks: 1.22, 1.41, 1.83, 1.84, … Elucidated structure: __________“
  • Completion: “O=C1CC2OCC=C3C4C2C2N1c1ccccc1C12CCN(C1C4)C3”

We could use data from NMRShiftDB (Kuhn & Schlörer, 2015) for this task.
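
A minimal sketch of how elucidation pairs might be formatted, assuming (13C shifts, SMILES) records have already been parsed out of nmrshiftdb2; the record below is a hand-written stand-in, and the molecular formula is derived with RDKit:

```python
# Sketch: format structure-elucidation training pairs.
# The record is a hypothetical stand-in for a parsed nmrshiftdb2 entry.
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

records = [
    {"c13_shifts": [28.9, 173.9], "smiles": "C(CC(=O)O)C(=O)O"},  # succinic acid
]

for rec in records:
    mol = Chem.MolFromSmiles(rec["smiles"])
    formula = rdMolDescriptors.CalcMolFormula(mol)
    peaks = ", ".join(f"{s:.1f}" for s in rec["c13_shifts"])
    prompt = (f"Molecular formula: {formula}. "
              f"Observed 13C NMR peaks: {peaks}. Elucidated structure: ")
    completion = rec["smiles"]
    print(repr(prompt), "->", repr(completion))
```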

Translation from natural language to synthetic ‘program’

The input prompt could be:

  • Prompt: “Text: A vial was charged with LiCl (0.32 g) in anhydrous THF (15 mL). Program: __________“
  • Completion: “<AddSolid vessel="reactor" reagent="LiCl" mass="0.32 g" /><Add vessel="reactor" reagent="THF" volume="15 mL" />”

A dataset for this could be sourced from OrgSyn or Pistachio_2017.
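
A sketch of how a (procedure text, action program) pair might be constructed; the <AddSolid>/<Add> tags follow an XDL-like schema and are purely illustrative, as is the hand-written mapping from the sentence to action records:

```python
# Sketch: build a (procedure text, action program) training pair.
# The tag names and attributes are illustrative, not a fixed schema.
import xml.etree.ElementTree as ET

text = "A vial was charged with LiCl (0.32 g) in anhydrous THF (15 mL)."

actions = [
    ("AddSolid", {"vessel": "reactor", "reagent": "LiCl", "mass": "0.32 g"}),
    ("Add", {"vessel": "reactor", "reagent": "THF", "volume": "15 mL"}),
]

# Serialise the action records into the string the LLM should produce.
program = "".join(
    ET.tostring(ET.Element(tag, attrib), encoding="unicode")
    for tag, attrib in actions
)

prompt = f"Text: {text} Program: "
completion = program
print(repr(prompt), "->", repr(completion))
```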


What are other tasks we could include?

  • reactivity?
  • drug design?
  • retrosynthesis?

All of chemistry in an LLM!

Future work could investigate how to include other types of chemical data (a toy serialisation is sketched after this list). For example:

  • QM9 (3D positions, electron densities)
  • Open Catalyst Project (trajectories, forces, and energies)
  • Theoretical knowledge, like the dataset used here, or past exams?
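
For instance, a QM9-style record (a 3D geometry plus a target property) could be flattened into a plain-text pair; the geometry and energy below are illustrative placeholders, not values taken from QM9:

```python
# Sketch: flatten a QM9-style record (3D geometry + target property) into text.
# The geometry and energy are illustrative placeholders, not QM9 values.
atoms = [  # (element, x, y, z) in Angstrom -- a rough water geometry
    ("O", 0.000, 0.000, 0.117),
    ("H", 0.000, 0.757, -0.470),
    ("H", 0.000, -0.757, -0.470),
]
internal_energy = -76.4  # placeholder value in Hartree

xyz = " ".join(f"{el} {x:.3f} {y:.3f} {z:.3f}" for el, x, y, z in atoms)
prompt = f"Geometry: {xyz}. Internal energy (Hartree): "
completion = f"{internal_energy:.1f}"
print(repr(prompt), "->", repr(completion))
```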

Motivation / background

Recently there have been two important observations in AI:

  1. Deep learning works when done at scale. The bigger the better.
  2. The ability of strings to represent many different kinds of information allows great flexibility in which tasks an LLM can be trained to do.

Deep learning at scale

The “bigger is better” trend is visible across recent large models: Gopher (Rae et al., 2021), PaLM (Chowdhery et al., 2022), and Switch Transformers (Fedus et al., 2021) show that scaling parameters, data, and compute keeps improving performance, web-scale datasets such as LAION-400M (Schuhmann et al., 2021) supply the raw material, and Gato (Reed et al., 2022) shows that a single scaled-up model can handle many tasks at once.

The flexibility of strings and power of LLMs

A language model is a ‘model’ of language. It models language as a prediction problem: given context, predict a distribution over the likely next token (a short code sketch follows the examples below).

  • “The cat in the _” -> “hat” (probably)
  • “Monday, Tuesday, Wednesday, _” -> “Thursday” (probably)
  • “The most populous city in India is _” -> “Mumbai” (probably)
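
A minimal sketch of what this looks like in code, using a small off-the-shelf GPT-2 through the Hugging Face transformers library (any causal LM would do; this is not the chemistry model discussed here):

```python
# Sketch: inspect the next-token distribution of an off-the-shelf causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Monday, Tuesday, Wednesday,"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits           # (batch, seq_len, vocab_size)

probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over the next token
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(idx)])!r}: {p.item():.3f}")
```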

Given a language model, we can frame many NLP tasks as predicting the next token, given a prompt. This led some to claim “Language models are all you need” (for NLP tasks) (Namazifar et al., 2020).

For example, narrative understanding, textual entailment, entity resolution, question answering, POS tagging, grammatical parsing, … can all be framed as predicting the next token, and thus can be done with an LLM (a small scoring sketch follows the examples below).

  • textual entailment: “text: If you help the needy, God will reward you. hypothesis: Giving money to a poor man has good consequences.” -> “positive” (text entails hypothesis)
  • POS tagging: “text: Bob made a book collector happy.” -> “subject verb object(article adjective noun) verb-modifier”
  • sentiment analysis: “text: I love this movie!” -> “positive”
  • question answering: “text: The capital of France is _” -> “Paris”
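
One simple way to turn such prompts into predictions is to compare the total log-probability the model assigns to each candidate completion. A sketch for the sentiment example, again with an off-the-shelf GPT-2 rather than any chemistry-specific model:

```python
# Sketch: score candidate labels by the log-probability a causal LM assigns them.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of log-probabilities of the completion tokens given the prompt.
    Assumes the prompt's tokenisation is a prefix of the full tokenisation."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for pos in range(prompt_len, full_ids.shape[1]):
        # token at `pos` is predicted from the logits at position `pos - 1`
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

prompt = "text: I love this movie! sentiment:"
for label in (" positive", " negative"):
    print(label, round(completion_logprob(prompt, label), 2))
```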

Aside: more fun tasks LMs can do

BIG-bench (Srivastava et al., 2023)

Analyses an LM’s ability to do 204 different tasks, including:

  • auto_debugging
    • ”’\nfor i in range(10):\n\ti’ What is the value of i the third time line 2 is executed?” -> “2”
  • color matching
    • “What is the color most closely matching this RGB representation: rgb(128, 2, 198)?” -> “purple”
  • chess_state_tracking_legal_moves
    • “e2e4 g7g6 d2d4 f8g7 c1e3 g8f6 f2f3 d7d6 d1” -> “c1, d2, d3, e2”

Others

  • ASCII MNIST, solve riddles, play sudoku, translate Hindi proverbs, identify math theorems, vitamin C fact verification, etc.

Conclusion

Molecules and analytical data can be represented as strings. This allows us to use LLMs to do chemistry tasks. But how can we cram enough knowledge into an LLM to make it useful for chemistry? Let’s train a single large language model on as many different chemistry tasks as possible.

Bibliography

  1. Kuhn, S., & Schlörer, N. E. (2015). Facilitating quality control for spectra assignments of small organic molecules: nmrshiftdb2 – a free in‐house NMR database with integrated LIMS for academic service laboratories. Magnetic Resonance in Chemistry, 53, 582–589. https://api.semanticscholar.org/CorpusID:2571037
  2. Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., … (2021). Scaling Language Models: Methods, Analysis & Insights from Training Gopher. ArXiv, abs/2112.11446. https://api.semanticscholar.org/CorpusID:245353475
  3. Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N. M., Prabhakaran, V., … Fiedel, N. (2022). PaLM: Scaling Language Modeling with Pathways. ArXiv, abs/2204.02311. https://api.semanticscholar.org/CorpusID:247951931
  4. Fedus, W., Zoph, B., & Shazeer, N. M. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. ArXiv, abs/2101.03961. https://api.semanticscholar.org/CorpusID:231573431
  5. Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., & Komatsuzaki, A. (2021). LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs. https://arxiv.org/abs/2111.02114
  6. Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., Gimenez, M., Sulsky, Y., Kay, J., Springenberg, J. T., Eccles, T., Bruce, J., Razavi, A., Edwards, A., Heess, N., Chen, Y., Hadsell, R., Vinyals, O., Bordbar, M., & de Freitas, N. (2022). A Generalist Agent. https://arxiv.org/abs/2205.06175
  7. Namazifar, M., Papangelis, A., Tur, G., & Hakkani-Tür, D. (2020). Language Model is All You Need: Natural Language Understanding as Question Answering. https://arxiv.org/abs/2011.03023
  8. Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., Kluska, A., Lewkowycz, A., Agarwal, A., Power, A., Ray, A., Warstadt, A., Kocurek, A. W., Safaya, A., Tazarv, A., … Wu, Z. (2023). Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. https://arxiv.org/abs/2206.04615