Predicting the combinatorial code for any odorant molecule: a graph neural networks approach

Thu-P1-019

Presented by: Matej Hladis

Matej Hladis, Maxence Lalis, Sebastien Fiorucci, Jeremie Topin

Institute of Chemistry in Nice, Universite Cote d'Azur, France

Our sense of smell relies on approximately 400 genes expressing functional odorant receptors (ORs), which endow us with the ability to discriminate a vast number of chemical stimuli. A single OR can recognize ligands from several different chemical classes, and a single molecule can activate several ORs, giving rise to a complex combinatorial code of olfaction. Cracking this code is a long-standing challenge, and its first step is the identification of OR-ligand pairs.
To date, the common procedure for identifying OR ligands has been based on in vitro screening, with rather low success rates of ~2%. Moreover, data linking a molecule to a set of ORs are scarce, and only 131 human ORs have an identified ligand. Building a machine learning protocol that links molecules to OR sequences therefore remains challenging. To tackle this issue, we leverage recent advances in representation learning and combine them with graph neural networks (GNNs) to build a receptor-ligand prediction model. To our knowledge, this is the first model for OR ligand prediction that takes the entire protein sequence into account.
Several methods inspired by the success of representation learning in natural language processing (NLP) have been proposed to represent protein sequences. Here we use a BERT model that was previously trained on more than 200M protein sequences. We treat molecules as graphs and process ORs and molecules simultaneously using a GNN.
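To make the idea of processing a receptor and a molecular graph together more concrete, the following is a minimal, untrained sketch of one message-passing step in plain Python. It is not the model described in this abstract. The toy ethanol graph, the four-number `receptor_vec` standing in for a protein language model embedding, the random weights, and the choice to concatenate the receptor vector to every atom are all assumptions made for illustration.

```python
import math
import random

def one_hot(symbol, vocabulary=("C", "N", "O")):
    """Encode an atom symbol as a one-hot feature vector (toy vocabulary)."""
    return [1.0 if symbol == s else 0.0 for s in vocabulary]

def neighbours(n_atoms, bonds):
    """Build an undirected adjacency list from a list of bonded atom pairs."""
    adj = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj

def linear(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def message_passing_step(node_feats, adj, receptor_vec, weights):
    """Update each atom from its own state, its neighbours, and the receptor."""
    updated = []
    for i, feat in enumerate(node_feats):
        # Message: element-wise sum of the neighbouring atoms' features.
        msg = [sum(node_feats[j][k] for j in adj[i]) for k in range(len(feat))]
        # Assumption: the receptor embedding is concatenated to every atom.
        combined = feat + msg + receptor_vec
        updated.append([max(0.0, v) for v in linear(weights, combined)])  # ReLU
    return updated

def predict(atoms, bonds, receptor_vec, hidden=4, seed=0):
    """Return an untrained score in (0, 1) for one receptor-molecule pair."""
    rng = random.Random(seed)  # random weights stand in for trained ones
    feats = [one_hot(a) for a in atoms]
    adj = neighbours(len(atoms), bonds)
    in_dim = 2 * len(feats[0]) + len(receptor_vec)
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(hidden)]
    readout = [rng.uniform(-1, 1) for _ in range(hidden)]
    h = message_passing_step(feats, adj, receptor_vec, weights)
    # Mean pooling over atoms gives one vector for the whole molecule.
    pooled = [sum(col) / len(h) for col in zip(*h)]
    logit = sum(r * p for r, p in zip(readout, pooled))
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid

# Hypothetical inputs: ethanol heavy atoms (C-C-O) and a made-up receptor vector.
receptor_vec = [0.2, -0.1, 0.5, 0.0]
score = predict(["C", "C", "O"], [(0, 1), (1, 2)], receptor_vec)
```

Because the neighbour messages are summed and the atom states are averaged, the score does not depend on the order in which atoms are listed, which is the property that makes graph-based models a natural fit for molecules.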
Multiple experimental assays have been carried out to identify new OR ligands, yet a curated dataset of OR-molecule pairs is still missing. Therefore, to train and evaluate our model, we gathered a new dataset of more than 46 000 OR-molecule pairs, combining results from 31 publications. Using these data, we evaluated our model on a test set comprising more than 1500 high-accuracy tertiary screening data points (EC₅₀). Our receptor-ligand model correctly identifies 70% of ligands (recall) with 69% precision and achieves a Matthews correlation coefficient (MCC) of 0.60.
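For readers less familiar with the three reported metrics, the sketch below shows how recall, precision, and MCC are computed from the counts of a binary confusion matrix. The function name `classification_metrics` and the example counts are hypothetical and chosen only for illustration. They are not taken from the test set described above.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Compute recall, precision, and MCC from confusion-matrix counts."""
    # Recall: fraction of true ligands that the model identifies.
    recall = tp / (tp + fn)
    # Precision: fraction of predicted ligands that are true ligands.
    precision = tp / (tp + fp)
    # MCC uses all four cells, so it stays informative on imbalanced data.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return recall, precision, mcc

# Made-up counts for an imbalanced screen: few ligands, many non-ligands.
recall, precision, mcc = classification_metrics(tp=8, fp=2, tn=85, fn=5)
print(f"recall={recall:.2f} precision={precision:.2f} MCC={mcc:.2f}")
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction, which makes it a useful single summary when active OR-molecule pairs are much rarer than inactive ones.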