Predicting protein-peptide binding sites: an LLM-based approach

Highlights

We used ProtBert, a large language model pre-trained on billions of amino acids, to extract features from the sequences.
We then trained a model combining a CNN and an RNN to predict the binding sites.
Our results are on par with state-of-the-art methods that take only sequence-derived information as input, achieving a Matthews correlation coefficient (MCC) of 0.39.
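
For readers unfamiliar with the metric reported above, the Matthews correlation coefficient (MCC) can be computed from per-residue confusion counts. The snippet below is a minimal sketch, not the evaluation code used in this work; the function name `mcc` and the 0/1 label encoding (1 = binding residue, 0 = non-binding residue) are assumptions made for illustration.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels.

    Assumed encoding (illustrative): 1 = binding residue, 0 = non-binding.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

MCC ranges from -1 to 1 and uses all four confusion counts, which makes it more informative than accuracy when binding residues are a small minority of all residues.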

Abstract

Identifying protein-peptide binding sites is crucial to our understanding of several cellular processes. Because experimental data, especially protein structure information, are scarce, predicting these sites remains a difficult problem. Many machine learning models have recently been developed to tackle it, but they have not performed well without structural information. We propose a deep-learning-based technique that takes only the protein sequence as input and predicts binding sites. We leverage pre-trained language models and undersampling techniques to make the model more robust. We also propose an attention-based mechanism to increase the explainability of the model.
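
The undersampling idea mentioned above can be sketched as follows. The abstract does not say which undersampling scheme was used, so this example assumes plain random undersampling of the majority class (non-binding residues) at the residue level; the function name `undersample`, the `ratio` parameter, and the 0/1 label encoding are all hypothetical.

```python
import random


def undersample(labels, ratio=1.0, seed=0):
    """Return sorted indices keeping every positive and a random subset of negatives.

    Assumptions (illustrative, not from the paper):
      labels: per-residue labels, 1 = binding, 0 = non-binding
      ratio:  number of negatives kept per positive
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]

    # Never ask for more negatives than exist.
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept_neg = rng.sample(neg, n_keep)
    return sorted(pos + kept_neg)


# Usage: a toy sequence with 2 binding residues out of 8.
labels = [1, 0, 0, 0, 0, 1, 0, 0]
kept = undersample(labels, ratio=1.0)
balanced_labels = [labels[i] for i in kept]
```

Undersampling of this kind would normally be applied to the training set only, so that reported metrics still reflect the original class imbalance.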