Predicting protein-peptide binding sites - an LLM-based approach

A pretrained language model was used to extract features from protein sequences, giving a structure-agnostic approach to predicting peptide binding sites.
Project Duration: 2022


Highlights

  • We used ProtBert, a large language model pretrained on billions of amino acids, to extract features from the protein sequences.
  • We then trained a model combining a CNN and an RNN on these features to predict the binding sites.
  • Our results were on par with state-of-the-art methods that take only sequence-related information as input, achieving a Matthews correlation coefficient (MCC) of 0.39.
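
For readers unfamiliar with the metric in the last highlight, the MCC can be computed from per-residue binary predictions as sketched below. This is a minimal illustration, not code from the project's repository: the function name `mcc` and the label convention (1 for a binding residue, 0 for a non-binding residue) are assumptions made for the example.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels.

    Assumed convention (not from the project): 1 = binding residue,
    0 = non-binding residue.
    """
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when any marginal count is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

The MCC ranges from -1 to 1 and uses all four confusion-matrix counts, which makes it a common choice when one class (here, likely the binding residues) is much rarer than the other.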

GitHub Repository
Download Full Text

Abstract

Protein-peptide binding sites are crucial to our understanding of several cellular processes. Identifying them is difficult because experimental data, especially protein structure information, is scarce. Many machine learning models have recently been developed to tackle this problem, but they have not performed well without structural information. We propose a deep learning technique that takes only the protein sequence as input and predicts its peptide binding sites. We leverage pretrained language models and undersampling techniques to make the model more robust. We also propose an attention-based mechanism to increase the explainability of the model.
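
The abstract does not say which undersampling technique was used. As one possible reading, the sketch below shows plain random undersampling of the majority class. It assumes that non-binding residues (label 0) outnumber binding residues (label 1); the function name `undersample_indices` and the `ratio` and `seed` parameters are hypothetical and chosen only for illustration.

```python
import random


def undersample_indices(labels, ratio=1.0, seed=0):
    """Return sorted indices keeping all positives and a random subset of negatives.

    Assumptions (not from the project): label 1 = binding residue (minority),
    label 0 = non-binding residue (majority). `ratio` is the desired number of
    negatives kept per positive.
    """
    rng = random.Random(seed)  # local RNG so the result is reproducible
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]

    # Never ask for more negatives than exist.
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept_neg = rng.sample(neg, n_keep)
    return sorted(pos + kept_neg)


# Usage: 18 non-binding residues and 2 binding residues, balanced 1:1.
labels = [0] * 18 + [1] * 2
kept = undersample_indices(labels, ratio=1.0, seed=42)
balanced_labels = [labels[i] for i in kept]
```

Undersampling of this kind would normally be applied to the training set only, so that evaluation metrics still reflect the natural class imbalance.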