Proteins and their functions play essential roles in many biological processes, so the study of protein-protein interactions (PPIs) is of considerable importance. PPI network data are of great scientific value; however, they are incomplete, and experimental identification is costly in both time and money. Available computational methods perform well at predicting PPIs for model organisms but poorly for novel organisms. Because interaction data are incomplete, it is challenging to train a model for a novel organism. In addition, millions to billions of candidate interactions need to be verified, which is extremely compute-intensive. We aim to improve the prediction of whether a pair of proteins will interact, given only their two sequences as input, and to efficiently predict a PPI network given a proteome of sequences as input. We hypothesize that information about the cellular locations where proteins are active, together with the proteins' 3D structures, can significantly improve prediction performance. To overcome the lack of experimental data, we use structures predicted by AlphaFold2 and cellular locations predicted by DeepGOPlus. We believe that proteins belonging to disjoint cellular components have very little chance of interacting. We manually chose several disjoint pairs of components and confirmed this assumption against experimental PPI data. We then generate new non-interacting pairs from disjoint classes to update the D-SCRIPT dataset. As a result, the AUPR improves by 10% compared to the original D-SCRIPT dataset. In addition, for de novo PPI network prediction, we pre-filter the negatives instead of enumerating all potential PPIs. For E. coli, this pre-filter lets us skip around a million negative pairs. To combine structure and sequence information, we generate a graph for each protein. A graph convolutional network with Self-Attention Graph Pooling, arranged in a Siamese architecture, learns from these graphs for PPI prediction. In this way, we improve AUPR by around 20% compared to our baseline model, D-SCRIPT.
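The location-based pre-filter described above can be illustrated with a minimal sketch. It assumes each protein already carries a set of predicted cellular locations (for example from DeepGOPlus) and that a curated list of disjoint location pairs is available. The compartment names, the `DISJOINT` table, and the function names are hypothetical placeholders chosen for illustration, not the pairs or code used in this work.

```python
from itertools import combinations

# Hypothetical table of location pairs assumed never to host interacting
# proteins. These entries are illustrative, not the pairs chosen in the work.
DISJOINT = {
    frozenset({"nucleus", "extracellular region"}),
    frozenset({"mitochondrion", "extracellular region"}),
}


def is_disjoint(locs_a, locs_b, disjoint=DISJOINT):
    """Return True only if every location pairing of the two proteins is disjoint.

    Proteins without any predicted location are never treated as disjoint,
    so they stay in the candidate set (a conservative assumption).
    """
    if not locs_a or not locs_b:
        return False
    return all(frozenset({a, b}) in disjoint for a in locs_a for b in locs_b)


def split_candidates(locations):
    """Split all unordered protein pairs into (kept candidates, filtered negatives)."""
    keep, negatives = [], []
    for p, q in combinations(sorted(locations), 2):
        if is_disjoint(locations[p], locations[q]):
            negatives.append((p, q))  # assumed non-interacting, skip the model
        else:
            keep.append((p, q))       # still needs to be scored by the model
    return keep, negatives


# Toy proteome with made-up identifiers and locations.
locations = {
    "P1": {"nucleus"},
    "P2": {"extracellular region"},
    "P3": {"nucleus", "cytoplasm"},
    "P4": set(),  # no prediction available
}
keep, negatives = split_candidates(locations)
```

In this toy example only the pair `("P1", "P2")` is filtered out; the same negatives could also serve as additional no-interaction training pairs. Requiring every location pairing to be disjoint is one possible design choice that keeps the filter conservative for proteins active in several locations.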
KAUST Research Repository