Automatic Protein Function Annotation Through Text Mining

Student thesis: Master's Thesis


Knowledge of a protein’s function is essential to many studies in molecular biology, genetic experiments and protein-protein interactions. The Gene Ontology (GO) captures the functions of gene products in classes and establishes relationships between them. Manually annotating proteins with GO functions from the biomedical literature is a tedious process which calls for automation. We develop a novel, dictionary-based method to annotate proteins with functions from text. We extract text-based features from words matched against a dictionary of GO. Since a class is included upon any word match with its class description, negative samples outnumber the positive ones. To mitigate this imbalance, we apply strict rules before weakly labeling the dataset according to the curated annotations. Furthermore, we discard samples of low statistical evidence and train a logistic regression classifier. The results of a 5-fold cross-validation show a precision of 91% and an accuracy of 96% in the best-performing fold. The worst fold showed a precision of 80% and an accuracy of 95%. We conclude by explaining how this method can be used for similar annotation problems.
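The candidate generation and weak labeling steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the three-entry dictionary, the function names `tokenize`, `candidate_classes` and `weak_label`, and the overlap feature are all assumptions made for illustration, and the strict filtering rules and the logistic regression classifier are omitted.

```python
import re

# Hypothetical toy dictionary mapping GO class IDs to class descriptions.
# A real dictionary would be built from the full Gene Ontology.
GO_DICTIONARY = {
    "GO:0016301": "kinase activity",
    "GO:0003677": "DNA binding",
    "GO:0006412": "translation",
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def candidate_classes(sentence, dictionary=GO_DICTIONARY):
    """Include a class as soon as any word of its description occurs.

    Returns {go_id: fraction of description words found in the sentence}.
    The fraction is an assumed, illustrative text-based feature.
    """
    words = tokenize(sentence)
    candidates = {}
    for go_id, description in dictionary.items():
        description_words = tokenize(description)
        matched = description_words & words
        if matched:
            candidates[go_id] = len(matched) / len(description_words)
    return candidates


def weak_label(candidates, curated):
    """Label a candidate 1 if it appears in the curated annotations, else 0."""
    return [
        (go_id, overlap, int(go_id in curated))
        for go_id, overlap in sorted(candidates.items())
    ]


sentence = "The protein shows kinase activity and weak binding to RNA."
candidates = candidate_classes(sentence)
samples = weak_label(candidates, curated={"GO:0016301"})
for go_id, overlap, label in samples:
    print(go_id, overlap, label)
```

In this toy sentence the single word "binding" is enough to pull in the DNA binding class as a negative sample, which hints at why loose word matching produces far more negatives than positives and why some filtering before training is needed.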
Date of Award: Aug 25 2019
Original language: English (US)
Awarding Institution
  • Computer, Electrical and Mathematical Sciences and Engineering
Supervisor: Robert Hoehndorf


  • Protein function
  • Gene Ontology
  • Text Mining
  • Biomedical
  • Annotation
  • Automatic
