Similarity Algorithms for Embeddable Objects

  • Anas Ismail

Student thesis: Doctoral Thesis

Abstract

The need to measure similarity between two objects is everywhere. It is not always clear what it means for two objects to be similar. The definition changes depending on the area of application. However, similarity between two objects is generally defined as an inverse function to the distance between them. Also it is not always easy to apply distance functions on objects directly. Sometimes, we have to transform them or embed them in another space first before we can calculate distance and subsequently similarity. We introduce three similarity algorithms/measures to quantify similarity between objects in different applications. First, we propose the first non brute force algorithm to calculate the Gromov hyperbolicity constant. We present several approximate and exact algorithms to solve this problem. For example, we provide an exact algorithm to compute the hyperbolicity constant in time O (n3:686) for a discrete metric space. We also show that hyperbolicity at a fixed base-point cannot be computed in O(n2:05) time, unless there exists a faster algorithm for (max,min) matrix multiplication than currently known. Then, we present a new system to find proteins similar in functionality. We employ text mining techniques to map text similarity to similarity in functionality. We use manually curated data from Swiss-Prot to train and build our system. The result is a search engine that given a query protein, reports the top similar proteins in functionality with 99% accuracy. The system is tested extensively using GO annotations. We used this system, that predicts similarity in function, to enhance protein annotations. In particular, we were able to predict that some GO annotations should be added to some proteins. After careful literature reviews we were able to con rm many of those predictions, for example, in one case, we have 96% prediction accuracy. We also present a new algorithm for measuring the similarity between GPS traces. Our algorithm is robust against subsampling and supersampling. We perform experiments to compare this new similarity measure with the two main approaches that have been used so far: Dynamic Time Warping (DTW) and the Euclidean distance and our algorithm outperforms both of them in most of the cases.
Date of AwardOct 31 2019
Original languageEnglish (US)
Awarding Institution
  • Computer, Electrical and Mathematical Sciences and Engineering
SupervisorVladimir Bajic (Supervisor)

Keywords

  • algorithms
  • trajectory
  • protein
  • data mining
  • Gromov
  • Doc2 Vec

Cite this

'