Embedding an Interactive Visualization of High Dimensional Data with Application to Fungi Genome Sequence Data (eScience Center) - DemoLAB

We present Embed-Dive, a tool for interactive visualization of high-dimensional data. The visualization is based on embedding a similarity matrix into a 3-dimensional vector space. The (sparse) similarity matrix holds pairwise similarities between data entities; in our application these are similarities between fungal genomic sequences, based on edit (mutation) distance ("BLAST" score). For the embedding we use the LargeVis algorithm, developed at Microsoft Research (2016). The algorithm generates 3-dimensional coordinates for each data point, such that data points representing very similar sequences also lie very close to each other in the visualization. In this way, hierarchical clusters in the data can be detected visually. The tool can be used in parallel with clustering, so that the embedding and the clustering algorithm cross-validate each other; to investigate particular outliers; or to interactively search and inspect millions of data points.
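To make the idea of embedding a sparse similarity matrix concrete, here is a minimal Python sketch. It is not LargeVis and not the Embed-Dive implementation: it swaps in a much simpler force-directed layout in which pairs listed in the similarity matrix attract each other and randomly sampled unlisted pairs repel. The function names (`embed_3d`, `mean_distance`), the parameter values, and the toy similarity values are all assumptions made for illustration.

```python
import itertools
import math
import random


def embed_3d(similarities, n_points, n_iter=300, lr=0.05, repulsion=0.1, seed=0):
    """Toy 3D layout (illustrative stand-in, not LargeVis).

    similarities: dict mapping (i, j) index pairs to a similarity in [0, 1];
    pairs that are absent are treated as "not similar" (sparse matrix).
    """
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(n_points)]

    # Record which points share an entry in the sparse similarity matrix.
    neighbours = [set() for _ in range(n_points)]
    for i, j in similarities:
        neighbours[i].add(j)
        neighbours[j].add(i)

    for _ in range(n_iter):
        # Attraction: pull each similar pair together, weighted by similarity.
        for (i, j), w in similarities.items():
            for d in range(3):
                delta = pos[j][d] - pos[i][d]
                pos[i][d] += lr * w * delta
                pos[j][d] -= lr * w * delta
        # Repulsion: push each point away from one randomly sampled non-neighbour.
        for i in range(n_points):
            j = rng.randrange(n_points)
            if j == i or j in neighbours[i]:
                continue
            diff = [pos[i][d] - pos[j][d] for d in range(3)]
            dist2 = sum(c * c for c in diff) + 1e-3  # avoid division by zero
            for d in range(3):
                pos[i][d] += lr * repulsion * diff[d] / dist2
    return pos


def mean_distance(coords, pairs):
    """Average Euclidean distance over the given index pairs."""
    pairs = list(pairs)
    return sum(math.dist(coords[i], coords[j]) for i, j in pairs) / len(pairs)


# Hypothetical toy data: two groups of four "sequences", similar only within a group.
group_a, group_b = [0, 1, 2, 3], [4, 5, 6, 7]
similarities = {}
for group in (group_a, group_b):
    for i, j in itertools.combinations(group, 2):
        similarities[(i, j)] = 0.9

coords = embed_3d(similarities, n_points=8)
intra = mean_distance(coords, similarities.keys())
inter = mean_distance(coords, itertools.product(group_a, group_b))
print(f"mean intra-group distance: {intra:.3f}")
print(f"mean inter-group distance: {inter:.3f}")
```

On this toy input the points of each group end up much closer to one another than to the other group, which is the property the visualization relies on. A layout of this naive kind would not scale to millions of points; handling data of that size is the reason to use an algorithm such as LargeVis.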

Sonja Georgievska, The Netherlands eScience Center
In collaboration with: Westerdijk Fungal Biodiversity Institute, Utrecht

Data Science Day is this year’s edition of the annual IT Day, an initiative of Information and Technology Services (ITS) which has taken place since 2016. Each year, the event focuses on current IT trends and developments impacting education, research and work at Utrecht University. Depending on the theme chosen, the event is either intended for all UU colleagues (and external partners) or for a group of colleagues with a common interest.