Analysing data in Yoda

Yoda is a data management solution and is not designed for analysing data. You can still analyse data that is stored in Yoda, and on this page we highlight example workflows for doing so.

What type of analysis will you run, and where?

Before you decide on the best workflow for your use case, you should ask yourself:
 

  • Which type of analysis will I run? 
  • Is this task suitable to run on a personal computer (PC)? Your PC may not be suitable if, for example, the dataset is too large for the available storage, or the analysis needs more processing capacity than your machine has. In that case, consider another analysis platform, such as a virtual research environment (VRE) or a high-performance computing (HPC) facility. If you work at Utrecht University, you can plan an intake meeting with research engineering to get started. If you work at another institution, please contact your own helpdesk.
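The storage part of the second question can be checked quickly before you commit to a workflow. The sketch below is only an illustration: the function name `fits_on_disk` and the default `headroom` factor of 2 (room for the data plus intermediate results) are assumptions, not a Yoda recommendation.

```python
import shutil


def fits_on_disk(dataset_bytes: int, target_dir: str = ".", headroom: float = 2.0) -> bool:
    """Return True if target_dir has room for the dataset plus working space.

    headroom is a hypothetical safety factor: 2.0 means we want twice the
    dataset size to be free, to leave space for intermediate files and results.
    """
    free_bytes = shutil.disk_usage(target_dir).free
    return dataset_bytes * headroom <= free_bytes
```

If this returns False for the folder where you planned to work, a VRE or HPC system with more storage is probably the better place for the analysis.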

Below we discuss three possible workflows for working with data stored in Yoda:
 

  1. Downloading files from Yoda, performing the analysis, and uploading the results to Yoda again. 
  2. Mounting Yoda as a Network Disk and performing the analysis on the device on which the Network Disk is mounted. 
  3. Streaming data in memory, without having to download the data from Yoda.

Downloading files and folders 

Suitable for:  

  • Analysis system: PC, VRE, HPC
  • Data: All file and folder sizes, assuming there is enough storage on the analysis system 

In this workflow, you download the files and folders that you want to analyse from Yoda to the system where you plan to run the analysis, i.e. you create a working copy of your data. You run the analysis on that system and afterwards upload the results, and any changed data, back to Yoda. You can then safely remove your working copy, since the source data stays untouched in Yoda. This frees up storage space on the analysis system.

The main reasons for choosing this method are that it is relatively straightforward and that your analysis script reads the files from local storage, which gives good performance.
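The shape of this workflow can be sketched in a few lines of Python. This is a minimal sketch and not tied to any particular transfer tool: `fetch`, `analyse` and `store` are hypothetical callables that you supply yourself, for example thin wrappers around the transfer tool of your choice.

```python
import tempfile
from pathlib import Path
from typing import Callable


def analyse_with_working_copy(
    fetch: Callable[[Path], None],
    analyse: Callable[[Path], Path],
    store: Callable[[Path], None],
) -> None:
    """Download into a temporary folder, analyse, upload, then clean up."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        fetch(workdir)              # copy the data from Yoda into workdir
        result = analyse(workdir)   # run the analysis, returns the result file
        store(result)               # copy the result back to Yoda
    # Leaving the "with" block deletes the working copy.
    # The source data in Yoda is never touched.
```

Using a temporary directory means the working copy is removed even if the analysis raises an error, so no stale copies pile up on the analysis system.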

There are several ways in which you can download and upload the files: 

  • Via the Yoda web portal: This can be done if you have an internet browser available (e.g., on your PC and on some VREs). You could choose this option when you are already familiar with the web portal, or when you do not want to install additional tools on your system. However, this method is not very reliable when transferring large files. Also, the web portal will not give you clear feedback on whether a download was completed correctly. 

  • Using iBridges: iBridges has a Graphical User Interface for manual down- and uploads, and a command line client and Python API for automated data transfers. iBridges can be installed on all operating systems and is available on most VREs at Utrecht University. In contrast to the Yoda web portal, iBridges can be used for large files and high-performance data transfers, and iBridges checks the integrity of downloaded files. 

  • Using iCommands or GoCommands: These command line tools transfer data slightly faster than iBridges, and also offer many features for working with metadata. 
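The integrity check mentioned for iBridges comes down to comparing checksums of the file in Yoda and the downloaded copy. If you want to verify a download yourself, for example after using the web portal, a minimal sketch could look as follows. It assumes the checksum is reported in the `sha2:` plus base64-encoded SHA-256 form that iRODS commonly uses; whether your Yoda instance is configured this way is an assumption. The function names are hypothetical, and this is not iBridges' own implementation.

```python
import base64
import hashlib


def irods_style_checksum(path, chunk_size: int = 1024 * 1024) -> str:
    """Compute 'sha2:' + base64(SHA-256) of a local file, reading in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "sha2:" + base64.b64encode(digest.digest()).decode("ascii")


def download_is_intact(path, expected_checksum: str) -> bool:
    """Compare the local file against the checksum reported for the original."""
    return irods_style_checksum(path) == expected_checksum
```

Reading in chunks keeps memory use low, so the check also works for files that are larger than the available memory.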

Mounting Yoda as a Network Disk 

Suitable for: 

  • Analysis system: PC, VRE and some HPC systems 
  • Data: Small operations on small files only 

Yoda can be mounted as a Network Disk on your system via the WebDAV protocol. The main advantage of this method is that you can see the files in your file explorer as if they were on your computer. You can then perform your analysis on the analysis system as if the files were stored locally.

However, we only recommend this method if you are working with a small number of small files (a few MB each), or if you just want to browse files and folders. With larger files, operations like reading and writing are slow and can greatly increase the runtime of your analysis. In certain cases, this can even lead to errors. In addition, when you change a file or create a new file on Yoda, this method gives no clear feedback about the ‘upload’ of those changes. If you interrupt the upload (e.g. by shutting down your PC), the changes might be lost. Finally, since the files can easily be opened in an editor, you risk changing files on Yoda by accident.
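Before you start an analysis on a mounted folder, you can check whether it contains files that are too large for this method. The sketch below lists them. The 5 MB default is only our reading of "a few MB" and is an assumption, as is the function name.

```python
from pathlib import Path


def files_too_large_for_mount(folder, limit_mb: float = 5.0):
    """Return all files under folder that are larger than limit_mb megabytes."""
    limit_bytes = limit_mb * 1024 * 1024
    return sorted(
        path
        for path in Path(folder).rglob("*")
        if path.is_file() and path.stat().st_size > limit_bytes
    )
```

If the returned list is not empty, downloading those files first is likely the safer choice.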

Streaming 

Suitable for: 

  • Analysis system: PC, VRE, HPC 
  • Data analysis: When you use Python for your analysis 

Streaming is a more advanced method to analyse data in Yoda. Using iBridges in Python or the Python iRODS client, it is possible to directly load data into memory without having to download it to the analysis system. The main advantage of this method is that you do not create new copies of the data that you later have to remove, and your workflow becomes a lot cleaner. Streaming is especially useful when your data is organised in larger files and you only need extracts, i.e. you do not need all the content. Another use case for streaming is when you need to combine/append the content of many small files for your analysis.  
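To illustrate the "extract" use case, the sketch below reads only the first rows of a CSV file from an open binary stream. How you obtain the stream depends on the client you use (iBridges or the Python iRODS client) and is not shown here. The function name `first_rows` is hypothetical, and the example assumes the stream behaves like a regular Python binary file object.

```python
import csv
import io
from itertools import islice


def first_rows(stream, n_rows: int, encoding: str = "utf-8"):
    """Read the header and the first n_rows records from a binary stream."""
    text = io.TextIOWrapper(stream, encoding=encoding, newline="")
    reader = csv.DictReader(text)
    rows = list(islice(reader, n_rows))  # stop reading after n_rows records
    text.detach()  # leave closing the underlying stream to the caller
    return rows
```

Because the stream is read in chunks and reading stops once enough rows are parsed, the remainder of a large file does not have to be loaded into memory.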

Output of your scripts can also be streamed directly to Yoda, along with metadata. This means that you do not first need to create a local file that contains the output. Instead, you create a file on Yoda and “stream” the output into that file.
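Writing output works the same way in the other direction. The sketch below writes result rows as CSV into an open, writable binary stream. Opening that stream on Yoda and attaching metadata depend on the client you use and are left out. The function name `write_rows` is hypothetical.

```python
import csv
import io


def write_rows(stream, fieldnames, rows, encoding: str = "utf-8") -> None:
    """Write rows (a list of dicts) as CSV into a writable binary stream."""
    text = io.TextIOWrapper(stream, encoding=encoding, newline="")
    writer = csv.DictWriter(text, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    text.flush()   # push buffered text into the underlying stream
    text.detach()  # leave closing the underlying stream to the caller
```

No local output file is created: the rows go straight from memory into the stream.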

If you need assistance with one of the described workflows or if you have a question about a different kind of workflow, please contact the helpdesk of your institution.