How to synchronize data with HPC platforms

At the ITS department of the UU, we have developed a guide on how to manage and transfer your data efficiently when working on HPC platforms.

The Issue

When you move data analyses (or simulations) from your PC to a High Performance Computing (HPC) platform, you face new data-management challenges, because you now run analyses on multiple platforms (e.g. both your PC and an HPC cluster). In addition, the amount of data may be too large to store efficiently on your PC, and HPC clusters are typically not meant for long-term storage of your data. When working on two (or more) computing platforms it is easy to lose track of the most recent versions of your datasets and scripts. It is therefore worth designing a workflow that lets you manage your data and scripts efficiently.

Even experienced users benefit from rethinking how they handle data within HPC workflows from time to time. Software and technology for data management and transfer evolve rapidly; tools that were standard a few years ago are being replaced by new ones that offer new opportunities, save time, and keep your workflow efficient and clear.

This blog post aims to provide insight into what such a workflow could look like and which software can be used to synchronize data between different computing platforms. It also links to instructions for using these tools.

[Infographic: a central storage facility from which you transfer data to, for example, an HPC system]

Workflow

To keep a clear overview of the most recent versions of your data and scripts, it is recommended to use one central storage facility in your workflow, such as YODA, Surfdrive, or a data archive. When you run analyses, you transfer data from the central storage facility to the HPC system, and when your job is done you transfer the output back to storage. YODA (developed at the UU) is a data management environment that can serve as this central storage facility. It is built on state-of-the-art data storage and transfer technology and is suitable for a wide range of users, from researchers who have just started working with large datasets and high performance computing to research groups that work with very large datasets. YODA is not only a handy tool for long-term and secure storage of data; it can also synchronize data to various platforms at very high transfer speeds. Alternatives are cloud storage platforms such as Surfdrive (SURFsara) and Google Drive (commercial), or the Data-Archive (SURFsara).
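To illustrate this "pull, compute, push" pattern, here is a minimal Python sketch that copies input data from central storage to the HPC scratch space, runs an analysis, and copies the results back. It assumes that Rclone (one of the transfer tools discussed below) is installed and that a remote called `storage:` has already been configured; the remote name, folder paths, and analysis command are hypothetical placeholders, not part of an official UU workflow.

```python
"""Minimal sketch of the 'pull - compute - push' workflow described above.

Assumptions (placeholders, adapt to your own setup):
  * Rclone is installed on the HPC platform and a remote named 'storage:'
    points to your central storage facility (configured with 'rclone config').
  * 'project/input' and 'project/output' are hypothetical remote folders,
    and 'run_analysis.py' stands in for your own analysis script.
"""
import subprocess
from pathlib import Path

REMOTE = "storage:project"          # hypothetical Rclone remote + folder
SCRATCH = Path("/scratch/myuser")   # hypothetical scratch space on the HPC system


def transfer(src: str, dst: str) -> None:
    """Copy data between the central storage facility and local scratch."""
    subprocess.run(["rclone", "copy", src, dst], check=True)


def main() -> None:
    workdir = SCRATCH / "project"
    workdir.mkdir(parents=True, exist_ok=True)

    # 1. Pull the most recent input data from central storage to the HPC system.
    transfer(f"{REMOTE}/input", str(workdir / "input"))

    # 2. Run the analysis on the HPC system (placeholder command).
    subprocess.run(
        ["python", "run_analysis.py",
         "--input", str(workdir / "input"),
         "--output", str(workdir / "output")],
        check=True,
    )

    # 3. Push the results back, so central storage remains the single
    #    up-to-date copy of your data.
    transfer(str(workdir / "output"), f"{REMOTE}/output")


if __name__ == "__main__":
    main()
```

The same pattern works with any of the transfer tools described in the next section; only the `transfer` step changes.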

Data transfer and synchronization: ‘drag and drop’ versus command line tools

For users who are used to working only with Windows or Apple computers, tools are available to connect to remote data storage as well as the storage systems of HPC platforms, and to transfer files manually using a ‘drag and drop’ approach. Open source tools in this category are WinSCP and Cyberduck, or the more versatile MobaXterm, which can also be used for logging in to HPC platforms. These tools are solid and intuitive for beginning users, but also somewhat slow, which becomes relevant when you frequently need to transfer many gigabytes of data.

For users who are experienced with the command line, and/or users who need to transfer large amounts of data, more efficient command line tools exist. The above-mentioned storage system YODA is typically accessed via ‘icommands’. Another very versatile tool that can be used in combination with many storage platforms is ‘Rclone’. These tools can be run directly on the HPC platform, and transfer commands can be incorporated in job scripts, as sketched below.
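As an example of incorporating such transfer commands in a job, the sketch below wraps the iRODS icommands `iget` and `iput` (used to access YODA) in a small Python helper that could be called at the start and end of a compute job. The zone and collection paths are hypothetical examples rather than actual YODA locations, and authentication (e.g. via `iinit`) is assumed to have been set up beforehand.

```python
"""Sketch of embedding YODA transfers (via iRODS icommands) in a compute job.

Assumptions (placeholders, adapt to your own setup):
  * The icommands are installed and you have authenticated with 'iinit'.
  * '/exampleZone/home/research-myproject' is a hypothetical YODA collection;
    replace it with the collection of your own research group.
"""
import subprocess

YODA_COLLECTION = "/exampleZone/home/research-myproject"  # hypothetical collection
LOCAL_DIR = "data"                                        # local folder on the HPC system


def download_inputs() -> None:
    # 'iget -r' recursively downloads a collection from YODA to local storage.
    subprocess.run(["iget", "-r", f"{YODA_COLLECTION}/input", LOCAL_DIR], check=True)


def upload_results() -> None:
    # 'iput -r' recursively uploads local results back into the YODA collection.
    subprocess.run(["iput", "-r", f"{LOCAL_DIR}/output",
                    f"{YODA_COLLECTION}/output"], check=True)


if __name__ == "__main__":
    download_inputs()
    # ... run your analysis here, e.g. as a separate step in the job script ...
    upload_results()
```

With Rclone the structure is the same; the `iget`/`iput` calls are simply replaced by `rclone copy` commands for the remote you configured.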

Instructions

At the ITS department of the UU, we have developed a guide on how to manage and transfer your data efficiently when working on HPC platforms. It describes several solutions for different use cases and evaluates the performance of those tools. You can find the guide in the GitHub repository of the University.

For questions and support, contact Research IT via info.rdm@uu.nl.