Storing and preserving data
This guide provides several good practices in the storage of research data during data collection and in preserving your data after your project is finished.
You can start by watching this video about preserving research data in an optimal and technically correct way. This tutorial is part of the online training 'Learn to write your DMP'.
1. Good practices in storing data
Storing your data properly can save you a lot of time (in finding and interpreting it) and frustration (by not losing it). Moreover, if your data is properly structured and annotated during research, you can preserve and/or share it with minimal effort at the end of your research. To properly store your data, consider the following:
Not all storage locations are equally suitable for all types of storage:
- Portable devices are suitable for holding short-term copies of your data files for transport, but they are vulnerable to loss and there is no automatic back-up of your data. If you use portable devices, make sure you choose high quality products from reputable manufacturers, regularly check the media to make sure that they are not failing and periodically 'refresh' the data (that is, copy to a new CD, disk, or USB flash drive).
- Cloud services are suitable for collaboration with partners from outside the university, with the added benefit that they are not device specific. Be sure to check if your selected cloud service makes regular back-ups, if it falls under European jurisdiction and if it is a trustworthy partner. If not, this particular cloud service could be a bad idea.
- The university network drives are suitable if you collaborate with others within the university and are also not device specific. Back-ups here are made automatically and very regularly.
- Yoda (Your Data) is a data management solution developed by Utrecht University for reliable storage and preservation of large amounts of research data during all stages of a research project. Yoda supports advanced access control, data policy management and replication of the data to other geographic locations, which makes it a suitable solution for sensitive data.
See 'Tools for storing and managing data' for an overview of the tools Utrecht University develops, supports and endorses.
In your day-to-day research make sure you manage the different versions and copies of your data carefully in the following ways:
- Protect raw data
Your raw data, as you have collected it or have received it, is the basis of all analyses that you plan. With the raw data and the recorded steps of your analyses, you can retrace all of your results. So, it is important that raw data is not accidentally overwritten or changed. Store it in a separate, protected location, for instance a separate folder that is set to ‘read only’. Make a working copy of your raw data to do your actual analyses on. To verify that the raw data you currently store is still identical to the original, check its integrity with a checksum tool, for example by comparing MD5 sums (on Mac and UNIX systems an MD5 utility is provided by the operating system).
- Keep temporary and master copies apart
Your working files change frequently. Imagine you have several copies at different locations. How do you keep track of which copy contains the updates you most recently made? If you choose the wrong file, it takes time to merge both documents afterwards. To avoid confusion, select ONE place where the master copies of your work are located. All other copies are temporary and should be placed back or synchronised with the master copy location, at regular intervals, fixed times, or after each edit.
- Back-up your master copy in physically distinct locations
If there is a calamity or accident at your master copy location, all your work could be lost. It is important to have back-ups of your master data files, including one at a separate location. Back-ups are logically made from the master copy location, which should hold the most recent and correct version. Do not overwrite old back-ups; make a new one and delete the old, if necessary.
There are several back-up schemes to choose from. What you choose depends on how much time a back-up takes, how much space you have, how costly it is, and how great the risk is of losing important information between back-ups. You either always do a full back-up of all files, or partial back-ups. Consider backing up important or dynamic data more often. In case of large file sizes, you could decide to back up only the most essential elements.
Some master copy locations provide automatic back-up. In that case, at least inform yourself about the scheme used. Also, make sure that the back-up location is as secure as your master copy location. Moreover, check whether the time and effort needed to restore a back-up copy are acceptable to you, and strategically retain back-ups for a prolonged time.
- Set up a strategy for version control
Versioning lets you track the development of a data file and identify earlier versions when needed. The simplest way to identify a particular version is to add a suffix to the file name such as 'v1.00', 'v1.01' or 'v2.06', where the number before the decimal point indicates major changes and the decimals indicate minor changes. As long as the original 'raw' and definitive copy are retained and processing is well documented, the intermediate working files can be discarded. Keep only the major versions for longer term retention. In a version control table (or file history or log file) you can document what is new or different in each major version that you keep.
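As a minimal sketch of the integrity check mentioned under 'Protect raw data', the Python snippet below computes a checksum for a file and compares it with a value you recorded earlier. The function names `file_checksum` and `is_unchanged` are illustrative assumptions, not part of any university tool. MD5 is used because it is the example named above; the same `hashlib` module also accepts "sha256" if you prefer a stronger hash.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_checksum, algorithm="md5"):
    """Compare the current checksum of a file with a previously recorded one."""
    return file_checksum(path, algorithm) == recorded_checksum.lower()
```

One possible workflow is to record the checksum of each raw data file when you first store it (for instance in your log file) and to call `is_unchanged` before starting a new analysis.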
As your work progresses, it is likely that you’ll have more and more files, all with different content. Finding the exact file that you need can be a hassle if you do not have a logical folder structure or logical file names. Think about naming conventions and folder structure before you start a project. A clear naming and folder structure makes it easier to maintain a manageable number of files and versions, and can save you a lot of frustration. If files are to be shared in a shared file space, standardised file-naming conventions are even more important. Think about the:
- Folder structure
Before you start your project, think of a logical folder structure. Anticipate the kind of files you will produce and envision folders for those files. Not too flat, not too deep. About three steps down is workable. Make it stable and scalable, so you can expand without having to rearrange the structure completely. Don't use folders with possibly overlapping contents on the same level.
A well-arranged folder structure, in which folders and sub-folders are hierarchical and follow each other logically, is invaluable for quickly navigating your data and finding what you require. It can be very helpful to draw up your folder structure in a diagram in your DMP.
- File naming
Some do's and don'ts:
- Employ clear file names. Build your file names from elements. Elements could be project name, project number, name of research team/department, measurement type, subject, date of creation, version number, etc. Each element is coded to keep names short.
- Keep file names short. About 25 characters is a good length for a filename.
- Keep a log file where you explain your coded elements, so outsiders, collaborators, supervisors, or yourself in a year’s time, will be able to crack the codes. Your data management plan is a good place to document your file naming conventions.
- Always go from generic to specific. This will help you find sets of files with a simple sorting of filenames in your folder.
- Only use characters from the sets A-Z, a-z, 0-9, hyphen, underscore, and dot. Don't use special characters such as &, %, $ or #, as different operating systems can assign different meanings to those characters. An example of a file name could be ‘NTC_wp5_MA_exp1.csv’ (project, work package within the project, type of measurement, experiment ID of the measurement) or ‘MicroArray_NTC023_20141031.xls’ (content description, project number, date in the international standard format YYYYMMDD).
- Ensure file names are independent of location (this will avoid problems when moving files).
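The naming rules above can be sketched in a few lines of Python. This is only an illustration: the helper names `build_file_name` and `check_file_name` are assumptions, and the 25-character limit is the guideline from the list above rather than a technical restriction.

```python
import re

# Only A-Z, a-z, 0-9, hyphen, underscore and dot, as recommended above.
ALLOWED = re.compile(r"[A-Za-z0-9._-]+")
MAX_LENGTH = 25  # guideline from the text, not a hard technical limit


def build_file_name(elements, extension):
    """Join coded elements (ordered from generic to specific) with underscores."""
    return "_".join(elements) + "." + extension


def check_file_name(name):
    """Return a list of problems found in a file name (empty if none)."""
    problems = []
    if not ALLOWED.fullmatch(name):
        problems.append("contains characters outside A-Z, a-z, 0-9, '-', '_', '.'")
    if len(name) > MAX_LENGTH:
        problems.append(f"longer than {MAX_LENGTH} characters")
    return problems


name = build_file_name(["NTC", "wp5", "MA", "exp1"], "csv")
print(name, check_file_name(name))  # prints: NTC_wp5_MA_exp1.csv []
```

A check like this could be run over a project folder before sharing it, so that names that break the agreed convention are spotted early.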
Types of metadata and data documentation
Documentation (human readable) and metadata (standardised, fixed fields that can take a value, computer readable) both provide information about the data at hand. Both can be used to describe the subject of the measurements or the settings/circumstances under which these were obtained. A minimum set of documentation and metadata could be anything you need to interpret and evaluate the measurements. An extended set could be anything others might find valuable. Specific types of metadata and data documentation serve roughly three goals:
- Finding and reusing data
- Descriptive metadata
E.g. author, contributor, title, abstract, keywords, measurement type, project ID, geomapping, time period, subject area.
- Descriptive documentation
E.g. software scripts, instrument settings, methodology, experimental protocol, codebook, laboratory notebook.
- Managing your data
- Administrative metadata
E.g. data format, date, size, access rights, preservation period, persistent identifier (PID, to cite your data), license.
- Administrative documentation
E.g. user agreements, provenance (description of the origin of the data).
- Understanding the context of your data and files
- Structural metadata
E.g. related content, related projects, version.
- Structural documentation
E.g. database scheme, relations between files, table of content.
Documenting data with metadata sheets
Your (raw) data may consist of several files with measurements (or interviews/observations/samples/etc.). A file name can only hold so much information. Having a metadata table (or sheet) that holds information on your data files can give you a quick overview of what measurements you have in your data files, so you don't have to open each of the files to see and interpret the content. See 'Data description in practice' for more specific guidance and tips or watch this tutorial 'The ins and outs of metadata and data documentation'.
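A metadata sheet can be as simple as a CSV file with one row per data file. The sketch below assumes a hypothetical set of columns (`file_name`, `measurement_type`, `subject`, `date`, `notes`); the fields you actually need depend on your own data, and the function names are illustrative.

```python
import csv
from pathlib import Path

# Hypothetical columns; choose the fields you need to interpret your own data.
FIELDS = ["file_name", "measurement_type", "subject", "date", "notes"]


def write_metadata_sheet(rows, sheet_path):
    """Write one row of metadata per data file to a CSV sheet."""
    with Path(sheet_path).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def find_files(sheet_path, **criteria):
    """Return the file names whose metadata match all given column values."""
    with Path(sheet_path).open(newline="", encoding="utf-8") as handle:
        return [
            row["file_name"]
            for row in csv.DictReader(handle)
            if all(row[key] == value for key, value in criteria.items())
        ]
```

With such a sheet, a call like `find_files("metadata.csv", measurement_type="MA")` would list the matching data files without opening any of them.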
File formats refer to the form in which data is stored. The format is indicated by the file extension at the end, such as .wmv, .mp3, or .pdf. Not all formats are equally widely accessible or future-proof. To enable others to access and use your data, use a standard format for your stored files. The following characteristics will help to ensure access:
- open documentation;
- supported by many software platforms;
- wide adoption/common usage;
- no (or lossless) compression;
- no embedded files or scripts.
At DANS a distinction is made between preferred and acceptable formats for deposits for data preservation and data sharing. Note that if you have to convert your file format into another to share the data with others, important information could be lost during conversion. If possible, work in the standard format from the start. Having the data available in a standard format after your research project ends will increase the possibilities for reuse.
When you need to share your data during research, take into account the wishes of rightful claimants to the data (research subjects, co-authors, partners from industry, etc.) and make sure you comply with relevant legislation (see laws and codes of conduct for 'sharing privacy-sensitive data').
Learn about different measures depending on the kind of security you need:
- Protection of data files
The information in data files can be protected by:
- controlling access to restricted materials with encryption. By encrypting your data, your files will become unreadable to anyone who does not have the correct encryption key. You may encrypt an individual file, but also (part of) a hard disk or USB stick;
- procedural arrangements like arranging access conditions in a consortium agreement and, if necessary, through non-disclosure agreements with participants and data handlers (See the guide on 'Legal instruments and agreements');
- not sending personal or confidential data via email or through File Transfer Protocol (FTP), but rather by transmitting it as encrypted data (e.g. via SURFfilesender);
- destroying data in a consistent and reliable manner when needed. Note that deleting files from hard disks only removes the references to them, not the files themselves. Overwrite the files to scramble their contents or use secure erasing software. For USB sticks and CDs/DVDs, physical destruction is the most reliable way to erase data.
- Computer system security
The computer you use to consult, process and store your data, can be secured in the following ways:
- Use a firewall to protect your computer and data from unauthorised network access;
- Install anti-virus software;
- Install updates and upgrades for your operating system and software;
- Only use secured wireless networks;
- Use passwords and do not share them with anyone. Use passwords not only on your university computer, but also on your laptop or home computer. If necessary, secure individual files with a password;
- Do not provide others with your login credentials.
- Physical data security
With a number of simple measures, you can ensure the physical security of your research data:
- Lock your computer when leaving it, even if it is just for a moment (Windows key + L);
- Lock your door if you are not in your room;
- Keep an eye on your laptop;
- Do not leave unsecured copies of your data lying around;
- Transport your USB stick or external hard disk in such a way that you cannot lose it;
- Keep non-digital material which should not be seen by others, in a locked cabinet or drawer.
- Pay specific attention to protecting privacy-sensitive data
In the guide 'Handling personal data' we will go into more detail (you can skip step 6, which repeats this section).
2. Good practices in preserving data
The Netherlands Code of Conduct for Research Integrity (VSNU, 2018) states that research data must be kept for (at least) 10 years. The Utrecht University Policy Framework for Research Data adds that this 10-year period starts after you have published your paper based on the data you are preserving. For medical records, this period is 15 years or longer (WGBO (article 454)) and (patient) data for drug research must be stored for 20 years. For the UMC Utrecht Research Data Management policy, see the UMCU intranet page. The AVG/GDPR states that personal data may not be kept longer than is necessary for the purposes for which they were collected or for which they are used. Non-anonymised data may, however, be preserved for historical, statistical or scientific purposes.
So, how will you keep your data safe for the long term? Long-term preservation can be pursued in several ways. Preservation can be done on tape, disk, or via cloud storage. You can use a commercial solution, or ask Research Data Management Support to set up an archive. You can use a free, public repository for research data, with added possibilities for sharing your data (see our guide 'Publishing and sharing data'), or you can preserve the data yourself. If you choose the latter option, some best practices are provided here:
Which data will you select for preservation? Choices are, for instance:
- Will you only preserve the data underpinning a scientific publication, or also other data?
- Will you preserve the data once it is completely static (no alterations expected) or will you allow for versions?
- Is the location where you decide to store your data appropriate for preserving personal (privacy-sensitive) data?
- At what time and according to which specifications will your data be removed?
What exactly to preserve also depends on your purpose:
- If you preserve data mainly for verification purposes, preferably intermediate results or materials and methods of analyses are also stored along with the workflow.
- If the data is stored because it might be reusable in the future by yourself or by others, your focus will be to preserve it in a way that makes new analyses possible. In this case you should store the data as raw as possible.
In both cases, enough documentation needs to be added to make the data comprehensible.
External, third party data
If you have made use of data from other parties, you will have to account for them as well. You have two options:
- Arrange with the owners that they store the data and make it available for verification purposes, for at least the obligatory period of storage (ten years). You can then simply refer to their storage.
- Try to arrange a local copy, that you yourself can store for the required period.
You should store all your data and documentation files together in a data package. For verification, all documentation and data (raw or possibly analysed) that enable research replication must be provided. For sharing, data should be stored as raw as possible (if usable in that form) along with documentation to help comprehend and reuse it. In both of these cases you should include:
- A variable list or code book explaining the variables in your data;
- If applicable, the computer code used to perform analyses and/or an explanation of performed analyses ('methods');
- A file which describes the files in the data package and the relations between them.
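The last item in the list, a file describing the contents of the data package, can be started automatically. The sketch below is one possible approach, assuming a hypothetical manifest file called `MANIFEST.csv` in the root of the package; it lists every file with its size and leaves a description column to be filled in by hand.

```python
import csv
from pathlib import Path


def write_manifest(package_dir, manifest_name="MANIFEST.csv"):
    """List every file in a data package with its size and an empty description."""
    package = Path(package_dir)
    manifest_path = package / manifest_name
    # Collect the files first, skipping the manifest itself.
    files = sorted(
        p for p in package.rglob("*")
        if p.is_file() and p != manifest_path
    )
    with manifest_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["relative_path", "size_bytes", "description"])
        for path in files:
            # The description is meant to be completed manually afterwards.
            writer.writerow([path.relative_to(package).as_posix(), path.stat().st_size, ""])
    return manifest_path
```

The relations between files (for example which code book belongs to which data file) still need to be described by a person; the script only makes sure no file is overlooked.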
Once your data is preserved, it is used as a reference point. You have to prevent anyone from overwriting your files, deleting them or changing their contents, whether intentionally or not. Possible measures are:
- If possible, log all visits to your data.
- Prevent unwanted visits to your data by controlling access to it, e.g. by setting up a password, using encryption, and/or physical restrictions (e.g. a vault). Also see 'VI. Secure your data files' above.
- Prevent overwriting, deleting, or meddling by making your files ‘read only’.
- If there are essential updates to the data, preserve a new version in addition to the old, but do not change the original version.
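Making files ‘read only’ can be done through your operating system's file properties, or scripted. The following Python sketch removes the write permission from a file or from every file below a folder; the function names are illustrative assumptions. Note that this only guards against accidental changes: anyone with sufficient rights can restore write permission, so it does not replace access control.

```python
import os
import stat
from pathlib import Path


def make_read_only(path):
    """Remove write permission for owner, group and others."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def protect_folder(folder):
    """Make every file below a folder read-only; return how many were changed."""
    count = 0
    for path in Path(folder).rglob("*"):
        if path.is_file():
            make_read_only(path)
            count += 1
    return count
```

The same approach could be applied to the protected folder that holds your raw data during the project.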
Unfortunately, data can become unusable in due course because:
- digital sources degrade over time ('bit rot');
- data and software can become outdated. For instance, a new software version is not compatible with your data format or a new operating system does not support your software;
- the media on which data is stored becomes outdated (e.g. floppy disks or audio cassettes);
- media on which you store your data (hard drives, CDs, USB sticks, etc.) can become faulty;
- the data has simply gone into oblivion.
Determine who can access (part of) your data. In which cases do you allow for access? What are the privileges that each person gets in the different cases that you can foresee?
Your data has to be protected and be available for verification purposes after your research project has finished. However, you may not be around during your whole research career to provide the appropriate care. Therefore, roles and responsibilities should be written down in an archival policy, making clear who is responsible for doing what with your data in the long term. It's best to put 'roles' (such as the data manager, supervisor, dean, principal investigator, etc.) in the archival policy and not 'persons' as persons may leave the university.
3. IT solutions for storing and preserving data
See 'Tools for storing and managing data' for an overview of the tools Utrecht University has developed, supports and endorses.