Preserving data
How will you keep your data safe for the long term? Data can be preserved in several ways, for example on tape, on disk, in a cloud solution or in a data repository. If you choose to preserve the data yourself, some good practices are provided here.
The data policy of Utrecht University (UU) states that research data underlying a published paper must be kept for at least 10 years after the scientific publication. For medical records, this period is at least 15 years (WGBO, article 454), and (patient) data for drug research must be stored for at least 20 years.
With regard to personal data, the General Data Protection Regulation (GDPR) states that “personal data may not be kept longer than necessary for the purposes for which they were collected or used”. However, personal data may be preserved longer for historical, statistical or scientific purposes, for example for verification of the study results.
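The minimum retention periods above can be turned into an earliest removal date with a little date arithmetic. The following is a minimal sketch, not an official UU tool: the category names and the function `earliest_removal_date` are hypothetical, and it assumes you know the date on which the retention period starts (for research data, the publication date).

```python
from datetime import date

# Minimum retention periods in years, as stated in the text above.
# The category names are hypothetical labels chosen for this sketch.
MIN_RETENTION_YEARS = {
    "research_data": 10,
    "medical_records": 15,
    "drug_research": 20,
}


def earliest_removal_date(start: date, category: str) -> date:
    """Return the first date on which the minimum retention period has passed."""
    years = MIN_RETENTION_YEARS[category]
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # The start date is 29 February and the target year is not a leap year.
        # Assumption: round forward to 1 March so the full period is covered.
        return date(start.year + years, 3, 1)


if __name__ == "__main__":
    print(earliest_removal_date(date(2020, 5, 14), "research_data"))
```

These are minimum periods only. For personal data, the GDPR condition that data are not kept longer than necessary still applies.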
Consider the following questions when determining which data you will preserve:
- Will you only preserve the data underpinning a scientific publication, or also other data?
- Will you preserve the data once it is completely static (no alterations expected) or will you allow for different versions?
- Do you reasonably plan to remove the data after 10 years or sooner?
- For what purpose are you preserving the data?
  - If you preserve data mainly for verification purposes, preferably also store intermediate results and the materials and methods of analysis along with the workflow.
  - If the data is stored because it might be reusable for other scientific purposes, preserve it in a way that makes new analyses possible. In this case, store the data as raw as possible. See the Publishing and Sharing Data guide for more information.
If you have used data from other parties, you will have to account for that data as well. You have two options:
- Arrange with the owners that they store the data and make it available for verification purposes for at least the obligatory storage period. You can then simply refer to their storage.
- Try to arrange a local copy that you can store yourself for the required period.
In all cases, documentation needs to be added to the data to make it comprehensible.
You should store all data and documentation together in a data package. For verification, all documentation and data that enable replication must be provided. For sharing, data should be stored as raw as possible (if usable in that form) along with documentation to help understand and reuse it. In both of these cases you should include:
- The data to preserve, e.g., raw, analyzed or otherwise useful data.
- As much documentation as is relevant for the purpose of preservation, for example:
  - a file containing administrative information, e.g., authors, project title, funder, date, keywords, start and end dates, geographical location, access conditions, terms of use, etc.
  - information on the methods, e.g., which cases were excluded, how the data were analyzed, etc. A reference to an (open access) published manuscript is sufficient as well.
  - a README file describing the files in the data package and how they relate to each other.
  - a variable list or codebook explaining the variables in your data.
- Research materials used during and after data collection, if relevant:
  - (description of the) materials used to collect the data (e.g., tasks, scripts, etc.).
  - (description of the) materials used to process the data (e.g., scripts to preprocess, clean, analyze the data).
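Before you archive a data package, you could check automatically whether the documentation listed above is present. The sketch below is only an illustration: this guide does not prescribe any file names, so the names in `REQUIRED_DOCS` and the function `missing_documentation` are hypothetical and should be adapted to your own conventions.

```python
from pathlib import Path

# Hypothetical file names per documentation item; adapt these to your own
# conventions, since the guide does not prescribe specific names.
REQUIRED_DOCS = {
    "README": ("README", "README.txt", "README.md"),
    "administrative information": ("metadata.json", "metadata.txt"),
    "codebook": ("codebook.csv", "codebook.txt"),
}


def missing_documentation(package_dir):
    """Return the documentation items not found in the root of the package."""
    root = Path(package_dir)
    present = {path.name for path in root.iterdir() if path.is_file()}
    return [
        item
        for item, candidates in REQUIRED_DOCS.items()
        if not any(name in present for name in candidates)
    ]


if __name__ == "__main__":
    for item in missing_documentation("."):
        print(f"Missing: {item}")
```

A check like this only confirms that the files exist. Whether the documentation is complete enough for verification or reuse still needs a human reader.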
Once your data is preserved, it serves as a reference point. You must prevent anyone from overwriting, deleting or changing your files, whether intentionally or by accident. Possible measures are:
- Control access to your data, e.g. by setting up a password or using encryption.
- Make the files “read-only”. This can be done in Yoda by “vaulting” the data.
- If there are essential updates to the data, preserve a new version in addition to the old, but do not change the original version.
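If you preserve the data on a file system you manage yourself, the read-only measure can be approximated by removing write permissions. This is a minimal sketch using the Python standard library; it is not how Yoda vaulting works, and the function name `make_read_only` is hypothetical. Note that the file owner or an administrator can still restore write access, and that on Windows `os.chmod` only toggles the read-only flag.

```python
import os
import stat
from pathlib import Path


def make_read_only(package_dir):
    """Remove write permission from every file under package_dir.

    Returns the number of files that were processed.
    """
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    processed = 0
    for path in Path(package_dir).rglob("*"):
        if path.is_file():
            current_mode = stat.S_IMODE(path.stat().st_mode)
            # Keep all existing permission bits except the write bits.
            os.chmod(path, current_mode & ~write_bits)
            processed += 1
    return processed
```

Because this protects against accidental changes rather than deliberate ones, it works best in combination with access control and checksums.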
Data can become unusable over time. You can take the following precautions:
- Use a checksum (e.g., MD5) to check whether the data has degraded ('bit rot').
- Data and software can become outdated. For instance, a new software version may no longer be compatible with your file formats, or a new operating system may not support your software. To prevent this, you can:
  - Migrate the data to a current format from time to time.
  - Store the data together with its software dependencies, for example in a virtual machine, Docker container or Docker image.
- To prevent the media on which you store your data from becoming outdated (e.g. floppy disks or audio cassettes) or faulty (hard drives, CDs, USB drives, etc.):
  - Make regular back-ups.
  - Migrate the data to new media from time to time.
- To prevent the data from being forgotten, you can assign someone to be responsible for the data and/or publish the (meta)data in a data repository.
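The checksum check mentioned above can be sketched with Python's standard `hashlib` module. The idea is to record a checksum for every file when the package is archived, and to recompute and compare the checksums later. The function names are hypothetical. The sketch defaults to SHA-256 rather than the MD5 mentioned above; passing `algorithm="md5"` should work wherever your Python build provides MD5.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute the checksum of one file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(package_dir, algorithm="sha256"):
    """Map each file's path (relative to the package) to its checksum."""
    root = Path(package_dir)
    return {
        path.relative_to(root).as_posix(): file_checksum(path, algorithm)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(package_dir, manifest, algorithm="sha256"):
    """Return files from the manifest that are now missing or different."""
    current = build_manifest(package_dir, algorithm)
    return sorted(
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    )
```

In practice you would store the manifest (for example as a text file) alongside the data package but on separate media, and run the comparison at regular intervals. A mismatch only tells you that a file changed, so you still need a back-up to restore it.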
Determine and document beforehand who can access (part of) your data and under which conditions. What are the privileges that each person gets in the different cases that you can foresee?
If your data are not published in a data repository, it is important to assign someone who will be responsible for managing your data. Preferably write the responsibilities down in a policy that makes clear who is responsible for doing what with your data in the long term. This may be done at the group or department level. It is best to name 'roles' (such as the data manager, supervisor, principal investigator, etc.) in the archival policy rather than 'persons', as persons may leave the university.
See 'Tools for storing and managing data' for an overview of the tools Utrecht University has developed, supports and endorses.
Do you need support or assistance? Please contact us. We are there to help you.