DATA SELECTION
The two major use cases and drivers for what to keep are Research Integrity and Reproducibility (availability of the data supporting the findings in research); and the Potential for Reuse (availability of data for sharing with other users). | Beagrie, 2019
Can we afford not to preserve certain research data? That is the question that is central to the selection of research data for long-term archiving and data publication. Which research data do we archive for verification purposes only? And which datasets do we really make findable and reusable by publishing the (meta)data in a data archive? The criteria are discussed in this section.
Reasons for the retention of research data
There may be several reasons for retaining research data:
- The importance of the research data
Potential value for reuse and (inter)national positioning. Quality, originality, size, scale, production costs of the data or, for example, the innovative nature of the research.
- The uniqueness of the data
The data include non-repeatable observations.
- The importance of data for historical research
The data is important for historical research, especially scientific-historical research.
- Other reasons
The research data is important for non-scientific purposes (cultural heritage, museums or presentations).
In addition to these general considerations, research funders such as the Netherlands Organisation for Scientific Research (NWO, n.d.) are increasingly making it compulsory for research data to be retained in order to make re-use possible. The Netherlands Code of Conduct for Research Integrity (VSNU, 2018) also obliges researchers to retain both raw and processed data for a period appropriate to the discipline and methodologies used.
Preconditions
Research data is not selected on substantive arguments alone. A range of practical considerations and preconditions also weighs in the final decision. Consider, for example, the following:
Archive or publish
If the preconditions are met, it is important to decide whether you will:
- Archive the data for verification purposes or to keep open the possibility to use the data again in future research.
- Publish the data for reuse by (future) others in a data archive or institutional repository.
In the flow chart below, the arguments for making an informed choice are visualised in a simplified manner.
In the spotlight
More information about selecting research data can be found in the reports:
- 'What to keep: A JISC research data study' (Beagrie, 2019)
- 'Selection of Research Data, Guidelines for appraising and selecting research data' (Tjalsma & Rombouts, 2011)
- The Cabauw radar data in 4TU.ResearchData (n.d.) is a clear example of data that meets the selection criteria. The radar data contain information about the influence of dust particles on cloud formation. These measurements can only be made once and provide valuable information about climate change. Both the processed data and the raw data are stored. The argument for storing the raw data is that it may contain information that we cannot yet extract from it.
- Interview projects are an example of research that is difficult to repeat. Recording personal experiences of, for example, the Second World War is often a matter of "now or never" because of the age of the interviewees. DANS holds a large amount of interview data in its collections Oral History (DANS, 2012) and World War II (DANS, n.d.), which are a valuable source for historical research, now and in the future. The interviews are preserved behind the scenes in a large file format, which is regarded as the "raw data", and are made available as MP4 via the DANS Data Station Social Sciences and Humanities.
- The data now being collected at the Large Hadron Collider (a particle accelerator) is another example of data that we cannot afford to lose (CERN, n.d.).
A student left the following comment, which makes clear that the considerations for retention aren't always straightforward.
"I have an example that does not fit in with the mentioned cases, and for which it is difficult to find the optimal solution. We perform experiments that produce massive amounts of data. The experiments are difficult and expensive, suggesting that it is a good idea to store this raw data. However, the data is not usable in the original format and needs to be preprocessed, which greatly reduces its quantity. The preprocessed data is used for our analyses and publications, so if colleagues want to verify our data, they would also need our preprocessed data sets. It therefore seems more sensible to store this data for the long term, also with respect to the costs of storage. In addition, we expect the data acquisition to continuously improve in quality. So, in five years or less the raw data we now have may be very inferior to what we can record in the future. However, the preprocessing algorithms are also developing, and other researchers might be more interested in applying these to our datasets. Moreover, the experiments we have performed are unlikely to be redone in the future because of the costs involved." | Chris van der Togt, 2018
RDM Support at Utrecht University in the Netherlands offers two how-to guides:
- Archiving data (Utrecht University, n.d.a.)
- Publishing data (Utrecht University, n.d.b.)
4TU.ResearchData (n.d.). Atmospheric Observation Collection Cabauw. https://data.4tu.nl/collections/Atmospheric_observations_IDRA_Cabauw/5065367
Beagrie, N. (2019). What to Keep: A Jisc research data study. http://repository.jisc.ac.uk/7262/1/JR0100_WHAT_RESEARCH_DATA_TO_KEEP_FEB2019_v5_WEB.pdf
CERN (n.d.). CERN Open data portal. http://opendata.cern.ch/
DANS (2012). Thematische collectie: Oral History. https://doi.org/10.17026/dans-z3c-f26d
DANS (n.d.). Collectie Tweede Wereldoorlog. https://ssh.datastations.nl/dataverse/root?q=&fq1=dansCollection_ss%3A%22https%3A%2F%2Fvocabularies.dans.knaw.nl%2Fcollections%2Fssh%2F213a59ea-8d36-42c6-b30b-474ebbdef61d%22&fq0=dvObjectType%3A%28dataverses+OR+datasets%29&types=dataverses%3Adatasets&sort=dateSort&order=#
Gibney, E. (2013, November 26). LHC plans for open data future. Nature News. http://www.nature.com/news/lhc-plans-for-open-data-future-1.14244
NASA (2011). Astronomers find elusive planets in decade-old Hubble data. http://www.nasa.gov/mission_pages/hubble/science/elusive-planets.html
NWO (n.d.). Open science. https://www.nwo.nl/beleid/open+science
Tjalsma, H., & Rombouts, J. (2011). Selection of research data - Guidelines for appraising and selecting research data. http://www.dans.knaw.nl/nl/over/organisatie-beleid/publicaties/DANSselectionofresearchdata.pdf
Utrecht University (n.d.a.). Storing and preserving data. RDM Support. [Guide]. https://www.uu.nl/en/research/research-data-management/guides/storing-and-preserving-data
Utrecht University (n.d.b.). Publishing and sharing data. RDM Support. [Guide]. https://www.uu.nl/en/research/research-data-management/guides/publishing-and-sharing-data
VSNU (2018). Nederlandse gedragscode wetenschappelijke integriteit. https://doi.org/10.17026/dans-2cj-nvwu