Data selection


The two major use cases and drivers for what to keep are Research Integrity and Reproducibility (availability of the data supporting the findings in research); and the Potential for Reuse (availability of data for sharing with other users). | Beagrie, 2019



Can we afford not to preserve certain research data? That is the question that is central to the selection of research data for long-term archiving and data publication. Which research data do we archive for verification purposes only? And which datasets do we really make findable and reusable by publishing the (meta)data in a data archive? The criteria are discussed in this section.


Reasons for the retention of research data

There may be several reasons for retaining research data: 

  • The importance of the research data
    The data have potential value for reuse and for (inter)national positioning. Relevant factors are the quality, originality, size, scale and production costs of the data or, for example, the innovative nature of the research.
  • The uniqueness of the data
    The data include non-repeatable observations.
  • The importance of data for historical research
    The data are of value to historians, especially for research into the history of science.
  • Other reasons
    The research data is important for non-scientific purposes (cultural heritage, museums or presentations).

In addition to these general considerations, research funders such as the Netherlands Organisation for Scientific Research (NWO, n.d.) are increasingly making it compulsory to retain research data so that reuse is possible. The Netherlands Code of Conduct for Research Integrity (VSNU, 2018) also obliges researchers to retain both raw and processed data for a period appropriate to the discipline and methodologies used.

Preconditions

Research data are not selected on substantive arguments alone. A number of practical considerations and preconditions also weigh in the final decision. Consider, for example, the following:


  • In which formats are the data available? Are the data format and the software format usable? For (re)usability, data should preferably be stored in sustainable data formats.
  • What is the processing phase of the data: raw/unprocessed, semi-processed or published?
  • Are enough metadata and data documentation available? Is the information of sufficient quality to understand what the data are about?
  • Is there clarity about intellectual property rights, such as copyright or database rights? Are personal data involved? Can the data be archived or published as they are, or are additional measures required?
  • Is a sustainable infrastructure available for archiving or publishing the data? Think of a data archive or an institutional or thematic repository.
  • Have the costs of selecting, archiving, converting, storing and making the data available for reuse been taken into account? Whether data are archived for the long term remains a weighing of costs and benefits. How do the costs of archiving or publishing relate to the costs of reproducing the research data?
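
The preconditions above can be read as a checklist followed by a simple cost comparison. The Python sketch below is only an illustration of that reasoning, not a prescribed method: the class `Preconditions`, the functions `unmet`, `archiving_cost` and `cheaper_to_keep`, and all amounts are hypothetical. It also assumes that data which cannot be reproduced at all (for example, non-repeatable observations) always count as cheaper to keep.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Preconditions:
    """One yes/no answer per precondition question (hypothetical field names)."""
    sustainable_format: bool
    processing_phase_known: bool
    sufficient_documentation: bool
    rights_and_privacy_clear: bool
    infrastructure_available: bool
    costs_accounted_for: bool


def unmet(checks: Preconditions) -> list[str]:
    """Return the names of the preconditions that are not yet satisfied."""
    return [f.name for f in fields(checks) if not getattr(checks, f.name)]


def archiving_cost(annual_cost: float, years: int, one_off_cost: float = 0.0) -> float:
    """Total cost of keeping the data: one-off curation plus yearly storage."""
    return one_off_cost + annual_cost * years


def cheaper_to_keep(keep_cost: float, reproduce_cost: Optional[float]) -> bool:
    """Compare keeping with reproducing; None means the data cannot be reproduced."""
    if reproduce_cost is None:
        # Assumption: non-repeatable data are always worth more than their storage cost.
        return True
    return keep_cost < reproduce_cost


# Illustrative values only; none of these figures come from the text.
checks = Preconditions(True, True, False, True, True, True)
print(unmet(checks))                                   # ['sufficient_documentation']
keep = archiving_cost(annual_cost=200.0, years=10, one_off_cost=500.0)
print(keep)                                            # 2500.0
print(cheaper_to_keep(keep, reproduce_cost=40000.0))   # True
```

In practice the outcome of such a comparison is one input among several: the substantive reasons for retention listed earlier, and any funder or code-of-conduct requirements, still apply.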



Archive or publish

If the preconditions are met, it is important to decide whether you will: 

  • Archive the data for verification purposes, or to keep open the possibility of using the data again in future research.
  • Publish the data in a data archive or institutional repository for reuse by others, now or in the future.

In the flow chart below, the arguments for making an informed choice are visualised in a simplified manner.




In the spotlight


More information about selecting research data can be found in the reports by Tjalsma and Rombouts (2011) and Beagrie (2019).



  • The Cabauw radar data in 4TU.ResearchData (n.d.) are a clear example of data that meet the selection criteria. The radar data contain information about the influence of dust particles on cloud formation. These measurements can only be made once and provide valuable information about climate change. Both the processed data and the raw data are stored. The argument for storing the raw data is that they may contain information that cannot yet be extracted.
  • Interview projects are an example of research that is difficult to repeat. Recording personal experiences of, for example, the Second World War is often a matter of "now or never" because of the age of the interviewees. DANS holds a large amount of interview data in its collections Oral History (DANS, 2012) and World War II (DANS, n.d.), which are a valuable source for historical research, now and in the future. The interviews are kept behind the scenes in a large format, which is regarded as the "raw data", and are made available as MP4 via the DANS Data Station Social Sciences and Humanities.
  • Likewise, the data now being collected at the Large Hadron Collider (a particle accelerator) are data that we cannot afford to lose (CERN, n.d.).


A student left the following comment, which makes clear that the considerations for retention are not always straightforward.

"I have an example that does not fit in with the mentioned cases, and for which it is difficult to find the optimal solution. We perform experiments that produce massive amounts of data. The experiments are difficult and expensive, suggesting that it is a good idea to store this raw data. However, the data is not usable in the original format and needs to be preprocessed, which greatly reduces its quantity. The preprocessed data is used for our analyses and publications, so if colleagues want to verify our data, they would also need our preprocessed data sets. It therefore seems more sensible to store this data for the long term, also with respect to the costs of storage. In addition, we expect the data acquisition to continuously improve in quality. So, in five years or less the raw data we now have may be very inferior to what we can record in the future. However, the preprocessing algorithms are also developing, and other researchers might be more interested in applying these to our datasets. Moreover, the experiments we have performed are unlikely to be redone in the future because of the costs involved". | Chris van der Togt, 2018




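RDM Support at Utrecht University in the Netherlands offers two how-to guides: Storing and preserving data (Utrecht University, n.d.a.) and Publishing and sharing data (Utrecht University, n.d.b.).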



The next section contains an infographic with RDNL services for archiving and publishing data.




Sources

4TU.ResearchData (n.d.). Atmospheric Observation Collection Cabauw. Retrieved from https://data.4tu.nl/collections/Atmospheric_observations_IDRA_Cabauw/5065367

Beagrie, N. (2019). What to Keep: A Jisc research data study. http://repository.jisc.ac.uk/7262/1/JR0100_WHAT_RESEARCH_DATA_TO_KEEP_FEB2019_v5_WEB.pdf

CERN (n.d.). CERN Open data portal. http://opendata.cern.ch/

DANS (2012). Thematische collectie: Oral History. https://doi.org/10.17026/dans-z3c-f26d

DANS (n.d.). Collectie Tweede Wereldoorlog. https://ssh.datastations.nl/dataverse/root?q=&fq1=dansCollection_ss%3A%22https%3A%2F%2Fvocabularies.dans.knaw.nl%2Fcollections%2Fssh%2F213a59ea-8d36-42c6-b30b-474ebbdef61d%22&fq0=dvObjectType%3A%28dataverses+OR+datasets%29&types=dataverses%3Adatasets&sort=dateSort&order=#

Gibney, E. (2013, November 26). LHC Plans for open data future. Nature News. Retrieved from http://www.nature.com/news/lhc-plans-for-open-data-future-1.14244

NASA. (2011). Astronomers find elusive planets in decade old hubble-data. Retrieved from http://www.nasa.gov/mission_pages/hubble/science/elusive-planets.html

NWO (n.d.). Open science. https://www.nwo.nl/beleid/open+science

Tjalsma, H., & Rombouts, J. (2011). Selection of research data - Guidelines for appraising and selecting research data. Retrieved from http://www.dans.knaw.nl/nl/over/organisatie-beleid/publicaties/DANSselectionofresearchdata.pdf

Utrecht University (n.d.a.). Storing and preserving data. RDM Support. [Guide]. https://www.uu.nl/en/research/research-data-management/guides/storing-and-preserving-data

Utrecht University (n.d.b.). Publishing and sharing data. RDM Support. [Guide]. https://www.uu.nl/en/research/research-data-management/guides/publishing-and-sharing-data

VSNU (2018). Nederlandse gedragscode wetenschappelijke integriteit. https://doi.org/10.17026/dans-2cj-nvwu