Supporting data management services for an interdisciplinary, long-term research project with focus on soil, vegetation and atmospheric data

Authors: Constanze Curdt


Conference paper

Summary

This contribution presents research data management (RDM) services that have been established since 2008 within the DFG-funded interdisciplinary, long-term project Collaborative Research Centre/Transregio (CRC/TR) 32 ‘Patterns in Soil-Vegetation-Atmosphere Systems: Monitoring, Modelling, and Data Assimilation’. In this framework, several services were set up to support the involved scientists (e.g. PhD students, postdocs) during the entire lifecycle of the research project. This includes the provision of general support and training for RDM. Moreover, a project-specific RDM system was established according to the needs and requests of the project and its scientists. Common features such as data storage, backup, documentation, search, and access are provided, as well as data metrics. These features are developed in collaboration with the researchers, which facilitates system acceptance.

Introduction

In the last decade, the importance of research data management (RDM) has increased in many research fields and disciplines, because new methods and technologies have facilitated the rapid generation of (research) data and digital information, for example in the geosciences and atmospheric sciences (Overpeck et al. 2011). Consequently, proper storage and backup of the data, as well as sharing, re-use, and even open access to research data, are becoming more important. In particular, research data that have been collected in the context of collaborative, interdisciplinary, long-term research projects need to be exchanged securely among the involved scientists. Thus, there is a need to establish appropriate infrastructures that can serve cross-disciplinary demands (Klar and Enke 2013). Multi-disciplinary research projects that focus on environmental field studies and modelling should particularly provide an appropriate RDM infrastructure (Mückschel et al. 2008). This infrastructure should support the scientists during the entire life cycle of the research project (e.g. data collection, storage, exchange, documentation). The system should be set up according to the requests and needs of the involved scientists, as well as according to recent standards and principles (e.g. metadata standards). Moreover, data metrics are of increasing importance in the context of RDM (Whyte et al. 2014) and have to be considered for RDM services.

In this contribution, we present the RDM practices provided for the interdisciplinary, long-term research project Collaborative Research Centre/Transregio (CRC/TR) 32 ‘Patterns in Soil-Vegetation-Atmosphere Systems: Monitoring, Modelling, and Data Assimilation’, funded by the German Research Foundation (DFG, 2007-2018).

Project Background and Data

The CRC/TR32 ‘Patterns in Soil-Vegetation-Atmosphere Systems’ is an interdisciplinary, long-term research project involving several geoscience research groups at the German universities of Aachen, Bonn, and Cologne, as well as the Research Centre Jülich. The main research aim of the involved scientists is to yield improved numerical Soil-Vegetation-Atmosphere models to predict water, energy and CO2 transfer by calculating patterns at various temporal and spatial scales. The study is conducted within the catchment of the river Rur, located in Germany and in parts of the Netherlands and Belgium. To achieve the research goal of the CRC/TR32, scientists from different disciplines, such as soil and plant sciences, geography, geophysics, hydrology, macromolecular chemistry, meteorology, mathematics, and remote sensing, are involved in various sub-projects. Since 2007, they have created a variety of research data at different spatial and temporal scales. These heterogeneous data were collected during various field campaigns in the study area (e.g. meteorological or hydrological monitoring, airborne campaigns), in laboratory studies (e.g. plant biomass measurements), or were generated by modelling approaches. Moreover, the involved scientists have produced a variety of further data including publications, conference contributions and PhD reports.

Data management practices and services for the CRC/TR32

To support all involved scientists (e.g. postdocs, PhD students, master's or bachelor's students) of the CRC/TR32 during the entire life cycle of the research project, various RDM services (Figure 1) have been established since the project start in 2007.

Figure 1: Research Data Management Services provided for CRC/TR32 members, modified after Curdt and Hoffmeister (2015).

Besides general RDM support and training for the scientists, one main goal was the establishment of an RDM system, the so-called CRC/TR32 project database (TR32DB, www.tr32db.de). This self-designed system has been online since early 2008 and has been continuously developed further since its establishment. The TR32DB was set up in cooperation with the Regional Computing Centre (RRZK) of the University of Cologne, where it is also hosted. It has been developed according to the needs and requests of the scientists (e.g. huge file sizes, heterogeneous data and formats), as well as the requirements of the DFG. The latter include, for instance, the re-use of available infrastructure, cooperation with a computing centre or library, and the set-up of a user-friendly system.

The TR32DB supports the common features of an RDM system. This involves secure storage and backup of all data created by the scientists (e.g. research data, publications, reports; files of up to 8 GB by default), including accurate data documentation. For this purpose, a project-specific, standardized, multi-level metadata schema is used to enable the description of the various data types. The metadata of a dataset have to be provided by the user via the web interface and a corresponding metadata input wizard. Once the data and metadata have been submitted successfully, the dataset is immediately available and searchable via its metadata. The metadata of a dataset are always openly accessible to every visitor of the website. The permission to download a dataset can be set by the data provider: a dataset can be made downloadable only by the sub-project members, by all TR32DB users with a login, or freely by everybody. By default, datasets in the TR32DB receive an internal data identification number and a corresponding permanent URL to access the metadata of the dataset. In addition, the data provider can apply for a DOI for a dataset to make it citable, e.g. in a publication.

Moreover, the TR32DB provides several web mapping applications, such as a map search for data using Google Maps and an internal web GIS for data visualisation purposes. In addition, an internal, secure data exchange platform enables TR32DB users to share, e.g., preliminary project data with colleagues of another sub-project. The latest improvement of the TR32DB is the data metrics, available via the statistics menu of the website since February 2016. Some metrics of the TR32DB are openly accessible. These mainly include visualisations of the number of datasets provided to the TR32DB, broken down by sub-project and data type, and of the total amount of data in the system according to a specific theme or location. In addition, the most downloaded datasets of the system are presented. Further data metrics, such as the downloads or metadata views of a specific dataset, are currently only available for TR32DB users. The successful usage of the RDM services is also clearly visible in the amount of project data stored in the TR32DB. As of May 2016, around 600 GB of data are stored in the data storage and 1,360 datasets are accessible by metadata via the TR32DB.
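The access rules described above (metadata always open, download permission chosen by the data provider from three levels) can be illustrated with a minimal Python sketch. This is not the TR32DB implementation; all class, field, and function names, as well as the sub-project labels, are hypothetical and chosen only for illustration. It also assumes that "sub-project members" refers to the sub-project that provided the dataset.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DownloadPermission(Enum):
    """The three download levels a data provider can choose from."""
    SUBPROJECT_ONLY = "subproject"    # members of the providing sub-project
    LOGGED_IN_USERS = "users"         # every user with a login
    PUBLIC = "public"                 # free download for everybody


@dataclass
class Dataset:
    dataset_id: int                   # internal identification number
    subproject: str                   # hypothetical sub-project label
    permission: DownloadPermission


@dataclass
class User:
    name: str
    subproject: str


def can_view_metadata(dataset: Dataset, visitor: Optional[User]) -> bool:
    # Metadata are open to every visitor, with or without a login.
    return True


def can_download(dataset: Dataset, visitor: Optional[User]) -> bool:
    # visitor is None for an anonymous website visitor (no login).
    if dataset.permission is DownloadPermission.PUBLIC:
        return True
    if visitor is None:
        return False
    if dataset.permission is DownloadPermission.LOGGED_IN_USERS:
        return True
    # Remaining case: restricted to the providing sub-project.
    return visitor.subproject == dataset.subproject


if __name__ == "__main__":
    # Hypothetical example data.
    restricted = Dataset(1, "A1", DownloadPermission.SUBPROJECT_ONLY)
    colleague = User("colleague", "B2")
    print(can_view_metadata(restricted, None))
    print(can_download(restricted, colleague))
```

Keeping the metadata check separate from the download check mirrors the description above: a dataset stays discoverable through its metadata even when the download itself is restricted.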

Discussion and Conclusion

The establishment of research data management services within interdisciplinary, long-term research projects such as DFG-funded Collaborative Research Centres is essential (Klar and Enke 2013), in particular for the achievement of the overall research goal of the project. In many cases, scientists of sub-projects depend on the research results and data of other colleagues to answer their research questions. As already discussed by Mückschel et al. (2008), this applies, for instance, to projects in the earth and environmental sciences, where synergies have to be created between, e.g., the data collectors in the field and the modellers. The establishment of services such as data storage, backup, documentation, search and access provided by an RDM system facilitates the exchange of the project data within the project funding period, and also enables re-use of the data for future studies and by future project participants.

In this study, we have presented RDM services provided for the scientists of the CRC/TR32 project. In this context, the project-specific RDM system TR32DB was established, which supports the project scientists during the entire research project life cycle (e.g. data storage, documentation, search, access). As recommended by Mückschel et al. (2008), the project scientists were involved in the design of the RDM system and services at an early stage. This has facilitated the usage of the established infrastructure. Consequently, one major result gained in the previous years is that RDM services will only be used by scientists if they are provided in compliance with their requirements. RDM systems should be designed according to the project background to meet the project challenges. Likewise, they should be intuitive and simple to facilitate system acceptance and avoid overburdening the scientists. Additionally, training and support of scientists (e.g. practical workshops, tutorials) is a key factor in promoting use and acceptance. The external and internal data metrics of the TR32DB show that the system is frequently used by TR32DB users (with login) for data upload, data sharing and data download. Moreover, internal statistics also show that TR32DB visitors (without login) access metadata and download openly available data.

Acknowledgements

The authors would like to thank all colleagues involved in the design and implementation of the TR32DB. In addition, the authors gratefully acknowledge financial support by the CRC/TR32 ‘Patterns in Soil-Vegetation-Atmosphere Systems: Monitoring, Modelling, and Data Assimilation’ funded by the German Research Foundation (DFG).

Competing Interests

The authors declare that they have no competing interests.

References

Curdt, C. and Hoffmeister, D. 2015. Research data management services for a multidisciplinary, collaborative research project: Design and implementation of the TR32DB project database. Program, 49, 494-512.

Klar, J. and Enke, H. 2013. Forschungsdaten in der Gruppendomäne – Zwischen individuellen Anforderungen und übergreifenden Infrastrukturen. Zeitschrift für Bibliothekswesen und Bibliographie (ZfBB), 60, 316-324.

Mückschel, C., Weist, C. and Köhler, W. 2008. Central data management in environmental research projects - selected problems and solutions. In: Müller, R. A. E., et al. (eds.) 28. GIL Jahrestagung. Kiel, Germany: Lecture Notes in Informatics.

Overpeck, J. T., Meehl, G. A., Bony, S. and Easterling, D. R. 2011. Climate Data Challenges in the 21st Century. Science, 331, 700-702.

Whyte, A., Molloy, L., Beagrie, N. and Houghton, J. 2014. What to Measure? Toward Metrics for Research Data Management. In: Ray, J. (ed.) Research Data Management: Practical Strategies for Information Professionals. West Lafayette, IN, USA: Purdue University Press.