Improving NCI metadata connectivity using RD-Switchboard

Authors: Jingbo WANG, Amir Aryani, Wei Si, Melanie Barlow, Ben Evans, Lesley Wyborn  


Conference paper

Summary

Making research data connected, discoverable and reusable is one of the key enablers of the new data-intensive revolution in research. Using the Research Data Switchboard (RD-Switchboard) (http://www.rd-switchboard.org/) on the Australian National Computational Infrastructure (NCI) data collections metadata catalogue, we show how connectivity graphs can support machine-actionable literature searches that discover links between researchers, publications and datasets. We also show how RD-Switchboard was used to detect errors in the catalogue, thereby improving the quality of its information.

Introduction

Since 2013, the National Computational Infrastructure (NCI) at the Australian National University (ANU) has established a High Performance Data (HPD) service which manages over 10 PB of significant Earth Systems and Environmental national and international reference data (Evans et al, 2014, Wang et al, 2014), including data collections from NCI’s partners (Australian National University, CSIRO, Bureau of Meteorology, and Geoscience Australia). This data is co-located with High Performance Computing (HPC) and data analysis capabilities, and made available both locally and via data services through the National Environmental Research Data Interoperability Platform (NERDIP1) (Evans et al, 2015, Wyborn and Evans, 2015).

To support the overall management, a Data Management Plan (DMP) has been developed to document the data generation and storage workflows, as well as the key contacts and their responsibilities for each data collection. The DMP has attributes that enhance discovery and these are readily exported to the ISO19115 schema for publication through NCI’s GeoNetwork catalogue2. However, metadata held in XML files is not well suited to answering research impact questions such as: “How many datasets published at NCI have been referenced in research journal articles, and in which articles?”; “How many researchers and institutes are connected to a given dataset?”; “What are the connections between the derived data products and source reference data at NCI, and who generates those derived data products and who uses them?” Hence, NCI incorporated the Research Data Switchboard (RD-Switchboard3) software to help visualize and document the connectivity among researchers, datasets, and publications.

RD-Switchboard is an open and collaborative software solution initiated by the Data Description Registry Interoperability (DDRI4) working group of the Research Data Alliance (RDA). This working group had input from the Australian National Data Service (ANDS), Dryad, CERN and a number of international partners. RD-Switchboard connects datasets on the basis of co-authorship or other collaboration models such as joint funding and grants. This is similar to the “SEE ALSO” functionality in online movie services, app stores or book stores, where customers are invited to look at other products by the same author, on related topics or from similar publishers.
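As a rough illustration of the co-authorship model described above, the following Python sketch links datasets that share at least one author. The record layout, dataset names and author keys are hypothetical and are not taken from RD-Switchboard itself.

```python
from collections import defaultdict
from itertools import combinations


def related_by_coauthorship(dataset_authors):
    """Return {dataset: set of other datasets sharing at least one author}."""
    # Invert the mapping: author -> datasets that author contributed to.
    by_author = defaultdict(set)
    for dataset, authors in dataset_authors.items():
        for author in authors:
            by_author[author].add(dataset)

    # Every pair of datasets under the same author becomes a "see also" link.
    related = {dataset: set() for dataset in dataset_authors}
    for datasets in by_author.values():
        for first, second in combinations(sorted(datasets), 2):
            related[first].add(second)
            related[second].add(first)
    return related


# Hypothetical records: dataset key -> author keys (for example ORCID iDs).
EXAMPLE = {
    "dataset-A": {"author-1", "author-2"},
    "dataset-B": {"author-2"},
    "dataset-C": {"author-3"},
}
SEE_ALSO = related_by_coauthorship(EXAMPLE)
```

Here dataset-A and dataset-B are linked through author-2, while dataset-C has no shared author and so receives no suggestions. Other collaboration models, such as joint grants, could be handled the same way by swapping the author keys for grant keys.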

Technical Implementation

The RD-Switchboard has several technical components (see https://github.com/rd-switchboard). NCI’s implementation is installed within virtual machines (VMs) on NCI’s OpenStack cloud infrastructure. The metadata ingest workflow that helps describe these components is shown in Figure 1.

Figure 1: Data updating process flowchart.

The RD-Switchboard harvester collects XML files of all metadata records and stores them in its internal repository. The records are then imported into its Neo4j graph database, which is ultimately used by the RD-Switchboard inference engine5. NCI maintains this XML harvest and import process using an automated tool, which triggers the process whenever metadata information is updated. Because of the amount of information in NCI’s catalogue, the RD-Switchboard has been implemented to harvest in a scalable way from multiple databases, and metadata is harvested and imported on a regular basis to stay synchronised with changes. A series of database snapshots is backed up so that progressive changes to the graph database can be shown chronologically. The Neo4j browser6 then provides a query interface to readily identify connections within the graph database. NCI has implemented some fail-safe procedures around this ingest process to protect the information in operations.
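The harvest-then-import step can be pictured with a small Python sketch that parses XML records into graph nodes and relationships. The XML layout, the AUTHOR_OF relationship name and all keys below are simplifying assumptions made for illustration; real ISO19115 records are namespaced and much richer, and this is not the RD-Switchboard harvester code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified records standing in for harvested metadata XML.
SAMPLE_XML = """
<records>
  <record id="dataset-A">
    <title>Example dataset</title>
    <author key="author-1"/>
    <author key="author-2"/>
  </record>
  <record id="dataset-B">
    <title>Another dataset</title>
    <author key="author-2"/>
  </record>
</records>
"""


def harvest(xml_text):
    """Parse harvested XML into node and relationship collections."""
    nodes = {}             # key -> attribute dictionary
    relationships = set()  # (source key, relationship type, target key)
    root = ET.fromstring(xml_text)
    for record in root.iter("record"):
        dataset_key = record.get("id")
        nodes[dataset_key] = {"type": "dataset", "title": record.findtext("title")}
        for author in record.iter("author"):
            author_key = author.get("key")
            # setdefault keeps one node per key, so repeated authors are merged.
            nodes.setdefault(author_key, {"type": "researcher"})
            relationships.add((author_key, "AUTHOR_OF", dataset_key))
    return nodes, relationships


NODES, RELATIONSHIPS = harvest(SAMPLE_XML)
```

Because nodes are stored by key, an author who appears in several records ends up as a single node with several relationships, which is what makes the later connection queries possible.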

We note that Universally Unique Identifiers (UUIDs) are required on all elements in order to track both internal connections and references to external data and information sources. Where possible, well-known identifiers such as URI, DOI and ORCID are used to identify these elements.
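One way to picture this identifier policy is a small Python helper that prefers a well-known identifier and falls back to a freshly minted UUID. The function name, the list of accepted prefixes and the (scheme, value) key format are assumptions for illustration only, not the scheme used by RD-Switchboard.

```python
import re
import uuid

# Assumed set of common DOI prefixes; extend as needed.
DOI_PREFIXES = ("https://doi.org/", "http://dx.doi.org/", "http://doi.org/", "doi:")
# An ORCID iD is four groups of four characters; the last may be a check "X".
ORCID_PATTERN = re.compile(r"(\d{4}-\d{4}-\d{4}-\d{3}[\dX])$")


def node_key(identifier=None):
    """Return a (scheme, value) key, preferring well-known identifiers."""
    if identifier:
        text = identifier.strip()
        lowered = text.lower()
        for prefix in DOI_PREFIXES:
            if lowered.startswith(prefix):
                # DOIs are case-insensitive, so lower-casing gives one key per DOI.
                return ("doi", lowered[len(prefix):])
        match = ORCID_PATTERN.search(text.upper())
        if match:
            return ("orcid", match.group(1))
        if lowered.startswith(("http://", "https://")):
            return ("uri", text)
    # No recognised identifier: mint a random UUID so the element stays trackable.
    return ("uuid", str(uuid.uuid4()))
```

For example, node_key("http://dx.doi.org/10.15497/RDA00003") yields the key ("doi", "10.15497/rda00003"), so the same DOI written with different prefixes or letter case maps to a single node.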

Connections Identified in the Graph Database

RD-Switchboard builds connections between researchers, datasets, organisations and publications (see Figure 2).

Figure 2: Connection screenshot of initial metadata database.

The connections show the relationships between researchers, publications and datasets. The connection report is a comprehensive reference for researchers to understand who uses their data, and what research has been published using this data. We believe that this programmatic connection through identifiers will save significant time compared with finding similar information through traditional manual approaches and literature searches.

The connection report can also help determine the impact of a dataset through the total number of its connections – the more connections a dataset has, the higher the relevance it has within the research community. Through analyzing the connections to datasets, it is also possible to identify high value datasets to researchers and organisations, and help measure the impact that these datasets have had in the published literature.
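The connection-count measure described above amounts to a node's degree in the graph. A minimal Python sketch, assuming relationships are available as (source, type, target) tuples with hypothetical keys and relationship names, could rank datasets as follows.

```python
from collections import Counter


def connection_counts(relationships):
    """Count how many relationships touch each node (its degree)."""
    counts = Counter()
    for source, _relationship_type, target in relationships:
        counts[source] += 1
        counts[target] += 1
    return counts


def rank_datasets(relationships, dataset_keys):
    """Return dataset keys ordered from most to least connected."""
    counts = connection_counts(relationships)
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(dataset_keys, key=lambda key: (-counts[key], key))


# Hypothetical relationships: (source, type, target).
EXAMPLE_RELATIONSHIPS = [
    ("author-1", "AUTHOR_OF", "dataset-A"),
    ("author-2", "AUTHOR_OF", "dataset-A"),
    ("paper-1", "CITES", "dataset-A"),
    ("author-2", "AUTHOR_OF", "dataset-B"),
]
RANKING = rank_datasets(EXAMPLE_RELATIONSHIPS, ["dataset-A", "dataset-B", "dataset-C"])
```

A raw count treats every relationship equally; weighting citations from publications more heavily than authorship links would be a natural refinement, but that choice is outside what the text specifies.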

Analysis of the connection graphs can also be highly effective for checking key citation information in the metadata catalogue. NCI has used the unique keys in the graph database to identify faulty metadata entries, which would otherwise have been difficult to detect through manual checking of tens of thousands of records. For example, one institute was easily shown to be incorrectly recorded in a person’s key; on investigation, this was found to be due to the institute being entered into the wrong field in the original metadata.
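A check of this kind can be sketched in Python by flagging person keys that contain the name of a known organisation. The key format, the organisation list and the substring test are all hypothetical; the text does not describe the exact check NCI ran.

```python
def suspicious_person_keys(person_keys, organisation_names):
    """Return person keys that contain a known organisation name."""
    flagged = []
    for key in person_keys:
        lowered = key.lower()
        # A person key that embeds an organisation name is worth a manual look.
        if any(name.lower() in lowered for name in organisation_names):
            flagged.append(key)
    return flagged


# Hypothetical keys; the second holds an institute where a person belongs.
PERSON_KEYS = ["person:jane-citizen", "person:example-institute"]
ORGANISATIONS = ["example-institute"]
FLAGGED = suspicious_person_keys(PERSON_KEYS, ORGANISATIONS)
```

The flagged keys are only candidates for review: a substring match can also catch legitimate names, so each hit would still need to be checked against the original metadata record.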

Acknowledgements

The NCI High Performance Data Node is supported by the Australian Government NCRIS funded Research Data Storage Infrastructure (RDSI) and Research Data Services (RDS) projects and the RDA Data Description Registry Interoperability (DDRI) Working group.

Competing Interests

The authors declare that they have no competing interests.

References

Evans, B J K, Wyborn, L, Lewis, A, Foster, F, Minchin, S, Pugh, T, Uhlherr, A, Evans, B J 2014 Computational Environments and Analysis methods available on the NCI HPC & HPD Platform. American Geophysical Union Fall meeting, San Francisco, USA, December 13-17, 2014. http://adsabs.harvard.edu/abs/2014AGUFMIN53E..01E [Last accessed 20 July 2016]

Evans, B, Wyborn, L, Pugh, T, Allen, C, Antony, J, Gohar, K, Porter, D, Smillie, J, Trenham, C, Wang, J, Ip, A, and Bell, G 2015 The NCI High Performance Computing and High Performance Data Platform to Support the Analysis of Petascale Environmental Data Collections. In: Environmental Software Systems, Infrastructures, Services, Applications, Denzer, I.R, Argent, R M, Schimak, G, Hrebicek, J (eds) IFIP AICT 448, pp. 569–577, 2015.

Wang, J, Evans, B, Bastrakova, I, Ryder, G, Martin, J, Duursma, D, Gohar, K, Mackey, T, Paget, M, Siddeswara, G and Wyborn, L 2014 Large-Scale Data Collection Metadata Management at the National Computation Infrastructure. American Geophysical Union Fall meeting, San Francisco, USA, December 13-17, 2014. https://agu.confex.com/agu/fm14/webprogram/Paper22170.html [Last Accessed 20 July 2016]

Wyborn, L and Evans, B 2015 Integrating ‘Big’ Geoscience Data into the Petascale National Environmental Research Interoperability Platform (NERDIP): successes and unforeseen challenges. IEEE Big Data Conference 2015 Workshop on Big Data in the Geosciences. Santa Clara, California. http://dx.doi.org/10.1109/BigData.2015.7363981. [Last Accessed 20 July 2016]

Aryani, A 2016 Data Description Registry Interoperability WG: Interlinking Method and Specification of Cross-Platform Discovery. Research Data Alliance. http://dx.doi.org/10.15497/RDA00003 [Last Accessed 20 July 2016]

Notes

  1. http://nci.org.au/systems-services/national-facility/nerdip/

  2. http://geonetwork.nci.org.au

  3. http://rd-switchboard.nci.org.au

  4. https://rd-alliance.org/groups/data-description-registry-interoperability.html

  5. https://github.com/rd-switchboard/Inference

  6. https://github.com/rd-switchboard/Neo4j-Browser