Compilation of Recommendations

Details

Short Title

R 2: Identify what resources are to be marked up with structured data

Source Document

Guidelines for publishing structured metadata on the Web V3.0

Source Document Link

https://doi.org/10.15497/RDA00066

Publishing Organisation

RDA Research Metadata Schemas WG

Date of Publication

2021-06-15

Topic

Discovery/ indexing/ search, Interlinking/ interoperability, Machine-actionability

Addressed Stakeholders

data service providers, data stewards

Keywords

metadata

Text

More and more data repositories hold metadata not only for datasets but also for other research resource objects such as software, models, instruments and samples. These resources are essential for supporting open and reproducible research. Our analysis shows that almost every research resource object has a corresponding class in Schema.org.
The primary goal of publishing data to the web (or any other platform) is wider discoverability; discoverability, however, is usually only a means for data to be found and reused. One therefore has to determine the necessary properties of a resource and its relationships to other resources, i.e. the provenance information that helps data consumers judge the reusability and quality of that resource. The W3C Provenance Incubator Group defines provenance of a resource as: ‘a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility.’ (Gil et al., 2010)
If we treat datasets as primary resources, then according to the above definition and the provenance data model, provenance information includes where (e.g., location) and how (e.g., software, instrument, model, sensor) data was captured or produced, who (person or organisation) was involved in its generation, and for which purpose (e.g., project or grant) it was produced.
The question then becomes: where should this provenance information, i.e. these properties and relationships, reside? For simple data discovery, basic metadata is often enough to lead users to the repository website, where more information can be found. For many repositories this may be sufficient, at least as a first step.
However, to facilitate advanced data discovery, assessment and access services, and to connect related resources wherever they may reside, we strongly recommend publishing detailed dataset metadata that connects all related resources into a web of (distributed) data, thereby increasing discovery and access paths to all data.
Another important step when defining which resources will be included as structured data is the persistent identification of those resources. An ‘identifier’ is a special property that provides a unique identity and should be included for most, if not all, marked-up resources (McMurry, Juty, Blomberg, et al., 2017). An identifier is used to name a resource or a thing uniquely (whether a digital resource or not); a persistent identifier (PID) is additionally guaranteed to be managed and kept up to date over a defined time period. The PID issuing body should provide accompanying documentation that describes the level of persistence. Examples of persistent identifiers include Digital Object Identifiers (DOI), Persistent URL (PURL) based identifiers (e.g. identifiers.org, OCLC PURL), and Uniform Resource Names (URN). PIDs can be used by both humans and machines to identify resources on the web and, where applicable, to resolve to those resources. Only the custodial repository should assign a persistent identifier. In situations where metadata records from a repository are harvested and published by multiple aggregators, using the same persistent identifier for a resource enables users to easily identify it as the same resource.
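As an illustration of the recommendation above, the following minimal Python sketch assembles a Schema.org Dataset description as JSON-LD that carries a persistent identifier together with the where, how, who and purpose aspects of provenance. The mapping of those aspects to the Schema.org properties spatialCoverage, measurementTechnique, isBasedOn, creator and funder is one possible choice, not a mapping prescribed by the guidelines. The function name build_dataset_jsonld and all names, DOIs and URLs are hypothetical placeholders.

```python
import json


def build_dataset_jsonld(name, doi, place, technique, software_url, creator, funder):
    """Return a Schema.org Dataset description as a JSON-LD dictionary.

    All arguments are illustrative; the property mapping is an assumption.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        # Persistent identifier assigned by the custodial repository.
        "identifier": f"https://doi.org/{doi}",
        # Where: location at which the data was captured.
        "spatialCoverage": {"@type": "Place", "name": place},
        # How: technique used, plus a link to the producing software.
        "measurementTechnique": technique,
        "isBasedOn": {"@type": "SoftwareSourceCode", "@id": software_url},
        # Who: person involved in generating the data.
        "creator": {"@type": "Person", "name": creator},
        # Purpose: organisation that funded the work.
        "funder": {"@type": "Organization", "name": funder},
    }


# Hypothetical example values, not taken from any real repository.
record = build_dataset_jsonld(
    name="Example river temperature series",
    doi="10.1234/example",
    place="Example River, Exampleland",
    technique="In-situ temperature logging",
    software_url="https://example.org/software/logger-tools",
    creator="Jane Example",
    funder="Example Research Council",
)

print(json.dumps(record, indent=2))
```

The resulting JSON can be embedded in a dataset landing page inside a script element of type application/ld+json. Because every harvested copy carries the same identifier value, aggregators can recognise the copies as one resource.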