Compilation of Recommendations

Details

Short Title

R 3: Adopt or develop a crosswalk from your repository schema to Schema.org

Source Document

Guidelines for publishing structured metadata on the Web V3.0

Source Document Link

https://doi.org/10.15497/RDA00066

Publishing Organisation

RDA Research Metadata Schemas WG

Date of Publication

2021-06-15

Topic

Discovery/ indexing/ search, Interlinking/ interoperability

Addressed Stakeholders

data service providers

Keywords

metadata, schema, cross-walk, schema.org

Text

In most cases, a crosswalk from a repository schema to Schema.org is required, unless Schema.org is already the repository’s metadata schema. A schema crosswalk maps the elements of one schema to the equivalent elements of another schema [30]. We recommend the following practices for crosswalks.
First, look for existing crosswalks. If a schema (such as ISO 19115:2003 or Dublin Core) has already been widely adopted by communities, it is likely that a crosswalk already exists. One should first discover and adopt an existing crosswalk rather than reinvent the wheel; developing a crosswalk may involve extensive labour on concept mapping and may require community consultation. Some crosswalk collections can help to this end, for example the one created by the Space Physics Archive [33]. Furthermore, reusing an existing crosswalk ensures that the repositories using it align to the same terminologies, which improves the opportunity for integration across repositories and the data they hold. This benefits downstream application developers and users when they search for data across repositories via web data discovery applications. If no crosswalk has exactly the desired source and target schemas, it is still useful to consult existing crosswalks to see how properties from two schemas are mapped, especially a crosswalk that has the same target schema.
Second, make your crosswalk openly available as early as possible. Even if a crosswalk is still under development, it is beneficial to open up a draft to the community for feedback, making the crosswalk more adaptable and adoptable. This working group has collected about 15 crosswalks [31].
The 15 source schemas represent general data models (e.g. DCAT, DCAT-AP and DataCite) and domain-specific ones such as Geographic Information (ISO 19115:2003), Bioschemas [32], European Clinical Research Infrastructure Network (ECRIN) (Canham, 2020), and Space Physics Archive Search and Extract (SPASE) [33].
Third, map as many properties as needed for the identified purpose. Recommendation 1 identifies two main use cases for publishing structured data. If the only purpose is exposing data to web applications (e.g. Google Dataset Search), a repository only needs to map a minimal set of properties; however, mapping many more properties beyond the minimal set is required if a repository needs to exchange metadata with other repositories or aggregators in a specific domain, or to support more sophisticated queries and assist relevance assessment. Even when an aggregator recommends only a minimal set, this does not mean the aggregator is restricted to that set. For example, the Google Dataset Search guide [34] recommends 20 properties, which do not include the ‘date updated’, ‘data provider’ and ‘data funder’ fields often used by repositories or desired by funders; however, Google Dataset Search does parse and render some of these properties when present, as shown in Figure 1b (labelled ‘Dataset updated’, ‘Dataset provided by’, and ‘Dataset funded by’ respectively). It is also important to map the properties that data seekers often search for. For example, Kacprzak et al. (2019) found that dataset search queries often include temporal and spatial properties, as well as data format and file type. The data format and file type properties are in Schema.org but not in the Google Dataset Search guidelines [36]. If a property is important for a user to judge the relevance or utility of a dataset and this information is missing from a search result, the user may not refer to the source repository to explore further.
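The difference between a minimal and an extended mapping can be sketched as follows. This is only an illustration, not the crosswalk of any particular repository: the source field names on the left (title, modified, temporal, and so on) are hypothetical Dublin Core-like names chosen for the example, while the targets on the right are existing Schema.org Dataset properties.

```python
# Hypothetical source field names mapped to real Schema.org properties.
MINIMAL_CROSSWALK = {
    "title": "name",
    "description": "description",
    "identifier": "identifier",
}

# The extended crosswalk adds properties that support richer queries.
EXTENDED_CROSSWALK = {
    **MINIMAL_CROSSWALK,
    "modified": "dateModified",
    "temporal": "temporalCoverage",
    "spatial": "spatialCoverage",
    "subject": "keywords",
}


def to_schema_org(record, crosswalk):
    """Build a Schema.org Dataset description from a source record."""
    doc = {"@context": "https://schema.org/", "@type": "Dataset"}
    for source_field, target_property in crosswalk.items():
        value = record.get(source_field)
        if value is not None:
            doc[target_property] = value
    return doc


def unmapped_fields(record, crosswalk):
    """List source fields that the crosswalk silently drops."""
    return sorted(set(record) - set(crosswalk))


# Illustrative record; all values are made up for the example.
record = {
    "title": "Example rainfall observations",
    "description": "Daily rainfall totals from an example station.",
    "identifier": "https://example.org/dataset/1",
    "modified": "2021-06-15",
    "temporal": "2010-01-01/2020-12-31",
    "rights": "CC-BY-4.0",
}

minimal = to_schema_org(record, MINIMAL_CROSSWALK)
extended = to_schema_org(record, EXTENDED_CROSSWALK)
```

Listing the unmapped fields, for example with unmapped_fields(record, EXTENDED_CROSSWALK), is one simple way to see which source properties a crosswalk still leaves out.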
On the Web, there may be unforeseen consumers who harvest and parse as much structured data as is available, in order to provide advanced or domain-specific data discovery tools. The more properties that are provided, and the more the data are connected to other data provenance information, the more discovery paths can be created for data users.
Fourth, take implementations of past versions of the source schema, or descriptions of legacy data, into consideration when adopting or developing a crosswalk. Sometimes there is a clear mapping at the conceptual level; however, there may be discrepancies between the latest schema and datasets that were described following earlier versions of the schema and/or implementation guidelines. For example [35], in the latest version of the Registry Interchange Format – Collections and Services schema (RIF-CS V1.6.3), the property RIF-CS:location (type: url with property target=download), which describes the physical and/or electronic location(s) of a registry object, can be conceptually mapped to Schema:DataDownload:distribution (the description of the location for download of the dataset and the file format for download). However, earlier versions of RIF-CS did not have the target type “download”, so past guidelines from the metadata aggregator Research Data Australia (RDA) instructed content providers to use the property RIF-CS:location (type=url) to point to the source metadata landing page. Taking this historical development of the schema into consideration, it is more appropriate to map RIF-CS:location (type: url) from earlier versions to Schema:sameAs.
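The RIF-CS example can be sketched as a small version-aware mapping rule. This is a simplified illustration under stated assumptions: the function name map_rifcs_location is hypothetical, the location is passed in as a plain URL plus an optional target value rather than parsed from RIF-CS XML, and the use of contentUrl inside the DataDownload object is an assumption about how the download location would be expressed in Schema.org.

```python
def map_rifcs_location(url, target=None):
    """Return a (schema_org_property, value) pair for a RIF-CS location URL.

    A URL flagged with target="download" is treated as a download location.
    A URL without that flag, as in records written against earlier versions
    of the schema, is treated as a pointer to the metadata landing page.
    """
    if target == "download":
        # Assumption: the download URL is carried in contentUrl.
        return "distribution", {"@type": "DataDownload", "contentUrl": url}
    return "sameAs", url


# Illustrative URLs only.
newer = map_rifcs_location("https://example.org/files/data.csv", target="download")
legacy = map_rifcs_location("https://example.org/records/42")
```

Keeping the rule in one function makes the treatment of legacy records explicit, so it can be reviewed with the community along with the rest of the crosswalk.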