RDM and MD Landscape in Earth & Environment



Compilation of Recommendations

Details


Short Title

R 6: Be friendly to web crawlers

Source Document

Guidelines for publishing structured metadata on the Web V3.0

Source Document Link

https://doi.org/10.15497/RDA00066

Publishing Organisation

RDA Research Metadata Schemas WG

Date of Publication

2021-06-15

Topic

Discovery / indexing / search

Addressed Stakeholders

data service providers

Keywords

SEO, crawlers, discovery

Text

After structured metadata are properly implemented and embedded in a metadata landing page, the next step is to list the URL (i.e., address) of the landing page in the repository's sitemap, so that web applications such as crawlers can follow the sitemap to find the landing page and add or update that page in their searchable index. Repositories that have already implemented structured metadata often complain that not all their landing pages are indexed by a web search engine, and are frustrated at not knowing the reason. Each crawler may have its own rules (and limitations) on how and what it follows from a sitemap in order to optimise its users' search experience; repository owners are advised to check the rules of the target application for instructions on how to construct a sitemap. Failing to follow those rules may result in some metadata landing pages not being indexed.

This recommendation addresses only those issues that may require special attention from a data repository. It does not guarantee that every landing page with structured metadata will be indexed by web dataset search tools; however, it may help to diagnose why some landing pages are not being indexed.

A metadata record may go through multiple revisions. A data repository may hold a metadata record for each revision, with highly overlapping content (in some repositories, each version even has its own DOI). A keyword search that returns 10 metadata records for the same dataset does not make for a good user search experience, especially when the latest version of the metadata record is ranked low on that list. In such cases, a repository should include only the URL of the latest version in its sitemap, with the landing page linking to all previous versions of that metadata record.

If changes are made to a metadata record but they are trivial and do not affect discoverability, it is recommended not to update the <lastmod> element of the corresponding sitemap entry. In other words, update <lastmod> only when substantial changes are made to a metadata record, avoiding unnecessary reindexing.

A crawler may limit the number of URLs in a sitemap file, or the size of the file, that it will process. Some repositories, especially aggregators, may hold numerous metadata records; listing all URLs in a single sitemap may exceed a crawler's limit. In such cases, one can split a single large sitemap into several smaller sitemaps and set up a sitemap index file that points to each of them.
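
The three practices above (list only the latest version, change lastmod only on substantial edits, split large sitemaps behind an index) can be illustrated with a minimal Python sketch. It is not part of the source guidelines: the Record class, its field names, the file names and the repo.example.org URLs are all hypothetical, and the 50,000-URL default is the limit stated in the sitemaps.org protocol, which an individual crawler may tighten further.

```python
from dataclasses import dataclass
from datetime import date
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Per-file URL limit from the sitemaps.org protocol; a target crawler may be stricter.
MAX_URLS_PER_SITEMAP = 50_000


@dataclass
class Record:
    """Hypothetical view of one version of a metadata record."""
    dataset_id: str                 # shared by all versions of the same record
    version: int
    landing_url: str
    last_substantial_change: date   # NOT bumped for trivial edits


def latest_versions(records):
    """Keep only the highest version of each metadata record."""
    latest = {}
    for rec in records:
        current = latest.get(rec.dataset_id)
        if current is None or rec.version > current.version:
            latest[rec.dataset_id] = rec
    return list(latest.values())


def build_sitemap(records):
    """Serialise one <urlset> with a <loc> and <lastmod> per landing page."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for rec in records:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = rec.landing_url
        ET.SubElement(url, "lastmod").text = rec.last_substantial_change.isoformat()
    return ET.tostring(urlset, encoding="unicode")


def build_sitemap_index(sitemap_urls):
    """Serialise a <sitemapindex> pointing to each smaller sitemap."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for sitemap_url in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = sitemap_url
    return ET.tostring(index, encoding="unicode")


def build_all(records, base_url, chunk_size=MAX_URLS_PER_SITEMAP):
    """Return {file name: XML text} for the chunked sitemaps plus their index."""
    current = latest_versions(records)
    files = {}
    for n, start in enumerate(range(0, len(current), chunk_size), start=1):
        files[f"sitemap-{n}.xml"] = build_sitemap(current[start:start + chunk_size])
    sitemap_urls = [f"{base_url}/{name}" for name in files]
    files["sitemap-index.xml"] = build_sitemap_index(sitemap_urls)
    return files
```

Calling build_all(records, "https://repo.example.org") would return the XML text of each file, keyed by file name, ready to be written to the web root. Whether an edit counts as "substantial" is a repository policy decision; the sketch assumes that decision has already been made and is stored in last_substantial_change.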