High Value Dataset Metadata
The High Value Dataset (HVD) Implementing Regulation has applied since June 9, 2024, with the aim of making a range of public sector data freely available to the public. The concept of high value datasets first appeared in the Open Data Directive, as a means of improving access to important datasets necessary for economic, social and environmental development.
As discussed in our introduction to the EU Implementing Act on High-Value Datasets, there are six categories of data into which HVDs can be classified:
- Geospatial data
- Earth observation and environment
- Meteorological data
- Statistics
- Companies and company ownership
- Mobility
To promote the interoperability and accessibility of datasets subject to the HVD IR, a DCAT-AP HVD metadata profile was introduced to govern the provision of metadata for these datasets.
For organisations that have been making public sector data available for years under the INSPIRE Directive, several important questions arose. How should overlapping requirements be handled for datasets that are subject to both the INSPIRE Directive and the HVD Implementing Regulation? And how do the metadata requirements fit together, given that the INSPIRE Directive requires metadata based on ISO/TS 19139:2007?
The discussion and development of high-value dataset metadata reporting practices has required alignment between a large number of groups at the EU and national level, including national metadata reporting authorities, open data portals, groups responsible for standards such as DCAT-AP, GeoDCAT-AP, and INSPIRE, as well as data owners and providers.
What are the metadata requirements for HVD datasets?
The Data Catalog Vocabulary (DCAT) is now in version 3.0. It is a Resource Description Framework (RDF) vocabulary which supports interoperability between data catalogs on the web. RDF data forms a graph composed of triples: a node for the subject, an arc (the predicate) pointing from the subject to the object, and a node for the object. The DCAT Application profile for data portals in Europe (DCAT-AP) is based on DCAT and is used to describe public sector datasets in Europe and support interoperability. GeoDCAT-AP is a geospatial extension of DCAT-AP.
The DCAT-AP HVD, or "usage guidelines of DCAT-AP for High-Value Datasets", describes an HVD profile of DCAT-AP which supports the interoperability of catalogued resources. Conformance with the DCAT-AP HVD profile requires compliance with DCAT-AP as a prerequisite. The HVD profile defines further mandatory and optional metadata elements on top of the existing, required DCAT-AP metadata elements.
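To make the triple model concrete, the following minimal Python sketch describes one dataset with two triples and prints them in N-Triples syntax. The dataset IRI and title are hypothetical placeholders; the DCAT, RDF and Dublin Core term IRIs are the published ones. It only illustrates the subject, predicate, object structure and is not a replacement for an RDF library.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Literal:
    """A plain RDF literal with an optional language tag."""
    value: str
    lang: str = ""


# A triple is (subject IRI, predicate IRI, object IRI or literal).
Triple = Tuple[str, str, Union[str, Literal]]

# Hypothetical dataset IRI, used for illustration only.
DATASET = "https://example.org/dataset/roads"

TRIPLES: List[Triple] = [
    (DATASET, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://www.w3.org/ns/dcat#Dataset"),
    (DATASET, "http://purl.org/dc/terms/title", Literal("Road network", "en")),
]


def to_ntriples(triples: List[Triple]) -> str:
    """Serialise triples as N-Triples, one statement per line."""
    lines = []
    for subject, predicate, obj in triples:
        if isinstance(obj, Literal):
            # Escape backslashes and quotes inside the literal value.
            escaped = obj.value.replace("\\", "\\\\").replace('"', '\\"')
            term = f'"{escaped}"' + (f"@{obj.lang}" if obj.lang else "")
        else:
            term = f"<{obj}>"
        lines.append(f"<{subject}> <{predicate}> {term} .")
    return "\n".join(lines)


print(to_ntriples(TRIPLES))
```

Each printed line is one arc of the graph: the first states that the resource is a dcat:Dataset, the second attaches its English title.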
For INSPIRE implementers currently providing ISO metadata for datasets identified as HVD, the path forward is being developed.
Over the past several years, a lot of important groundwork has been done to develop tooling for the INSPIRE community to transform their INSPIRE ISO metadata to GeoDCAT-AP. Version 3.0.0 of GeoDCAT-AP was released on July 9, 2024, and was under review until September 27, 2024. The latest version contains guidance on the usage of DCAT-AP HVD for GeoDCAT-AP. An XSLT tool, updated to use GeoDCAT-AP v3, and a demo API developed by the Semantic Interoperability Community are freely available for use. The tools convert INSPIRE ISO metadata to GeoDCAT-AP; however, it has been determined that the ISO 19139 to GeoDCAT-AP XSLT will not produce fully compliant DCAT-AP HVD metadata.
The RDF generated by the XSLT must be manually corrected on a case-by-case basis to be fully compliant with the DCAT-AP HVD profile. The XSLT limitations are related to the HVD requirements for URI-based licenses and for persistent and dereferenceable IDs. These limitations will drive the need for custom solutions or further extensions, which add the required values as needed.
[From Interoperable Europe: Introductory webinar on the revision of GeoDCAT-AP on February 20, 2024]
How will HVD metadata be reported, or harvested?
HVD metadata, once harvested, will be available for search on Europe’s open data portal, data.europa.eu. The reporting requirements for HVDs on data.europa.eu state that Member States must submit an HVD report by February 9, 2025, and then again every two years. The report must contain the following information:
- list of published high-value datasets at Member State level (and, where relevant, subnational level) with online reference to metadata;
- persistent link to the applicable licensing conditions;
- persistent link to the APIs ensuring access to the high-value datasets.
The DCAT-AP HVD specification also includes a section on reporting which includes the requirement for EU Member States (the reporting authority) to provide a list of HVD datasets. Although no explicit technical guidelines on how this should be implemented are provided, it is suggested that a catalogue endpoint which contains all the metadata about Datasets, Distributions and Data Services that are in scope of the HVD could be used to achieve the reporting requirement. The reporting section goes on to describe a scenario where Member States could query national catalogs for all resources with http://data.europa.eu/eli/reg_impl/2023/138/oj as applicable legislation, and a scenario where the EU could query Member States' national catalogs where persistent identifiers are assigned as specified in DCAT-AP HVD section 10.4.1 Persistent identifiers.
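The first scenario, querying a catalogue for everything that names the HVD regulation as applicable legislation, could be expressed roughly as in the Python sketch below. It only builds the SPARQL query text and does not contact any endpoint. It assumes that the catalogue records the legislation with the `dcatap:applicableLegislation` property in the `http://data.europa.eu/r5r/` namespace; the function name `build_hvd_query` and the `limit` parameter are hypothetical.

```python
HVD_ELI = "http://data.europa.eu/eli/reg_impl/2023/138/oj"


def build_hvd_query(eli: str = HVD_ELI, limit: int = 100) -> str:
    """Return a SPARQL query listing resources in scope of the given legislation."""
    # Reject values that could not be written safely between angle brackets.
    if not eli.startswith(("http://", "https://")) or any(c in eli for c in '<>" \n'):
        raise ValueError(f"not a usable legislation IRI: {eli!r}")
    return f"""PREFIX dcatap: <http://data.europa.eu/r5r/>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT DISTINCT ?resource ?title WHERE {{
  ?resource dcatap:applicableLegislation <{eli}> .
  OPTIONAL {{ ?resource dct:title ?title }}
}}
LIMIT {int(limit)}"""


print(build_hvd_query())
```

The resulting text could be sent to whichever SPARQL endpoint a national catalogue exposes. Whether a given catalogue populates this property depends on how its harvesting and mapping are set up.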
The Action 2.5 HVD and INSPIRE survey outcomes presented in the 79th INSPIRE MIG-T meeting included a recommended scenario for HVD reporting:
Automated reporting of a list of HVD from data.europa.eu where data.europa.eu harvests the national open data portal and the national geodata portal (mapping based on GeoDCAT-AP for geoportals).
The MIG-T is currently working to develop a best practice guide for HVD metadata and has released the results of a multi-national survey on the topic. The survey results will be included as recommendations in the guidelines. It is evident, due to the number of current discussions in the various standards communities, that the development and alignment of rules and reporting mechanisms for HVD are still ongoing. Data.europa.eu recently released a research paper entitled “High Value Datasets Best Practices in Europe” which highlights this point, among other concerns raised by Member States including Finland, Denmark, Italy, Austria, and others currently implementing the HVD regulation.
How are different Member States providing HVD metadata?
In the 79th MIG-T meeting, it was decided to recommend that high-value datasets be identified in metadata by providing the link to the HVD IR and the HVD categories provided in EuroVoc. The recommendation will be included in the upcoming HVD and INSPIRE best practice document. Additionally, it was recommended that licenses be structured and machine readable: only URI-based licenses and access rights should be allowed. This recommendation would require an update to existing metadata if URIs/IRIs are not currently used.
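As a rough illustration of what the license recommendation implies for existing metadata, the Python sketch below flags records whose license is free text rather than an HTTP(S) URI. The record structure (dictionaries with `id` and `license` keys) and the function names are hypothetical, and the check is only syntactic: it does not verify that the URI resolves or that it identifies a recognised license.

```python
from urllib.parse import urlparse


def is_uri_license(value: str) -> bool:
    """Return True if the value looks like an absolute HTTP(S) URI."""
    text = value.strip()
    if not text or any(ch.isspace() for ch in text):
        return False
    parsed = urlparse(text)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def records_needing_update(records):
    """Return the ids of records whose license is missing or not a URI."""
    return [r["id"] for r in records if not is_uri_license(r.get("license", ""))]


# Hypothetical records, for illustration only.
RECORDS = [
    {"id": "a", "license": "https://creativecommons.org/licenses/by/4.0/"},
    {"id": "b", "license": "CC BY 4.0"},
    {"id": "c"},
]

print(records_needing_update(RECORDS))  # ['b', 'c']
```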
At wetransform, we have been following the steps that national metadata leads have been taking, despite the lack of clear technical guidelines, to adapt ISO metadata requirements to include information that can be used to query and extract HVD relevant metadata from their CSW catalogs for the purposes of conversion to DCAT-AP HVD.
Germany
Geodateninfrastruktur Deutschland (GDI-DE) maintains a helpful wiki to discuss HVD related questions, including questions about HVD metadata. In Germany, the “Konventionen zu Metadaten” is currently being updated to include requirements for additional keywords in ISO metadata to identify high-value datasets. A provisional guideline explaining the changes has been provided in the meantime. The keywords will be used to identify high-value datasets and make them available in DCAT-AP.de, which is the German profile of DCAT-AP, and in the open data portal GovData.de.
The sample representation in the XML metadata set (excerpt) looks as follows:
...
<gmd:descriptiveKeywords>
<gmd:MD_Keywords>
<gmd:keyword>
<gmx:Anchor xlink:href="http://data.europa.eu/bna/c_ac64a52d">Georaum</gmx:Anchor>
</gmd:keyword>
...
<gmd:thesaurusName>
<gmd:CI_Citation>
<gmd:title>
<gmx:Anchor xlink:href="http://data.europa.eu/bna/asd487ae75">High-value dataset categories</gmx:Anchor>
</gmd:title>
<gmd:date>
<gmd:CI_Date>
<gmd:date>
<gco:Date>2023-09-27</gco:Date>
</gmd:date>
<gmd:dateType>
<gmd:CI_DateTypeCode codeList="https://standards.iso.org/iso/19139/resources/gmxCodelists.xml#CI_DateTypeCode" codeListValue="publication"/>
</gmd:dateType>
</gmd:CI_Date>
</gmd:date>
...
</gmd:CI_Citation>
</gmd:thesaurusName>
</gmd:MD_Keywords>
</gmd:descriptiveKeywords>
...
Denmark
In Denmark, an approach similar to Germany’s will be taken. While similar, the keyword and tagging approach is not identical, highlighting the need for standardized guidelines on how to add the EU controlled vocabulary on HVD category, or other relevant values, to ISO metadata.
As seen in GitHub: Danish proposal for tagging metadata for data sets that are in scope of HVD, there is a tag saying that the data set is in scope of HVD (gmx:Anchor xlink:href="https://eur-lex.europa.eu/eli/reg_impl/2023/138/oj"):
<gmd:descriptiveKeywords>
<gmd:MD_Keywords>
<gmd:keyword>
<gmx:Anchor xlink:href="https://eur-lex.europa.eu/eli/dir/2007/2/2019-06-26">INSPIRE</gmx:Anchor>
</gmd:keyword>
<gmd:keyword>
<gmx:Anchor xlink:href="https://eur-lex.europa.eu/eli/reg_impl/2023/138/oj">Høj-værdi datasæt</gmx:Anchor>
</gmd:keyword>
...
<gmd:thesaurusName>
<gmd:CI_Citation>
<gmd:title>
<gmx:Anchor xlink:href="https://registry.geonetwork-opensource.org/theme/eu">EU legislation</gmx:Anchor>
</gmd:title>
...
</gmd:thesaurusName>
</gmd:MD_Keywords>
</gmd:descriptiveKeywords>
Another tag specifies that the data set belongs to the Earth observation and environment (da: Jordobservation og miljø) category (gmx:Anchor xlink:href="http://data.europa.eu/bna/c_dd313021"):
<gmd:descriptiveKeywords>
<gmd:MD_Keywords>
<gmd:keyword>
<gmx:Anchor xlink:href="http://data.europa.eu/bna/c_dd313021">Jordobservation og miljø</gmx:Anchor>
</gmd:keyword>
...
<gmd:thesaurusName>
<gmd:CI_Citation>
<gmd:title>
<gmx:Anchor xlink:href="http://data.europa.eu/bna/asd487ae75">High-value dataset categories</gmx:Anchor>
</gmd:title>
...
</gmd:thesaurusName>
</gmd:MD_Keywords>
</gmd:descriptiveKeywords>
The Netherlands
The Netherlands has published guidelines on implementing HVDs for data providers in their country. In the Netherlands, geospatial data is delivered using the Dutch metadata profiles for ISO 19115 and 19119. The conversion of INSPIRE metadata to DCAT is handled by the national catalog. Again, similar to Germany and Denmark, the Netherlands recommends the addition of a keyword in the ISO metadata to identify HVDs:
- For the designation 'high value dataset', the European Legislation Identifier (ELI) must be included.
- The keyword for the HVD theme must be a value from the High-value dataset categories.
This can be included in the metadata as follows:
<gmd:descriptiveKeywords>
<gmd:MD_Keywords>
<gmd:keyword>
<gmx:Anchor xlink:href="http://data.europa.eu/eli/reg_impl/2023/138/oj">HVD</gmx:Anchor>
</gmd:keyword>
</gmd:MD_Keywords>
</gmd:descriptiveKeywords>
<gmd:descriptiveKeywords>
<gmd:MD_Keywords>
<gmd:keyword>
<gmx:Anchor xlink:href="https://op.europa.eu/web/eu-vocabularies/concept/-/resource?uri=http://data.europa.eu/bna/c_dd313021">Earth observation and environment</gmx:Anchor>
</gmd:keyword>
<gmd:thesaurusName>
<gmd:CI_Citation>
<gmd:title>
<gmx:Anchor xlink:href="http://publications.europa.eu/resource/dataset/high-value-dataset-category">High-value dataset categories</gmx:Anchor>
</gmd:title>
<gmd:date>
<gmd:CI_Date>
<gmd:date>
<gco:Date>2023-09-27</gco:Date>
</gmd:date>
<gmd:dateType>
<gmd:CI_DateTypeCode codeList="http://standards.iso.org/ittf/PubliclyAvailableStandards/ISO_19139_Schemas/resources/Codelist/ML_gmxCodelists.xml#CI_DateTypeCode" codeListValue="publication">publication</gmd:CI_DateTypeCode>
</gmd:dateType>
</gmd:CI_Date>
</gmd:date>
</gmd:CI_Citation>
</gmd:thesaurusName>
</gmd:MD_Keywords>
</gmd:descriptiveKeywords>
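Because the national conventions differ in the exact URIs they use, any tool that reads these keywords has to be tolerant. The following Python sketch, using only the standard library, shows one way to detect the HVD tag and category in ISO 19139 keywords. It assumes that the ELI is recognised by its path suffix (covering both the eur-lex.europa.eu and data.europa.eu forms shown above) and that category URIs contain `data.europa.eu/bna/c_`. The `record` wrapper element, the sample record and the function name `hvd_info` are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gmx": "http://www.isotc211.org/2005/gmx",
}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
HVD_ELI_SUFFIX = "/eli/reg_impl/2023/138/oj"
HVD_CATEGORY_MARKER = "data.europa.eu/bna/c_"

# Hypothetical, heavily shortened record based on the Dutch example.
SAMPLE = """<record xmlns:gmd="http://www.isotc211.org/2005/gmd"
        xmlns:gmx="http://www.isotc211.org/2005/gmx"
        xmlns:xlink="http://www.w3.org/1999/xlink">
  <gmd:descriptiveKeywords><gmd:MD_Keywords><gmd:keyword>
    <gmx:Anchor xlink:href="http://data.europa.eu/eli/reg_impl/2023/138/oj">HVD</gmx:Anchor>
  </gmd:keyword></gmd:MD_Keywords></gmd:descriptiveKeywords>
  <gmd:descriptiveKeywords><gmd:MD_Keywords><gmd:keyword>
    <gmx:Anchor xlink:href="https://op.europa.eu/web/eu-vocabularies/concept/-/resource?uri=http://data.europa.eu/bna/c_dd313021">Earth observation and environment</gmx:Anchor>
  </gmd:keyword></gmd:MD_Keywords></gmd:descriptiveKeywords>
</record>"""


def hvd_info(xml_text: str):
    """Return (in_scope, category_uris) derived from the keyword anchors."""
    root = ET.fromstring(xml_text)
    # Only anchors directly inside gmd:keyword count; thesaurus titles are skipped.
    hrefs = [a.get(XLINK_HREF, "")
             for a in root.iterfind(".//gmd:keyword/gmx:Anchor", NS)]
    in_scope = any(h.rstrip("/").endswith(HVD_ELI_SUFFIX) for h in hrefs)
    categories = []
    for href in hrefs:
        start = href.find(HVD_CATEGORY_MARKER)
        if start != -1:
            # Normalise wrapped links to the plain concept URI.
            categories.append("http://" + href[start:])
    return in_scope, categories


print(hvd_info(SAMPLE))
```

For the sample record this reports the dataset as in scope and returns the plain concept URI for the Earth observation and environment category, even though the record wraps it in an op.europa.eu link.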
Providing HVD metadata on a technical level
There are several possible approaches to providing DCAT-AP HVD metadata. However, a number of open questions remain about how HVD metadata will be delivered from a technical point of view.
One option would be to support a triple store capable of generating the required RDF. A triple store, or RDF store, is a database for the storage and retrieval of triples through semantic queries. SPARQL is a standardized semantic query language that can be used to retrieve data stored as RDF. Open data in Europe is already available via a SPARQL endpoint provided by data.europa.eu, although its documentation site indicates that the SPARQL queries for HVD metadata are still in development. Another option would be to use an XSLT bridge, which converts records available via a CSW endpoint to the required DCAT representation using an XSLT mapping, such as the one provided by SEMIC. Alternatively, a custom solution which includes a mapping between metadata models could be developed in hale»studio.
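Whichever route is chosen, the output has to carry the HVD-specific statements. The Python sketch below shows what a very small custom mapping step might emit as Turtle once the dataset URI, title and HVD category have been extracted from the ISO record. It assumes the `dcatap:applicableLegislation` and `dcatap:hvdCategory` properties in the `http://data.europa.eu/r5r/` namespace. The function name and input values are hypothetical, and the result is deliberately incomplete: other properties required by DCAT-AP and DCAT-AP HVD are omitted.

```python
HVD_ELI = "http://data.europa.eu/eli/reg_impl/2023/138/oj"


def to_hvd_turtle(dataset_uri, title, hvd_categories, eli=HVD_ELI):
    """Emit a minimal Turtle description carrying the HVD statements."""
    # Assumption: a dataset in scope of the HVD IR carries at least one category.
    if not hvd_categories:
        raise ValueError("at least one HVD category URI is expected")
    escaped = title.replace("\\", "\\\\").replace('"', '\\"')
    categories = ", ".join(f"<{uri}>" for uri in hvd_categories)
    return "\n".join([
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .",
        "@prefix dct: <http://purl.org/dc/terms/> .",
        "@prefix dcatap: <http://data.europa.eu/r5r/> .",
        "",
        f"<{dataset_uri}> a dcat:Dataset ;",
        f'    dct:title "{escaped}" ;',
        f"    dcatap:applicableLegislation <{eli}> ;",
        f"    dcatap:hvdCategory {categories} .",
    ])


# Hypothetical dataset URI and title; the category URI is the one used above.
print(to_hvd_turtle(
    "https://example.org/dataset/air-quality",
    "Air quality measurements",
    ["http://data.europa.eu/bna/c_dd313021"],
))
```

In practice, the same statements could equally be added by an extension to the XSLT or by a mapping defined in another tool; the sketch only shows the target shape.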
How will we achieve compliance for our customers?
As a first step towards fulfilling the HVD requirements of GDI-DE and other Member States, we have added the ability to cite a controlled vocabulary to hale»connect generated ISO metadata. This will enable users to provide the required HVD keywords in their metadata.
You can learn more about our approach to the Act on High-Value Datasets on our dedicated page for HVD. This also features a recording of a webinar where we explained the most important requirements and possible solutions.
Should you desire further assistance in fulfilling your HVD requirements, please feel free to reach out to us.