The Value of Dataspaces in Geodata Infrastructures
The German version of this article can be accessed here.
Society is facing major challenges such as climate change and loss of ecosystems. For these challenges, we will have to find well-optimised, sustainable solutions. In this process, data is invaluable.
Thanks to Copernicus, INSPIRE, and the Open Data Directive, more and more geodata has become publicly available. However, this data is merely the tip of the proverbial iceberg. Large swathes of relevant data remain hidden from view due to security concerns and legal obligations such as GDPR. The new European Data Strategy aims to make this pool of data more accessible by using a concept that has already proven successful in the automotive industry: data spaces. This article explains how data spaces differ from the existing geodata infrastructure (GDI), which issues they can resolve, and what needs to be kept in mind when implementing them.
The automotive industry is adept at finding efficient solutions to incredibly complex issues. Their products need to conform to a myriad of both legal and scientific standards at every stage of their elaborate research, development, and production chains. In an effort to make these processes more cost-effective and secure, car manufacturers and their suppliers exchange data in a limited fashion.
Originally, companies had only limited and often flawed insight into their supply chains. To remedy the issues that kept cropping up due to this lack of transparency, the Catena-X data space was created. It allows all participating organisations to exchange their data inside a single, secure platform. The organisations involved decide who is allowed access to which data, and for what purpose.
Bolstered by the success of this approach, 2021’s European Data Strategy builds upon the concept of data spaces to unlock the hitherto hidden potential of closed data in other sectors. In the new model, data spaces are set to support nine strategic areas: Agriculture, Environment, Energy, Finance, Healthcare, Manufacturing, Mobility, Public Administration, and Skills. A data space to support the implementation of the Green Deal is already in the works. Every data space will contain a combination of public and proprietary data from both companies and governmental organisations.
What is a data space?
Governance is central to the data space concept. It is a combined set of rules and standards, together with their technical implementation, that defines which roles exist within the data space and what level of access to data each role provides. For example, data providers can allow their data to be used within a training pool for AI models, but severely limit the export of that data outside of the data space. Common technical standards will have to be agreed upon as well, particularly data models such as INSPIRE, XPlanung, or 3A/NAS for the geospatial and environmental sectors.
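The training-pool example can be sketched in a few lines of Python. This is only an illustration under assumed names, not the API of any real connector or the wording of any data space rulebook: the `UsagePolicy` class, the `is_allowed` function, the role names, and the action names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class UsagePolicy:
    """Hypothetical policy that a data provider attaches to one data set."""
    dataset: str
    # Maps a role name to the set of actions that role may perform.
    allowed_actions: dict = field(default_factory=dict)


def is_allowed(policy: UsagePolicy, role: str, action: str) -> bool:
    """Return True only if the role has been explicitly granted the action."""
    return action in policy.allowed_actions.get(role, frozenset())


# Assumed example: the data may be used for AI training inside the data
# space, but only the provider itself may export it.
sensor_policy = UsagePolicy(
    dataset="forest-sensor-readings",
    allowed_actions={
        "provider": frozenset({"read", "train_model", "export"}),
        "ai_service": frozenset({"read", "train_model"}),
    },
)

print(is_allowed(sensor_policy, "ai_service", "train_model"))
print(is_allowed(sensor_policy, "ai_service", "export"))
```

The sketch denies by default: a role or action that the provider has not listed is refused, which mirrors the idea that access is granted by the data owner rather than assumed.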
Just as in GDI, source data sets will differ. Every organisation can create, house, and utilise its data in whatever manner it desires, be it on-premises or in the cloud. Controlled access to that data can be securely managed through an adapter such as the Eclipse Dataspace Connector.
All data sets within a data space are interoperable. That does not mean that all data needs to conform to the same format or schema, but rather that data sets can automatically be integrated and harmonised as required. For this, matching and mapping technology is used, such as annotations (“This is a parcel.”). ETL tools like hale»studio and hale»connect can use this metadata to automatically prepare data for processors in different parts of the data space.
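How annotations can drive harmonisation is easiest to see in a small example. The following Python sketch is an assumption-laden illustration, not how hale»studio or hale»connect work internally: the provider names, field names, the `ANNOTATIONS` table, and the `harmonise` function are all made up for this purpose.

```python
# Hypothetical annotations: each provider labels its own field names with
# shared concepts, so differing schemas can be harmonised automatically.
ANNOTATIONS = {
    "provider_a": {"flurstueck_id": "parcel_id", "flaeche_m2": "area_m2"},
    "provider_b": {"parcel_no": "parcel_id", "area_ha": "area_ha"},
}


def harmonise(provider: str, record: dict) -> dict:
    """Rename annotated fields to shared concept names and unify area units."""
    mapping = ANNOTATIONS[provider]
    # Fields without an annotation are dropped, since their meaning is unknown.
    result = {mapping[key]: value for key, value in record.items() if key in mapping}
    # Convert hectares to square metres so all records share one unit.
    if "area_ha" in result:
        result["area_m2"] = result.pop("area_ha") * 10_000
    return result


record_a = harmonise("provider_a", {"flurstueck_id": "A-1", "flaeche_m2": 2500.0})
record_b = harmonise("provider_b", {"parcel_no": "B-7", "area_ha": 0.25, "owner": "n/a"})
print(record_a)
print(record_b)
```

Both source records keep their original schema at the provider; only the annotated view that a processor receives is harmonised.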
Such processing services are themselves part of the data space. How these services are allowed to access and use the data is established within the communal rules, for example whether or not a service may temporarily cache the data. Trust plays an important part in this. Starting in 2022, processing services are able to obtain certifications. Once a service is certified, all participants in the data space can be certain that this service will only do exactly what it claims to do.
Which issues do data spaces solve?
Creating a data space only makes sense for a concrete use case in which vital data gaps can be closed through the use of previously inaccessible data. These data gaps need to be defined and thoroughly documented.
Such a data gap also exists in scenarios where data is available, but not in sufficient quantity to train a useful AI model. Within the security of the data space, a much greater amount of training data can be made available. Since only the final AI model is exported from the data space, the confidentiality of the training data remains intact.
There is another problem data spaces solve. It is common practice for modern platforms to siphon off and sell large amounts of data without any input from the subjects of said data, be they companies or private citizens. Within a data space, rules can be established not only to secure data sovereignty, but also to allow a more balanced division of the value generated through that data, for example through a “pay as you go” model.
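One possible reading of such a usage-based model is that every access to a data set is logged and the fee for that access is credited to the provider of the data. The Python sketch below illustrates this arithmetic only; the flat per-access price, the provider names, and the `split_revenue` function are assumptions, not rules taken from any existing data space.

```python
from collections import Counter


def split_revenue(access_log: list, price_per_access: float) -> dict:
    """Credit each provider with the fee for every logged access to its data."""
    counts = Counter(access_log)
    return {provider: n * price_per_access for provider, n in counts.items()}


# Assumed log: one entry per access, naming the provider whose data was used.
log = ["forest_owner_a", "forest_owner_b", "forest_owner_a"]
shares = split_revenue(log, price_per_access=0.5)
print(shares)
```

Real data spaces could weight accesses by data volume or purpose instead of charging a flat price; the point of the sketch is only that value follows actual use.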
In order to provide this data sovereignty, the data space has to be built upon hardware, software, and operating systems that have been designed and secured to allow for it. Therefore, the data spaces’ infrastructure is being created in collaboration with GAIA-X, Europe’s distributed cloud platform.
What does this mean in the real world?
The AI pilot project FutureForest.ai is a great way to illustrate the usefulness of data spaces. In this project, wetransform collaborates with TU Munich and TU Berlin, as well as several German state forests and forest research institutes, to create a data space for forestry data. All these organisations provide access to their data within the data space, so that better decisions can be made about climate-adapted forest conversion. The data space combines public data, such as elevation models and land cover maps, with private data, such as sensor data and detailed information from location mapping. The forest owners contribute their data and in return are able to leverage better decision-making models.
The last decade has brought great strides in spatial data accessibility, chiefly through open data initiatives. Unfortunately, a lack of attention paid to organisational frameworks and data usage conditions often still hampers progress. Through the use of data spaces, such as the one for forestry, this will change for the better.
More than Projects – the Environmental Data Spaces Community
It’s still early days in the geo and GIS community when it comes to the implementation and usage of data spaces. Many projects are being launched, both nationally and internationally. In order to create a network between all the different parties currently involved in these projects and to provide more developmental continuity, wetransform has established the Environmental Data Spaces Community. Aided by several partners and the framework laid out by the International Data Spaces Association, which sets the standards for data spaces, wetransform supports the creation of diverse data ecosystems with the goal of making environmental data accessible and usable inside a secure data space that protects data sovereignty.
More information about the Environmental Data Spaces Community, and about how to join it, can be found here.
This article originally appeared in German in gis.business 2-2022, 25-27.