Modern environmental protection is built on how well we understand what is happening in the world.
Artificial Intelligence (AI) approaches such as machine learning were developed to augment what human experts can understand. They help data scientists analyse vast amounts of data faster and expose patterns that would otherwise remain hidden, without ever getting tired.
To do all of that, all they need is data.
Machines learn by analysing data.
The basic methods for machine learning date back to the 1950s, but it wasn’t until the 2000s that scientists truly hit their stride with data-driven approaches. The more data became available, the more effective their models could be.
Training an AI means showing it lots of data so that it can recognise patterns and learn how to respond to them. The amount of relevant data that can be acquired and fed into the AI’s model directly impacts how well that model performs.
The gathering and cleaning of this all-important data makes up the first half of a process called data engineering. In this part of the process, data scientists hunt down and comb through historical data from a multitude of sources. Because this data is scattered, uses different vocabularies, and contains many inconsistencies, human expertise is required to put everything into a format the model can understand.
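To make this cleaning work concrete, here is a minimal sketch in Python using pandas. The stations, column names, species vocabulary, and unit conversion are hypothetical stand-ins for the kinds of inconsistencies data scientists resolve by hand:

```python
import pandas as pd

# Two hypothetical monitoring datasets that describe the same thing
# with different vocabularies and units.
station_a = pd.DataFrame({
    "tree_species": ["Picea abies", "Fagus sylvatica"],
    "height_m": [24.3, 18.7],
})
station_b = pd.DataFrame({
    "species": ["spruce", "beech"],
    "height_ft": [71.5, 65.2],
})

# Map station B's local vocabulary onto the scientific names used by
# station A, and convert feet to metres.
SPECIES_MAP = {"spruce": "Picea abies", "beech": "Fagus sylvatica"}
station_b = station_b.assign(
    tree_species=station_b["species"].map(SPECIES_MAP),
    height_m=station_b["height_ft"] * 0.3048,
)[["tree_species", "height_m"]]

# Only after this manual harmonisation can the sources be combined
# into a single training set.
training_data = pd.concat([station_a, station_b], ignore_index=True)
print(training_data)
```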
The other half of the data engineering process, feature engineering, focuses on tailoring the collected data to the model to make it work better. For example, data scientists can combine two highly correlated variables into a single input feature, transform words into vectors, or even utilise the output of a different model as an input for the new one (a technique known as “stacking”, a form of ensemble learning).
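As an illustration, the sketch below shows two of these techniques using scikit-learn: collapsing two correlated variables into a single feature with a one-component PCA, and turning words into simple count vectors (word embeddings work similarly but produce dense vectors). The measurements and texts are invented examples, not data from a real project:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, highly correlated measurements of the same trees
# (e.g. crown diameter and stem diameter).
crown_d = np.array([3.1, 4.0, 2.8, 5.2])
stem_d = np.array([0.31, 0.42, 0.27, 0.55])

# Combine the two correlated variables into a single input feature
# by projecting them onto their first principal component.
combined = PCA(n_components=1).fit_transform(
    np.column_stack([crown_d, stem_d])
)

# Transform words into vectors; a simple count-based representation
# is used here in place of learned embeddings.
texts = ["healthy spruce stand", "bark beetle damage"]
word_vectors = CountVectorizer().fit_transform(texts).toarray()

print(combined.ravel())
print(word_vectors)
```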
Data engineering typically makes up roughly 80% of the work in any machine learning implementation.
Imagine how much better the models could work if data scientists could put all of that time into feature engineering and innovation rather than gathering and cleaning data.
If large pools of harmonised data are made readily available, they can do just that.
Large pools of harmonised data can be contained within data spaces. Though the data in these spaces stems from different sources, it adheres to common standards. That means the data is both easily found and integrated, freeing up the data scientists’ time to do more data-centric modelling and push innovation forward.
The other advantage harmonised data spaces bring is reproducibility.
In machine learning, much like in any other scientific field, the ability to reproduce past experiments is important: reproduction confirms results and provides the baseline for new experiments that improve on them.
Shared data spaces allow different data scientists to easily access the exact same data sources used in an original experiment, likely with even more data present due to the passage of time. This makes it far easier to reproduce the original and build upon it with new projects.
Existing models can be verified and adapted to function in different-yet-similar scenarios. A model that aids forestry in Germany can aid forestry anywhere in the world, provided the data scientists can feed it the matching data for their local scenario.
With harmonised data spaces, they can.
Europe is on its way to becoming a climate-neutral continent in less than 30 years. To make decisions along this challenging path that make economic sense while also improving biodiversity and resilience, access to environmental and climate data will be essential.
By creating the Environmental Data Spaces Community, wetransform wants to support the establishment of data ecosystems built around environmental data. Members of this community include public authorities as well as industrial and academic organisations. Together, we will pursue our objective of making environmental data accessible in a safe data space that ensures data sovereignty.