gml:id, gml:identifier and the InspireID – Clarifications and Best Practices
Many people who create GML, and in particular INSPIRE GML, hit some common challenges around identifying features. In part, these come from technical requirements of XML/GML, and in part they come from INSPIRE requirements.
An INSPIRE feature will generally have three properties that identify objects, each with a different purpose:
gml:id: This is the mandatory XML element ID, encoded as an attribute of the element. It uniquely identifies that element in the current document and serves as the target of an XLink. It has to match a defined pattern, e.g. it must start with a letter or underscore. It is first and foremost a technical identifier, though it should be stable over time (e.g. over multiple transformation runs) and should thus be grounded in a property of the source feature. XLink references across documents can only work if it is stable over time. The gml:id is also used by the WFS standard stored query GetFeatureById.
inspireId: This is a specific, often mandatory, complex property of INSPIRE objects, which consists of three sub-properties: namespace, localId and version. The INSPIRE ID should be stable, and is usually used to clearly identify the object in its specific domain. Often, existing keys are re-used 1:1 as the localId.
gml:identifier: This is the optional external element ID, i.e. it should include a namespace to make it globally unique, not just unique in the current document. It is a standard property of all GML objects, is encoded as an element, and is of the type gml:CodeWithAuthorityType. This is also a technical identifier which should be stable over time. INSPIRE recommends using the localId from the inspireId to build the gml:identifier, and INSPIRE gml:identifiers use this codespace: http://inspire.ec.europa.eu/ids
Here is a complete example for a feature with all three properties:
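The sketch below assembles such a feature in Python with the standard library's ElementTree. It is only an illustration: the feature type Watercourse, the `app` element namespace, the dataset namespace and the local ID are hypothetical placeholders, the codespace value is the one we understand the INSPIRE encoding guidelines to prescribe, and the version sub-property of the inspireId is left out for brevity.

```python
import xml.etree.ElementTree as ET

GML_NS = "http://www.opengis.net/gml/3.2"
BASE_NS = "http://inspire.ec.europa.eu/schemas/base/3.3"
APP_NS = "http://example.org/app"  # placeholder, not a real INSPIRE schema namespace
INSPIRE_CODESPACE = "http://inspire.ec.europa.eu/ids"  # assumed codespace value

ET.register_namespace("gml", GML_NS)
ET.register_namespace("base", BASE_NS)
ET.register_namespace("app", APP_NS)


def build_feature(type_name: str, local_id: str, namespace: str) -> ET.Element:
    """Build a feature element carrying gml:id, gml:identifier and inspireId."""
    # gml:id is an XML attribute on the feature element itself.
    feature = ET.Element(f"{{{APP_NS}}}{type_name}", {f"{{{GML_NS}}}id": local_id})

    # gml:identifier is a child element; its text combines namespace and local ID.
    gml_identifier = ET.SubElement(
        feature, f"{{{GML_NS}}}identifier", {"codeSpace": INSPIRE_CODESPACE}
    )
    gml_identifier.text = namespace.rstrip("/") + "/" + local_id

    # inspireId is a complex property wrapping a base:Identifier.
    inspire_id = ET.SubElement(feature, f"{{{APP_NS}}}inspireId")
    identifier = ET.SubElement(inspire_id, f"{{{BASE_NS}}}Identifier")
    ET.SubElement(identifier, f"{{{BASE_NS}}}localId").text = local_id
    ET.SubElement(identifier, f"{{{BASE_NS}}}namespace").text = namespace
    return feature


feature = build_feature(
    "Watercourse", "Watercourse_123456", "https://registry.example.org/id/dataset-a"
)
print(ET.tostring(feature, encoding="unicode"))
```

Note that the same local ID appears three times: as the gml:id attribute, as the last path segment of the gml:identifier, and as the localId.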
The following guidelines on how to form these different types of IDs are partially based on the guidelines that the German AdV (Arbeitsgemeinschaft der Vermessungsverwaltungen der Länder der Bundesrepublik Deutschland) has developed. We’ve used those successfully in hundreds of transformation projects.
For the gml:identifier and for the INSPIRE ID, you will need to define a dataset namespace. The dataset namespace needs to be unique within the entire INSPIRE infrastructure, and will be used for one data set only. There are two common patterns for these namespaces:
- Technical namespace: There is one central namespace for all resources in your local spatial data infrastructure. All resources get a technical identifier, such as a UUID, which together with the registry URL forms the dataset namespace. Such namespaces are easy to generate, and collisions are very unlikely.
- Semantic namespace: A semantic namespace identifies the data owner, as well as some properties of the dataset, such as the INSPIRE theme it belongs to, and what data it was derived from. This is a real example: http://www.swisstopo.ch/inspire/au/4.0/swissboundaries3d/
Both approaches have some advantages and disadvantages, so it comes down to what you want to achieve by using the namespaces. For all kinds of namespaces, there is often a national or regional registry (such as the GDI-DE Registry) where INSPIRE implementers have to register their organisation and dataset namespaces.
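To make the two patterns concrete, here is a minimal Python sketch. The registry URL is fictitious, and the way the semantic namespace is split into owner, theme, version and dataset name is our reading of the swisstopo example above, not a fixed rule.

```python
import uuid


def technical_namespace(registry_url: str, dataset_uuid=None) -> str:
    """Registry URL plus a technical identifier (here a random UUID)."""
    dataset_uuid = dataset_uuid or uuid.uuid4()
    return f"{registry_url.rstrip('/')}/{dataset_uuid}"


def semantic_namespace(owner_url: str, theme: str, version: str, dataset: str) -> str:
    """Namespace that names the data owner, the INSPIRE theme and the source dataset."""
    return f"{owner_url.rstrip('/')}/inspire/{theme}/{version}/{dataset}/"


# Fictitious registry; the UUID differs on every call unless one is passed in.
print(technical_namespace("https://registry.example.org/id"))

# Reproduces the structure of the real swisstopo namespace quoted above.
print(semantic_namespace("http://www.swisstopo.ch", "au", "4.0", "swissboundaries3d"))
```

A technical namespace has to be generated once and then stored, because generating a new random UUID on each run would break the stability of every identifier built on it.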
General Rules for IDs
In most situations, we recommend keeping the values for gml:id, localId and the local part of the gml:identifier identical. Since we often generate multiple INSPIRE objects of different INSPIRE feature types from one source object, we need to differentiate these objects and thus prefix the domain key with the INSPIRE type name, so that the ID consists of the type name, a separator and the domain key.
We have used both underscores and points to separate the INSPIRE type name from the domain key; there is no inherent difference. The domain key has to be a unique property across all source objects, or it has to be generated. Using a unique source property is highly preferred, as only that guarantees a stable ID over multiple transformation or generation runs.
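As a minimal sketch of this rule in Python: the function names, the type name Watercourse, the domain key and the dataset namespace below are all made up for illustration.

```python
def make_feature_id(type_name: str, domain_key: str, separator: str = "_") -> str:
    """Prefix the domain key with the INSPIRE type name."""
    if separator not in ("_", "."):
        raise ValueError("separator should be an underscore or a point")
    return f"{type_name}{separator}{domain_key}"


def identifiers_for(type_name: str, domain_key: str, namespace: str) -> dict:
    """Derive all three identifier values from one type name and one domain key."""
    local_id = make_feature_id(type_name, domain_key)
    return {
        "gml:id": local_id,
        "localId": local_id,
        "gml:identifier": namespace.rstrip("/") + "/" + local_id,
    }


ids = identifiers_for("Watercourse", "123456", "https://registry.example.org/id/dataset-a")
for name, value in ids.items():
    print(name, "=", value)
```

Because every value is derived from the same two inputs, re-running the transformation on unchanged source data yields the same identifiers.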
In some cases, the source objects have a unique domain key that uses a problematic format (e.g. containing spaces or backslashes). If uniqueness can still be guaranteed after removing the special characters, you can just strip them. Otherwise, we recommend using the source domain key as input to generate either a UUID or a hash value. To generate a hash value, we recommend the SHA-256 algorithm.
This approach has several advantages:
- It guarantees a valid ID, which needs to start with a non-numerical character
- It differentiates multiple objects created from one source object
- It immediately tells a viewer what kind of object this is, which is especially useful in references
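The handling of problematic domain keys described above can be sketched as follows. This is an illustration under stated assumptions: the set of allowed characters and the ID pattern are a simplified, ASCII-only approximation of the XML rules for IDs, and the use of a name-based UUID (`uuid.uuid5` with the URL namespace) is our choice for getting a reproducible UUID from a key.

```python
import hashlib
import re
import uuid

# Simplified ASCII-only approximation of a valid XML ID (assumption).
VALID_ID = re.compile(r"^[A-Za-z_][A-Za-z0-9_.\-]*$")
UNSAFE_CHARS = re.compile(r"[^A-Za-z0-9_.\-]")


def strip_special(key: str) -> str:
    """Remove characters such as spaces or backslashes."""
    return UNSAFE_CHARS.sub("", key)


def hashed_key(key: str) -> str:
    """SHA-256 hash of the key as 64 hexadecimal characters."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def uuid_key(key: str) -> str:
    """Deterministic, name-based UUID derived from the key."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))


def safe_domain_keys(keys: list) -> dict:
    """Strip special characters if all keys stay unique and non-empty, else hash."""
    stripped = [strip_special(k) for k in keys]
    if all(stripped) and len(set(stripped)) == len(set(keys)):
        return dict(zip(keys, stripped))
    return {k: hashed_key(k) for k in keys}


print(safe_domain_keys(["A 1", "B 2"]))    # stripping is enough here
print(safe_domain_keys(["A 1", "A\\1"]))  # stripping would collide, so keys are hashed
```

A hash or UUID can start with a digit, which is one more reason for the type name prefix: `VALID_ID.match("Watercourse_" + hashed_key("A 1"))` succeeds, while the bare hash may not.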
The question of how you can build references is often the key to determining which source domain key is best. This requires a stable, reproducible generation method that we can also employ in places where the original source object was referenced. So, when in doubt, your domain key should always be the value that is used in the existing data to create references (e.g. a foreign key in a database table).
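A small sketch of why this matters, using made-up source rows: when the referencing side applies the same generation function to its foreign key, the reference resolves to the ID of the target object without any lookup. The row structure and the type name are hypothetical, and the `#` prefix is the usual form of a document-local XLink href.

```python
def make_feature_id(type_name: str, domain_key: str, separator: str = "_") -> str:
    """Prefix the domain key with the INSPIRE type name."""
    return f"{type_name}{separator}{domain_key}"


def local_href(target_type: str, foreign_key: str) -> str:
    """Build a document-local reference from the foreign key of the source row."""
    return "#" + make_feature_id(target_type, foreign_key)


# Hypothetical source table: each municipality row points to its district.
districts = [{"id": "D17"}]
municipalities = [
    {"id": "M001", "district_id": "D17"},
    {"id": "M002", "district_id": "D17"},
]

district_ids = {make_feature_id("AdministrativeUnit", d["id"]) for d in districts}

for row in municipalities:
    href = local_href("AdministrativeUnit", row["district_id"])
    # The reference target exists because both sides used the same rule.
    assert href[1:] in district_ids
    print(make_feature_id("AdministrativeUnit", row["id"]), "->", href)
```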
There are many cases where we create an INSPIRE object from multiple source objects. As an example, we merge a set of WaterwayLinks to create InlandWaterways. In this case, we still want to create a stable ID for the merged object. We do this by concatenating the domain keys of all the merged objects and then calculating the SHA-256 hash value of the resulting string. This gives us a long, but still manageable ID, since the hash is always 64 hexadecimal characters long.
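A minimal sketch of the merge rule in Python. Sorting the keys before concatenation is our assumption; it keeps the ID stable when the source objects arrive in a different order. The type name and the keys are made up.

```python
import hashlib


def merged_feature_id(type_name: str, domain_keys: list, separator: str = "_") -> str:
    """ID for an object merged from several source objects."""
    # Assumption: sort first so the result does not depend on input order.
    concatenated = "".join(sorted(domain_keys))
    digest = hashlib.sha256(concatenated.encode("utf-8")).hexdigest()
    return f"{type_name}{separator}{digest}"


link_keys = ["WL-0003", "WL-0001", "WL-0002"]
merged_id = merged_feature_id("InlandWaterway", link_keys)
print(merged_id)
print(len(merged_id))  # type name + separator + 64 hex characters
```

Plain concatenation cannot tell the key lists `["ab", "c"]` and `["a", "bc"]` apart. If your domain keys have variable length, joining them with a delimiter that cannot occur in a key avoids that ambiguity.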
Another approach would be to use the value that was used to group objects as the domain key part. This creates a semantically meaningful identifier, for example one based on a name from which the special characters have been stripped.
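As a sketch of this variant, the following Python function strips everything except ASCII letters and digits from the grouping value. The waterway name is a made-up example, and restricting the result to ASCII is an assumption that you may want to relax if your names differ only in accented characters.

```python
import re


def grouped_feature_id(type_name: str, group_value: str, separator: str = "_") -> str:
    """ID built from the value the source objects were grouped by."""
    stripped = re.sub(r"[^0-9A-Za-z]", "", group_value)
    if not stripped:
        raise ValueError("group value has no usable characters")
    return f"{type_name}{separator}{stripped}"


print(grouped_feature_id("InlandWaterway", "Main-Donau Kanal"))
# InlandWaterway_MainDonauKanal
```

This only yields unique IDs if the stripped grouping values are themselves unique, so it is worth checking for collisions before relying on it.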
In some cases, we need to disassemble a source object and create many INSPIRE objects of the same type from it. The most common use case for this is when the INSPIRE schema only allows simple geometries, and we have to split up a MultiGeometry. In this case, we apply the same rules as for the simple 1:1 creation, but add a postfix to the ID that uses the index of the property on which we split the object. The index starts with 1, not with 0, so the 22nd object created from splitting a source object gets the postfix 22.
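A short Python sketch of the split rule. The type name, the domain key and the use of the same separator before the index are assumptions for illustration.

```python
def split_feature_ids(type_name: str, domain_key: str, part_count: int,
                      separator: str = "_") -> list:
    """IDs for all objects created by splitting one source object."""
    base = f"{type_name}{separator}{domain_key}"
    # The index postfix starts with 1, not with 0.
    return [f"{base}{separator}{index}" for index in range(1, part_count + 1)]


ids = split_feature_ids("Watercourse", "4711", part_count=30)
print(ids[0])   # Watercourse_4711_1
print(ids[21])  # Watercourse_4711_22, the 22nd object
```

The IDs stay stable only as long as the order of the parts in the source MultiGeometry does not change between runs.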
In a join, we use multiple objects of different source types to form an INSPIRE object. As an example, we might join a Municipality and a District object to create an AdministrativeUnit with references to lowerLevelUnits. If there is a reason why we can’t just use the domain key of the District, our recommendation is to use multi-component IDs for this case as well. In a join, there is always a “focus” or “root” object, to which matching objects of other types are added. In this example, we try to find all Municipalities belonging to the District, so the District is the focus object. We use the domain key of this root object as we would for a simple 1:1 creation. However, we then add another key created by concatenating the domain keys of the joined objects (the Municipalities), as in the merge case. This means we take the concatenated IDs of the Municipalities and create a SHA-256 hash value, which is then appended to the other parts of the ID.
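The join rule can be sketched in Python like this. The keys are made up, the separator between the components is an assumption, and sorting the joined keys before hashing is our addition to keep the result independent of the order in which the Municipalities are found.

```python
import hashlib


def joined_feature_id(type_name: str, root_key: str, joined_keys: list,
                      separator: str = "_") -> str:
    """ID made of type name, root object key and a hash over the joined keys."""
    # Assumption: sort the joined keys so their order does not affect the ID.
    concatenated = "".join(sorted(joined_keys))
    digest = hashlib.sha256(concatenated.encode("utf-8")).hexdigest()
    return separator.join([type_name, root_key, digest])


district_key = "D17"
municipality_keys = ["M002", "M001", "M003"]
print(joined_feature_id("AdministrativeUnit", district_key, municipality_keys))
```

With this scheme the ID changes whenever the set of joined Municipalities changes, while the root key component stays readable.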
Creating stable IDs that can be referenced is somewhat complex. However, we’ve used the rules above, as well as some variants, in a few hundred projects by now, and they work very well. Do you have ideas on how to improve or complement them? Let us know!