Importing Shapefiles in SAP HANA and the validity of geometries
Shapefiles are a popular geospatial data format that ESRI established in the early 1990s and that almost every geographic information system (GIS) can read today. The geospatial data are stored as vector features consisting of Points, Lines and Polygons. Since SPS7, SAP HANA can process spatial data, and with SAP HANA Spatial it is possible to import Shapefiles into the in-memory database. Because SAP HANA adheres strictly to the standards of the Open Geospatial Consortium (OGC), it currently rejects the import of trivially invalid geometries. In the following, I will explain what invalid geometries are and how to fix them so they can be processed with SAP HANA Spatial.
Polygons and Multipolygons with invalid geometries
Vector data used by GIS (including Shapefiles) consists of Points, Lines or Polygons. Ideally, all of these features are valid, but data can contain invalid features. A Point is invalid only if it is located outside of the spatial reference system (SRS). A Line is invalid under the same condition and also, for instance, if it consists of only one Point. Polygons, unlike Points and Lines, can be invalid for several reasons. In the following, I want to show these reasons and some examples of Polygons with invalid geometries.
A Polygon consists of one or more rings that are made up of contiguous Lines. Those Lines are defined by Points with either Cartesian coordinates (x- and y-coordinates) or geographic coordinates (longitude and latitude). Each ring needs one Point that is both its start point and its end point. According to the OGC, a Polygon is a planar surface defined by one exterior boundary and zero or more interior boundaries. Those boundaries are represented by the outer ring and the inner ring(s). Each inner ring can be seen as a “hole” in the Polygon. The rings of a Polygon have a ring orientation that is either clockwise or counterclockwise. The orientation is determined by the order in which the Points are listed, and the SRS defines which orientation is expected. Most of the time, if not defined differently in the SRS, the outer ring is oriented clockwise while the inner rings are oriented counterclockwise. Imagine walking along the outer ring of a Polygon: everything on your right-hand side is considered to be inside the Polygon, while everything on your left-hand side is considered to be outside the Polygon.
The ring orientation can be a first barrier when importing Shapefiles. If the coordinates are listed in the wrong order, the ring orientation is incorrect. For example, if you try to import a Polygon representing your office building and the outer ring has the wrong orientation, your Polygon excludes your building and includes everything on earth except this building. Because this Polygon is simply too big and spans more than a hemisphere, SAP HANA does not allow such geometries to be imported. But don’t worry, I will explain later which tools can be used to solve this problem.
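As a rough illustration of the orientation rule, the following Python sketch derives the orientation of a ring from its signed area (the shoelace formula) and reverses the Point order if the ring is not clockwise. It assumes planar Cartesian coordinates with the y-axis pointing up. The function names and the sample footprint are hypothetical, and the clockwise convention for outer rings is the one described above, so it has to be flipped if your SRS expects the opposite.

```python
def signed_area(ring):
    """Shoelace formula for a closed ring of (x, y) tuples.

    The result is positive for a counterclockwise ring and negative
    for a clockwise ring (planar coordinates, y-axis pointing up).
    """
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def is_clockwise(ring):
    """True if the Points of the ring are listed in clockwise order."""
    return signed_area(ring) < 0


def ensure_clockwise(ring):
    """Return the ring unchanged if clockwise, otherwise in reversed order."""
    return ring if is_clockwise(ring) else ring[::-1]


# Hypothetical building footprint, listed counterclockwise.
building = [(0, 0), (4, 0), (4, 3), (0, 3), (0, 0)]

print(signed_area(building))   # 12.0
print(is_clockwise(building))  # False
print(ensure_clockwise(building))
```

Reversing the Point order keeps the ring closed, because the shared start and end Point simply swaps roles. For geographic coordinates on a round-earth SRS, a planar signed area is only an approximation and should not be relied on for large Polygons.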
Other reasons causing a Polygon to be invalid are self-intersections, no surface, dangling segments or a location outside the SRS. Such invalid geometries can result from different processes, for example loading, clipping or manipulating data. The OGC defines rules for the validity of Polygons and Multipolygons, which are listed below.
According to the OGC, a valid Polygon is defined as follows:
- Polygons are topologically closed
- All boundaries or rings must be closed
- The boundary of a Polygon consists of a set of rings that make up its exterior and interior boundaries
- The rings in the boundary of a Polygon may not cross and may touch only at a single Point
- The exterior of a Polygon with one or more holes is not connected. Each hole defines a connected component of the exterior. The interior of a Polygon, in contrast, must be connected
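Two of these rules are easy to check by hand: whether a ring is closed and whether it crosses itself. The following Python sketch shows one possible way to do this for planar coordinates. The function names and sample rings are hypothetical, and the check is deliberately simplified: it only detects segments that properly cross each other, not rings that merely touch or overlap along a segment, so it is not a full OGC validity test.

```python
def is_closed(ring):
    """A ring needs at least four Points, and the first must equal the last."""
    return len(ring) >= 4 and ring[0] == ring[-1]


def _orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1 or 0 (collinear)."""
    value = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (value > 0) - (value < 0)


def _segments_cross(a, b, c, d):
    """True if segment ab and segment cd cross at a single interior point."""
    return (_orientation(a, b, c) * _orientation(a, b, d) < 0
            and _orientation(c, d, a) * _orientation(c, d, b) < 0)


def has_self_crossing(ring):
    """Compare every pair of segments of the ring (quadratic, fine for a sketch)."""
    segments = list(zip(ring, ring[1:]))
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if _segments_cross(*segments[i], *segments[j]):
                return True
    return False


# Hypothetical sample rings.
square = [(0, 0), (2, 0), (2, 2), (0, 2), (0, 0)]
bowtie = [(0, 0), (2, 2), (2, 0), (0, 2), (0, 0)]  # self-intersecting
open_ring = [(0, 0), (2, 0), (2, 2), (0, 2)]       # not closed

print(is_closed(square), has_self_crossing(square))  # True False
print(is_closed(bowtie), has_self_crossing(bowtie))  # True True
print(is_closed(open_ring))                          # False
```

Neighbouring segments share an endpoint, which the strict crossing test ignores, so a regular ring is not reported as self-crossing.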
A collection of one or more Polygons that stand in a relation to each other is called a Multipolygon. A GIS handles a Multipolygon as one element or feature.
According to the OGC, a valid Multipolygon is defined as follows:
- The interiors of Polygons that are elements of a Multipolygon may not intersect and the boundaries may touch at only a finite number of Points
- All boundaries or rings must be closed
- Multipolygons are topologically closed
- The interior of a Multipolygon with more than one Polygon is not connected. More specifically, the interior has exactly one connected component for every Polygon that is part of the Multipolygon
In Figure 1 you can see examples of Polygons and Multipolygons with invalid geometry and their correct counterpart with valid geometry. The dark blue points represent the vertices of the outer ring while the light blue points represent the vertices of the inner ring(s).
Figure 1: (Multi-)Polygons with invalid geometries and their valid counterparts
Repairing Shapefiles with GIS Software
If you have trouble importing Shapefiles into SAP HANA because of invalid geometries, there are two easy options you can try. As examples, I want to show you a commercial tool from ESRI’s ArcGIS that is suited for repairing Shapefiles, followed by an open-source tool from QGIS. However, it is not guaranteed that these tools will correct all invalid geometries. Other GIS tools are also capable of repairing invalid geometries.
1. ESRI – “Repair Geometry” Tool:
In ESRI’s ArcGIS there is a data management tool called “Repair Geometry”. According to ESRI, this tool inspects each feature’s geometry for problems and fixes them. It works with Shapefiles and with feature classes stored in a personal geodatabase or file geodatabase.
You can find this tool in ArcMap in the ArcToolbox under Data Management/Features or by typing “Repair Geometry” into the Search window (Ctrl+F).
After opening the tool, select an input feature (in our case, simply the Shapefile that causes the trouble) and run the tool by pressing the OK button. The tool does not create a new Shapefile; it overwrites your input feature.
2. QGIS – “v.clean” tool (GRASS GIS 7 commands):
The free and open-source software QGIS includes a tool called “v.clean”, which cleans the topology of a vector map. It is actually a GRASS GIS tool that can be used through the user-friendly graphical interface of QGIS. You can find the tool in the processing toolbox under GRASS GIS 7 commands/vector or, again, by using the search window in the processing toolbox. Once you have opened the tool, you can select the layer to clean (again, simply the Shapefile that causes the trouble). Unlike the repair tool of ArcMap, the v.clean tool doesn’t overwrite your Shapefile, which means that after running the tool you will have two Shapefiles: one that still has the invalid geometries and a cleaned one. You can either write the output to a temporary file that exists only in your QGIS session or create a new Shapefile (I recommend the latter). You can also run the tool as a batch process to clean more than one Shapefile at once. The v.clean tool offers several other parameters, such as the different cleaning tools. For my test case I used the default settings, which worked fine. For those of you who want to take a closer look at the different parameters, they are described on the v.clean manual page of GRASS GIS.
In Figure 2, you can see a comparison of the output that the two tools produce and how they change the examples to make them valid. Again, the dark blue points represent the vertices of the outer rings, while the light blue points represent the vertices of the inner rings. Unlike in Figure 1, there are points with both colors. At these points there are vertices of the outer ring as well as vertices of the inner ring. The red (Multi-)Polygons are invalid, while the green ones are valid.
Figure 2: Comparison of the output produced by “Repair Geometry” and “v.clean”
As you can see, the fifth Polygon is still invalid after the repair process with both tools. Such a Polygon is an example of why the import into SAP HANA may still fail after using both tools. In such cases you can use either GIS to find the affected Polygons. In ArcGIS, the tool “Check Geometry” gives you a table with all invalid Polygons. A similar tool in QGIS is called “Topology Checker”. Just select the rule “must not have invalid geometries” and check the Shapefile for invalid geometries. The invalid Polygons will be highlighted in the map and also shown in a table. Once you have figured out which Polygons are affected, you can either repair them manually or try the following workaround.
3. Workaround – Creating a buffer:
If both options do not work for you, there is a third option which works with most GIS (including ArcGIS and QGIS): create a very small buffer around your Polygons. In my test case, a buffer of at most seven centimeters was enough for all of my Shapefiles. The procedure is similar to the workflow of the other options. You just have to select your input feature, output feature and the buffer distance. Most buffer tools offer several other parameters, but usually the default settings work fine. While ArcGIS can create a buffer in different units, QGIS uses a fixed unit for the buffer distance: it is only possible to create a buffer in the unit of the coordinate reference system. For example, if your Shapefile is referenced in WGS 84, it is only possible to create a buffer in decimal degrees.
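If the buffer distance has to be entered in decimal degrees, a metric distance such as seven centimeters needs to be converted first. The following Python sketch shows one rough way to do this. It assumes a spherical earth with a mean radius of 6,371 km, which is an approximation of the WGS 84 ellipsoid, and the function name and the sample latitude are hypothetical.

```python
import math

# Assumption: spherical earth with mean radius, not the WGS 84 ellipsoid.
EARTH_RADIUS_M = 6_371_000.0


def meters_to_degrees(distance_m, latitude_deg):
    """Approximate a metric distance as (latitude degrees, longitude degrees).

    One degree of latitude has roughly the same length everywhere,
    while one degree of longitude shrinks with the cosine of the latitude.
    """
    meters_per_degree = math.pi * EARTH_RADIUS_M / 180.0
    lat_degrees = distance_m / meters_per_degree
    lon_degrees = distance_m / (meters_per_degree * math.cos(math.radians(latitude_deg)))
    return lat_degrees, lon_degrees


# Seven centimeters at a hypothetical latitude of 50 degrees north.
lat_deg, lon_deg = meters_to_degrees(0.07, 50.0)
print(f"latitude:  {lat_deg:.3e} degrees")
print(f"longitude: {lon_deg:.3e} degrees")
```

Because a buffer tool takes a single distance, a buffer given in degrees is not equally wide in all directions on the ground. For a buffer this small, the difference is unlikely to matter, but picking the larger of the two values is the cautious choice if the buffer must reach a minimum width.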
The SAP HANA Spatial development team will soon present an extended validity concept. With this concept, it will be possible to import all Shapefiles, regardless of whether the geometries they contain are valid. Invalid geometries will be marked and excluded during spatial processing and can be found with the ST_IsValid function. Automatically correcting the data is still left to the customer and can be done as described above.