On Machine Learning Without Location Data
This blog is about analyzing the same dataset with and without considering the location dimension in order to quantify the benefit of handling spatial data.
Well, the title is a trap. Actually this blog is all about using location data in machine learning models. However, instead of preaching that the location dimension is equally important as time, I would like to show you an end-to-end example of analyzing the same dataset with and without location information. Eventually, we will compare two trained ML models and will quantify the impact of location information on our prediction quality.
So, cards on the table! This time it’s not just preaching!
Along the same lines, you will find the full Jupyter Notebook to replicate the example on my GitHub Repository. By going into the details, you will learn how to
- Upload a csv containing spatial data (housing sales) to HANA Cloud
- Train a regression model on the housing price without spatial features
- Retrieve and upload OpenStreetMap data for the respective areas
- Compute spatial features by using database functions
- Train the same regression model including those spatial features
- Compare both models
- Predict the price of a reference house across different locations across the city
Also, I will use some brand new features of SAP HANA Spatial, which have been introduced with HANA Cloud and HANA2 SPS05. Following features are covered by the example:
- Creation of predefined Spatial Reference System. HANA Cloud and HANA2 SPS05 carry the definitions of more than 9000 SRS under the hood. Simply issue the following statement to create a predefined SRS:
CREATE PREDEFINED SPATIAL REFERENCE SYSTEM IDENTIFIED BY <srs-id>
- Voronoi Tessellation. With HANA Cloud and HANA2 SPS05 we introduced the in-database Voronoi tessellation. SAP HANA is the first database to deliver Voronoi cells by preserving the relation between input points and output cells, which enables sophisticated advanced analytics usecases. Check out this video to learn more. Voronoi cells can be calculated by either calling ST_ClusterCell on k-Means cluster or using the ST_VoronoiCell window function on point sets.
ST_VoronoiCell(<point_column>, <margin>) OVER (<optional_over_parameters>)
- K-Means Clustering. This clustering method can be used to split points into a fixed number of clusters by minimizing the squared distance of each point to its cell centroid.
SELECT COUNT(*), ST_ClusterID(), ST_ClusterCell() FROM table GROUP CLUSTER BY point USING KMEANS CLUSTERS 30
- Hexagonal Clustering. The useful feature of in-database hexagonal clustering has already been discussed in many other blogs. This time, the clustering will not be used as a location proxy for ML models, but rather for map visualizations and generating a grid of points.
SELECT COUNT(*), ST_ClusterID(), ST_ClusterCell() FROM table GROUP CLUSTER BY point USING HEXAGON X CELLS 5;
- Last but not least the “Spatial bread-and-butter“:
About the Data
The dataset for this example can be downloaded on Kaggle. It contains almost 35000 housing sales from years 2016, 2017 and 2018 in the City of Melbourne, Australia. Unfortunately only 27000 are geo-referenced with proper longitude and latitude information. Our analysis will be based upon those 27000 records.
We will use the data of 2016 and 2017 to train our model (86% of records). The model will then be applied to the data of 2018 (14% of records) for testing the prediction quality.
In the initial dataset there are already a few columns included, which act as a proxy for geographic location information. These are suburb, address, distance (to city center), postcode, councilarea, regionname and propertycount (within the neighbourhood). We dismiss all those columns to train a model which is solely relying on the attributes of the house itself. Afterwards we will use only latitude and longitude information to generate similar (or even better) spatial features on a database level.
These are the features, that we will consider in the following:
- id – identifier for the record
- type – type of property
- h – house,cottage,villa, semi,terrace
- u – unit, duplex
- t – townhouse
- rooms – number of rooms
- bedroom2 – number of bedrooms
- bathroom – number of bathrooms
- car – number of carspots
- landsize – land size in sqm
- buildingarea – building size in sqm
- yearbuilt – construction year
- shape – lat/lon information
- price – the sales price in Australian Dollars
Our database representation of latitude and longitude will be a column of type ST_GEOMETRY(28355). SRS 28355 is a planar spatial reference system, which is suitable for Australia. This SRS is not installed per default on SAP HANA. However, with the latest release, the database already knows the definition of the SRS and we simply need to issue the following statement to make it usable:
CREATE PREDEFINED SPATIAL REFERENCE SYSTEM IDENTIFIED BY 28355
With this new feature, prior approaches such as the metadata installer tool or the creation of spatial reference systems by its definition got obsolete for EPSG Spatial Reference Systems. The units of measure, which are used by the predefined SRS, also come out of the box. No need to install them separately.
(Note that it is still possible to create spatial reference systems by its definition. It is just not required, when using the predefined SRS)
ML Model without Location Information
As a reference, we will first train an XGBoost regression model for the price incorporating the features listed above (except field shape, of course). To explain the model and understand the impact of features/attributes on the price, we leverage the SHAP library.
The median absolute error of our model is $220k on the training data (year 2016 and 2017) and $300k on the test data (year 2018). To set that into relation, we should note that the median housing price in the dataset is $910000. With this we can draw to conclusions:
- We are more than 30% off on our test dataset
- The behaviour of our model on training and test dataset differs significantly, which means that our model is not very robust for new data.
Still, let’s take a look at the model explanation by SHAP and the importance of features:
How do we read this graphic? The color corresponds to the value of the individual features. Red dots are records with a high value and blue dots records with a low value. The x axis describes the level on influence on the price (regression target). The vertical line at zero behaves ‘neutral’. The sorting of features corresponds to the importance for the model. This most important variable in this case is yearbuilt. Let’s look at two examples to make that clearer:
- yearbuilt. The higher the value of yearbuilt, the lower the price. This means old houses tend to be more expensive. What first sounds counter-intuitive (at least if it’s not a historic building) can possibly be explained by the fact, that older houses are typically located in long-established and thus central regions. New houses are often located in development areas in the suburbs. So eventually the spatial dimension may have sneaked in into our non-spatial model…
- type_t. The fact that the house is a townhouse does only have a low impact on the outcome. The values are all centered around 0. The fact if a property is only a unit (type_u) or a detached house (type_h) is far more important.
Compute Spatial Features
Now, let’s see if we can improve the model by enhancing our dataset with geo-spatial features. The geo-spatial features in our case can be divided into 3 sub-categories:
- Based on OSM Point of Interest data
- Artificial Region Proxy
- Areas of Interest
OpenStreetMap Point of Interest Data
To understand the surroundings and nature of the different neighbourhoods, we first of all retrieve and upload poi data from OpenStreetMap using the OSMnx library. To determine the query range, we first check which area is covered by our dataset using ST_ConvexHullAggr.
SELECT ST_ConvexHullAggr(SHAPE) FROM MELB_HOUSING_PREDICT_VIEW
Within the polygon above, we search for pois with tags amenity, shop or airport. Based on the OSM attributes, we categorize the pois into following types: entertainment, health, education, childcare, community, emergency, grocery, shopping, airport.
Overall we downloaded 50000 pois and selected around 15000, which we found relevant based on the categorization above. We can use mapping plugins, like Folium, to create heatmaps for different poi types.
Based on the poi data, we generate a bunch of features, which describes the distance of each house to the next poi of a given type as well as the density of pois around the listing.
We can use the following SQL view to efficiently query the next poi of a given type.
CREATE OR REPLACE VIEW MELB_HOUSING_CLOSEST_POI (IN POI_TYPE VARCHAR(100)) AS ( SELECT A.ID, A.SHAPE.ST_DISTANCE(B.SHAPE_PT) AS DIST_POI FROM MELB_HOUSING_PREDICT_VIEW A LEFT JOIN ( SELECT OSMID, SHAPE_PT, ST_VoronoiCell(SHAPE_PT, 10.0) OVER () AS VCELL FROM MELB_OSM_POI_VIEW WHERE POI_TYPE = :POI_TYPE ) B ON A.SHAPE.ST_INTERSECTS(B.VCELL) = 1 )
Note, that we first (inner select statement) used the Voronoi clustering to generate Voronoi cells for each poi of a certain type. The Voronoi cell covers the area, where each point is closer to the centroid (i.e. poi) of this cell than to any other centroid (i.e. poi). If we now intersect the coordinate of the house, with the Voronoi cells, we immediately get the closest POI by retrieving the record of the cell’s centroid. This is a nice example of using Voronoi cells for a non-visual purpose.
For counting the POIs of a certain type within a given distance (e.g. 1000 meter) around a listing, we use the following SQL view.
CREATE OR REPLACE VIEW MELB_HOUSING_COUNT_POI(IN POI_TYPE VARCHAR(100), IN DISTANCE_METER INTEGER) AS ( SELECT A.ID, COUNT(*) AS CNT_POI FROM MELB_HOUSING_PREDICT_VIEW A JOIN MELB_OSM_POI_VIEW B ON A.SHAPE.ST_WITHINDISTANCE(B.SHAPE,:DISTANCE_METER) = 1 WHERE :POI_TYPE IS NULL OR B.POI_TYPE = :POI_TYPE GROUP BY A.ID )
Artificial Region Proxy
In our initial dataset we dropped columns, such as suburb, which is a geo-spatial identifier for a region. However, a suburb or district has a political or administrative dimension. The area covered may change over time. On the other hand each of us knows, that this is important information for housing prices. There are simply expensive districts, where we sometimes may get the impression of paying for an address rather than a house.
We want to avoid this administrative dimension and chose a data-driven approach to infer regions from the data. In similar examples, I have used hexagonal clustering and used the individual cells as a proxy for a neighbourhood. This time I want to limit the number of regions and at the same time infer a meaningful clustering based on the density of listings.
The following SQL view is used to generate 30 k-means clusters and the Voronoi cells of their centroids. Each area/cell is given a name containing a simple ID (REGION-xx).
CREATE OR REPLACE VIEW MELB_REGION_PROXY AS ( SELECT ST_ClusterID() AS CLUSTER_ID, 'REGION-' || ST_ClusterID() AS CLUSTER_NAME, ST_ClusterCell() AS CLUSTER_CELL FROM MELB_HOUSING_PREDICT_VIEW GROUP CLUSTER BY SHAPE USING KMEANS CLUSTERS 30 )
Areas of Interest
In each city there are certain areas of interest, where the proximity to those areas influences the housing price. For the Melbourne example, we want to use two ‘areas’:
- CBD – the Central Business District of Melbourne
- Port Phillip – the bay/coastline of Melbourne
The CBD is in the very center of the city. It is itself a large area. If we would measure proximity only to a point in the center of the city, we would run into the risk of not correctly reflecting, that the “center” of a city is usually not a point, but a larger area. For this reason we calculate the closest distance from each house to the CBD polygon – and the coastline as well, as this is certainly a driver for housing prices – by using ST_Distance with polygons (CBD) and linestrings (Coastline).
The coastline is a linestring with a rather high precision. To measure the proximity of a house to the bay, it will be sufficient to reduce the number of points of the linestring while preserving the characteristic. We can use ST_Simplify to do so. The parameter determines, that the new (reduced) coastline should not deviate more than 100 meter from the original one.
SELECT ST_GeomFromText(:bay_poly, 3857).ST_Simplify(100) FROM DUMMY
ML Model including Location Information
Now let’s bring all the non-spatial and spatial features together, to see if we can improve our housing price prediction, that we conducted earlier. We added following geo-spatial features:
- Distance to Central Business District
- Distance to Coastline/Bay
- Region (identified by id)
- Closest poi of type: entertainment, health, education, childcare, community, emergency, grocery, shopping, airport
- Number of pois of the same categories within a distance of 1000 meter.
To be clear: For the pois, we generate one feature per type (e.g. dist_entertainment, cnt_entertainment, dist_healt, cnt_healt, etc.).
The median absolute error of our new model is $135k on the training data (year 2016 and 2017) and $130k on the test data (year 2018). Conclusions:
- We are 14% off on our test dataset
- The behaviour of our model on training and test dataset is similar. In fact it performs slightly better on test.
Also let’s again look at the variable contributions.
The impact of spatial features:
- dist_cbd – the distance to the CBD is the most important feature. The closer to the center, the higher the price.
- dist_bay – distance to the bay is the second most important spatial feature. House at the bay are more expensive.
- region – region 22, 11, 29, 25, 30 tend to be more expensive, whereas 5 and 7 tend to be cheaper.
- pois – the distance to the airport as well as the density of bars and restaurants have a certain (but smaller) impact.
Comparing both Models
Well, that’s easy. Let’s get the numbers mentioned above into one table.
|MAE on Test||$300k||$130k||$170k|
|MAE Deviation Train/Test||25%||5%||20%|
Bottom line: The enhancement with spatial features significantly improved the precision and robustness of our model.
Was this surprising? By no means! Think of what would happen, if your ML model or reporting ignores the time dimension for temporal data. Eventually your results would be useless… The same happen when dealing with data, that carries a spatial dimension.
Remember that also the non-geo-spatial model had the yearbuilt as most influencing factor, which can be explained with the fact, that old buildings tend to be in the city center. In this case the yearbuilt would have acted as a proxy for geo-spatial proximity to the city center (i.e. dist_cbd).
Predicting the Price of a Reference House
We could stop here. We have shown, that the handling of spatial data is worth the effort and that in fact the effort is not as high as you would expect, if you are using HANA’s in-database handling of spatial data. That was what I wanted to show in the first place.
However, having the right model at hand, we could now be curious, what the isolated impact of the spatial dimension on the housing price is.
Given the exact same house in different areas of the city. How would the price differ?
Before we all go to bed or continue working – let’s answer this very last question. By looking at the base data we can construct the “average house”.
CREATE OR REPLACE VIEW MELB_HOUSING_PREDICT_VIEW AS ( SELECT ST_ClusterId() AS ID, 3 AS ROOMS, 'h' as TYPE, 3 as BEDROOM2, 2 as BATHROOM, 2 as CAR, 600 as LANDSIZE, 160 as BUILDINGAREA, 1965 as YEARBUILT, ST_ClusterCell().ST_Centroid() AS SHAPE, ST_ClusterCell() AS SHAPE_CELL, 2020 as YEAR FROM MELB_HOUSING GROUP CLUSTER BY SHAPE USING HEXAGON X CELLS 75 )
With this view we spanned a grid of 75 hexagonal cells in x direction across the city. We can use the records above to do a prediction for the centroid of each cell. For the data at hand, that are 1301 price predictions for the exact same house at different locations.
We can use the hexagonal cell of the prediction centroid to visualize the distribution across the city. Since we use the same data preparation as before, it considers all our spatial features (pois, area of interest and regions) while fixing the non-spatial features to static values.
In-line with what we have identified during model debriefing, we see hotspots in the center of the city as well as on the coastline. We are also able to see expensive regions, which are neither close to the city nor to the coastline. Depending on the location the price for our 160sqm house varies between 330k Australian Dollar and 2.5m Australian Dollar.
If we go down the coastline to the very south, we will see a cell, where our reference house is worth 1.9 million Australian Dollar.
I leave it up to you to check the satellite imagery, if this makes sense…. I did for sure.