
If you've ever visited London you'll no doubt have seen those courier cyclists weaving in and out of traffic.  You might have seen some of them hurtling through red lights, mounting the pavement (or sidewalk for American readers) and frightening pedestrians.  Cycling in general is being encouraged by the Mayor of London.  Being a risk-averse scaredy-cat myself, I got to wondering how safe it is to cycle in London.  It certainly doesn't look very safe, so I set about attempting to quantify how safe - or otherwise - it is using HANA's spatial capabilities in SP8.

The UK government now publish road traffic accident data in a very convenient form: spreadsheets containing 1.4 million accidents spanning 13 years.  I used some of this data previously to allow accidents near your location to be viewed as heatmaps on your mobile phone.  Back then the accident data needed a license and additional work was necessary to get locations; the data is much easier to use now.

This article will show you how to get the necessary accident data, tidy it up using Linux tools, then perform some spatial analysis using SQL, all in an attempt to answer the question: how safe is cycling in London?

For the impatient, I can reveal that the data shows no increase in average accident severity if you're involved in a cycling accident in London vs the rest of the UK.  Good news for cyclists, then.  Perhaps this is because average speeds in the capital are lower, or perhaps my amateur statistical analysis is flawed.  But I'm getting ahead of myself.  Let's take this step by step:

1) Define the question we want to answer

2) Source the Data and Understand the Data Model

3) Do the Extract, Transform and Load (ETL) into HANA

4) Analyse Data using HANA SP8 Spatial capabilities

1) Define the question we want to answer

Ideally we'd like to know "how safe is cycling in London?".  To answer that properly, we would need the probability of a cycling journey ending in an accident.  That would mean knowing how many journeys there are, including those that were incident free, and more about their nature (distance, weather, lighting).  Data about accident-free journeys is not so easy to get.  Since we're not working to anyone's spec, let's redefine the question to suit the data available.  How about:

If you cycle in London and are in an accident, is it likely to be more serious than in the rest of the UK?

The above is much easier to answer with the available data.

2) Source the Data and Understand the Data Model

The source data is available here: http://data.gov.uk/dataset/road-accidents-safety-data.  The data model is very simple, as shown below:

The tables of interest are Accidents and Vehicles.  The Accidents table holds the worst case severity for an accident and its location.  The Vehicles table holds all vehicles involved, and the field Vehicle_Type = 1 identifies pedal cycles.  In fact, that is all we need.  It is enough to let us filter on cycles only and average the severity over "in London" and "not in London" buckets based on the accident location.

The last thing to consider here is what sort of spatial model we want to use in HANA.  HANA offers 3 spatial reference systems (SRS) in SP8, each identified by a Spatial Reference ID (SRID): SRID 0 (a simple Cartesian plane), SRID 4326 (the round-earth WGS84 system) and SRID 1000004326 (a planar version of WGS84).

Most of the articles you see use SRID 4326 (aka the "WGS84" system), which is the system used by GPS and treats the earth as a sphere.  However in HANA SP8, SRID 4326 does not allow tests of points being contained by shapes.  So in this article we're going to use SRID 1000004326.  This effectively takes the earth as a globe from SRID 4326 and stretches it out onto a flat planar surface.  Distances are distorted (think how stretched Antarctica looks at the bottom of a flat world map) but tests of "does A contain B" are possible.
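
To make the planar choice concrete, here is a tiny sanity check you could run (the polygon and point coordinates are purely illustrative, not from the dataset): a containment test in SRID 1000004326, which SP8 allows:

-- Illustrative containment test on the planar SRS 1000004326;
-- the same ST_Contains call on SRID 4326 is rejected in SP8.
select NEW ST_Polygon('Polygon((0 0, 0 10, 10 10, 10 0, 0 0))', 1000004326)
       .ST_Contains(NEW ST_Point('Point(5 5)', 1000004326)) as "contained"
from DUMMY;  -- returns 1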

Suitable HANA tables to hold the data can be found in this GitHub repository; see the RoadF.hdbdd file.

OK, now that we understand the data model, we're ready to get some data into the system.

3) Do the Extract, Transform and Load (ETL) into HANA

The data is supplied as three tables, one spreadsheet per table.  The data will make a journey through process steps a) to e) as shown below:

a) Use PSCP to Upload Files

First download the files from http://data.gov.uk/dataset/road-accidents-safety-data to your local PC.  Next we need to get the files onto the HANA box.  One method I'd not seen until recently uses a tool called PSCP.  The HANA system for this demo was a CAL instance, and if you use a CAL instance then you may already use PuTTY to connect to the backend Linux host.  When you install PuTTY on a Windows PC you also get a secure FTP client called PSCP.  PSCP can read the configurations you've set up in PuTTY, which makes it quite convenient for transferring files.

Let's use PSCP from the Windows command line to list some files on the Linux host, like this:

C:\Program Files (x86)\PuTTY>pscp -ls "HANAonBW":/usr/sap/HDB/home/

In the above, -ls is the Linux command to list files, "HANAonBW" is a saved PuTTY config that allows us to log in, and /usr/sap/HDB/home/ is the directory on the Linux box.  The PuTTY configs are those you see under "Saved Sessions" in PuTTY.

Now we are familiar with PSCP, it is easy to do the file transfer.  The syntax is like this: pscp <source file from Windows> <PuTTY config>:<target directory on Linux>:

C:\Program Files (x86)\PuTTY>pscp c:\temp\RTA\AccidentsF.csv "HANAonBW":/usr/sap/HDB/home/

b) Use AWK to convert date format

The data provided contains dates in a format that is not compatible with HANA.  We need to change the date format from 14/01/2005 to the HANA format 2005-01-14.  For this we're going to use a tool that comes with Linux called AWK.  Using PuTTY to connect to the Linux backend, run this series of Linux commands:

This runs pretty quickly, around 8 seconds to convert 1.4 million rows.  Taking each line in turn:

# line count of source file for checking
wc -l < AccidentsF.csv
# change field 10 to be a HANA-style date
awk -F"," '{OFS=","; $10=substr($10,7,4)"-"substr($10,4,2)"-"substr($10,1,2); print $0}' AccidentsF.csv > AccidentsFT.csv
# line count of target file for checking (expect 1494276, matching the source)
wc -l < AccidentsFT.csv

The AWK command is quite long and deserves a bit more elaboration:

-F","     - to use a field separator of comma

{         - begin actions

OFS=",";  - Output Field Separator, so that the output fields are also separated by a ,

$10=      - field number 10 will equal...

substr($10,7,4)"-"substr($10,4,2)"-"substr($10,1,2); - some strings manipulated to make new date

print $0; - to print out the whole line to the target file.

}         - end actions

AccidentsF.csv  - the input file

>               - is sent to

AccidentsFT.csv - the transformed output file (hence the added T)

The result of the above is a new file called AccidentsFT.csv, with the date formatted as 2005-01-14.

c) Upload via CTL files

This is covered already on SCN, so following the same principles: use PuTTY, move to the directory /usr/sap/HDB/home/ and type:

cat > roadaccF.ctl
import data
into table "ROAD"."roadsafety.data::RoadF.AccF"
from '/usr/sap/HDB/home/AccidentsFT.csv'
record delimited by '\n'
field delimited by ','
optionally enclosed by '"'
error log '/usr/sap/HDB/home/AccidentFTErr.txt'

Use [CTRL+D] to finish creating the CTL file.  Then upload the data using SQL from inside HANA Studio (the necessary target table definitions are in GitHub here):

IMPORT FROM '/usr/sap/HDB/home/roadaccF.ctl';
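
Once the import completes, a quick row count makes a useful sanity check; it should be close to the 1494276 lines counted on the Linux side (any rejected records will be listed in the error log):

-- Sanity check: count rows in the freshly loaded accidents table
select count(*) from "ROAD"."roadsafety.data::RoadF.AccF";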

d) Data Cleansing

Now we need a tiny bit of data cleansing.  The data contains a few records without location data, and these need to be removed:

delete from "ROAD"."roadsafety.data::RoadF.AccFT" where LAT is null and LON is null;
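
If you want to see the impact first, a quick count using the same predicate shows how many rows the delete will remove:

-- How many accident records have no location at all?
select count(*) from "ROAD"."roadsafety.data::RoadF.AccFT"
where LAT is null and LON is null;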

e) Add Spatial Field & Fill it

We cannot yet add the spatial field to our Accidents table via the .hdbdd file in HANA Studio; instead we have to add the field manually with SQL.  To do this I followed the article by rafael.babar and copied the data from the existing table to a new table, populating the spatial point in the process.  The SQL to do this is in GitHub.
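
For illustration, a minimal sketch of that copy step might look like the following (the real statement, with the full column lists, is in the GitHub repository):

-- Sketch: copy rows into the new table, building a planar point (SRID
-- 1000004326) from the existing LAT/LON columns as we go.  Latitude is
-- written first to match the coordinate order of the London polygon later.
insert into "ROAD"."roadsafety.data::RoadF.AccFT"
select T0.*, NEW ST_Point('Point(' || T0."LAT" || ' ' || T0."LON" || ')', 1000004326)
from "ROAD"."roadsafety.data::RoadF.AccF" T0;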

4) Analyse Data using HANA SP8 Spatial capabilities

I spent much time trying to get HANA models and SQLScript calculation views to work, and ran into various errors.  The recommendation seems to be to wait for SP9 for fuller spatial integration with HANA models, so I used pure SQL to analyse the data.

Before doing any SQL, we need to define what "in London" and "outside London" means.  For this I followed jon-paul.boyd's excellent blog and used Google Earth to draw a polygon around the area I was interested in:

That polygon is then exported as a series of coordinates, which are used in the SQL statement below.  Finally we're ready for some analysis!  This SQL returns the mean severity and variance of all accidents in London that involved a bicycle:

-- Just bicycles in London
select
  AVG (T0."SEVERITY") "Avg Severity",
  VAR (T0."SEVERITY") "Avg Severity Variance",
  SUM (T0."VEHCOUNT"),
  SUM (T0."CASCOUNT")
from ROAD."roadsafety.data::RoadF.AccFT" T0
left outer join "ROAD"."roadsafety.data::RoadF.VehF" T1
  on T0."ACCREF" = T1."ACCREF"
where T1."VEHTYPE" = 1  -- vehtype 1 is bicycle
and NEW ST_Polygon('Polygon((
  51.69021545178133 -0.1000795593465265,
  51.68262625218747 -0.1640894678953375,
  51.62673844314873 -0.5003652550731252,
  51.4687978441449 -0.5003020713080952,
  51.37537638345922 -0.2604447782463681,
  51.29248664506417 -0.1217913673590465,
  51.3298782058117 -0.02055237147410183,
  51.32142023464126 0.0993682688423303,
  51.34618151800474 0.1346959279977478,
  51.46491093248794 0.2133695972971839,
  51.54192930978628 0.3296565877570212,
  51.62542509952219 0.228648947683745,
  51.60811732442631 0.0851277551187013,
  51.67901853300752 -0.01341248134237749,
  51.69021545178133 -0.1000795593465265
))').ST_Contains("LATLON") > 0;  -- use = 0 for outside London

The results are, for cycling accidents inside and outside London:

Results

Location          Severity Mean    Severity Variance
Inside London     2.86076          0.12851
Outside London    2.81853          0.16485

Remember that a lower severity value is more serious: severity is measured as an integer where 1 = fatal, 2 = serious and 3 = minor.  The results suggest it is better to be involved in an accident inside London, because the average severity value there is higher (less serious).  Perhaps this is because car speeds are slower.

The next question is: are the above results statistically significant?  Could the difference be down to chance alone, or does it reveal a pattern in the underlying data?  This is beyond my (very) amateur statistics knowledge, and although there are plenty of examples online about "comparing the means of two populations", they all focus on taking samples from large populations where the variance is not known, whereas here we know every data point.  If anyone with statistics knowledge reads this, I'd be interested to know how to go about comparing these means.
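
For what it's worth, if a simple two-sample z-test were appropriate (a big assumption, since severity is an ordinal 1-3 value rather than a continuous measurement), the statistic z = (m1 - m2) / sqrt(v1/n1 + v2/n2) could even be computed in SQL.  A sketch, where ACC_WITH_FLAG stands for a hypothetical view that adds an IN_LONDON flag using the polygon above:

-- Two-sample z statistic comparing mean severity inside vs outside London;
-- |z| > 1.96 would suggest significance at the 5% level.
select ( max(case when IN_LONDON = 1 then M end)
       - max(case when IN_LONDON = 0 then M end) )
     / sqrt( max(case when IN_LONDON = 1 then V / N end)
           + max(case when IN_LONDON = 0 then V / N end) ) as "z"
from ( select IN_LONDON,
              avg(SEVERITY) as M, var(SEVERITY) as V, count(*) as N
       from ACC_WITH_FLAG  -- hypothetical view, see text above
       group by IN_LONDON ) T;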
