
Following the #sitFRA Demo Recap blog, here is the second part of the series.

This part will focus on how to import the data used in our demo from the GDELT database.

The GDELT Data

(source:http://gdeltproject.org/)

In this blog part, we will explain how to import the data into SAP HANA and SAP Vora.

But, first a quick recap of the dataset we have used.

The data we used for the demo were retrieved from version 1.0 of the GDELT project's Global Database of Events, Language, and Tone (GDELT).

The GDELT 1.0 event dataset comprises over 3.5 billion mentions of over 364 million distinct events from almost every corner of the earth, spanning January 1979 to the present, and is updated daily.

The file format documentation describes the various fields and their structure.

In addition, tab-delimited lookup files are available that contain the human-friendly textual labels for each of those codes to make it easier to work with the data for those who have not previously worked with CAMEO.

Lookups are available for both Event Codes and the Goldstein Scale.

In addition, details recording characteristics of the actors involved in each event are stored as a sequence of 3 character codes. Lookups are available for Event Country Codes, Type Codes, Known Group Codes, Ethnic Codes and Religion Codes.
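To make the "sequence of 3 character codes" concrete, here is a purely illustrative one-liner that splits such a composite actor code back into its parts (the code USAGOVMIL is a made-up example):

```shell
# Illustration: split a composite actor code into its 3-character sub-codes.
# "USAGOVMIL" is a made-up example (country USA plus type codes GOV and MIL).
echo "USAGOVMIL" | fold -w3
# → USA
# → GOV
# → MIL
```

Each resulting 3-character code can then be resolved against the corresponding lookup file.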

And you can download the complete collection as a series of Event CSV files here:

http://data.gdeltproject.org/events/index.html

You can find more details about GDELT v1 here:

http://gdeltproject.org/data.html#documentation

In our scenario, we decided to use SAP HANA to host the “dimension” data and SAP Vora to host the events data.

Importing the data into SAP HANA

The easiest and fastest way to import small data files like the lookups table from GDELT to SAP HANA is probably via the “Import Data From Local File” feature from SAP HANA Studio (or the Eclipse plugin for SAP HANA).

Download the lookup files (Country Codes, Event Codes, and the Goldstein Scale) locally with a “csv” file extension.
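If you prefer the command line, the downloads can be scripted with wget. The URLs below are assumed from the lookup links on the GDELT documentation page, so verify them there first; the script only prints the commands, and you can remove the leading echo to actually download:

```shell
# Sketch: print wget commands for the three lookup files used by the tables below.
# The URLs are assumptions based on the GDELT documentation page -- verify them,
# then remove the leading 'echo' to actually download.
BASE="http://gdeltproject.org/data/lookups"
for f in CAMEO.country CAMEO.eventcodes CAMEO.goldsteinscale; do
  echo wget -O "${f#CAMEO.}.csv" "$BASE/$f.txt"
done
```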

But first, let’s create the table structures to avoid any import issues (the import feature can create the table on the fly, but you may run into data type issues, which we want to avoid).

Open an “SQL Console” and execute the following script:

CREATE SCHEMA DEMO_HANA;

CREATE COLUMN TABLE "DEMO_HANA"."COUNTRY" (
	"CODE" NVARCHAR(3) NOT NULL,
	"LABEL" NVARCHAR(50),
	PRIMARY KEY ("CODE")
) UNLOAD PRIORITY 5 AUTO MERGE ;

CREATE COLUMN TABLE "DEMO_HANA"."EVENTCODE" (
	"CAMEOEVENTCODE" NVARCHAR(5) NOT NULL,
	"EVENTDESCRIPTION" NVARCHAR(100),
	PRIMARY KEY ("CAMEOEVENTCODE")
) UNLOAD PRIORITY 5 AUTO MERGE ;

CREATE COLUMN TABLE "DEMO_HANA"."GOLDSTEINSCALE" (
	"CAMEOEVENTCODE" NVARCHAR(5) NOT NULL,
	"GOLDSTEINSCALE" DECIMAL(6,1) CS_FIXED,
	PRIMARY KEY ("CAMEOEVENTCODE")
) UNLOAD PRIORITY 5 AUTO MERGE ;

Then, to start the import wizard, use the “File > Import” menu, then search for “Import Data From Local File” under “SAP HANA Content”.

Click on “Next”, then pick your “Target System”.

On the next dialog, use the browse button to pick your file, then make sure you apply the following settings:

  • Set “Field delimiter” to “Tab”
  • Check “Header Row”
  • Check “Import all data”
  • Select “Existing” for the “Target Table”, then using the “Select table” button pick the corresponding table

Then click on “Next”.

On the next screen, use the menu with the “Map by name” option:

Then you can click on “Finish” and repeat the operation for the other files.

Importing the data into SAP Vora

Since everything was running on my laptop, we didn’t import the full set of data files from GDELT into SAP Vora, but just the last 15 days of events.

You can start with the download of a few Event CSV files from here to your local machine:

http://data.gdeltproject.org/events/index.html

Then you can transfer the files to your SAP Vora virtual machine (using either the “shared folder” feature of your “hypervisor” or via FTP).

You can for example copy the files into a directory like: /home/vora/zip

Note: for those familiar with wget or curl, you can download the files directly to your virtual machine (but make sure your proxy is properly configured if you use one).
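Building on the wget suggestion, the URLs for the last 15 daily files can be generated with a small loop. This is a sketch assuming GNU date (as on the Vora Linux VM) and the YYYYMMDD.export.CSV.zip naming pattern visible on the index page:

```shell
# Sketch: print the URLs of the last 15 daily event files.
# Assumes GNU date (Linux) and the YYYYMMDD.export.CSV.zip naming pattern.
for d in $(seq 1 15); do
  day=$(date -d "-$d day" +%Y%m%d)
  echo "http://data.gdeltproject.org/events/${day}.export.CSV.zip"
done
# pipe the output to wget to download, e.g.:
# ... | wget -i - -P /home/vora/zip
```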

I also recommend transferring the zipped version of the files, as the transfer will be faster; you will then need to unzip them on the virtual machine with commands like these:

mkdir /home/vora/csv
unzip "/home/vora/zip/*.zip" -d /home/vora/csv

The uncompressed files are “CSV” files, but they use tabs as delimiters and no quotes around fields, which makes the import into the disk engine harder (the relational engine handles them fine).

So, you can run the following command to transform the csv files into a more standard format:

for i in /home/vora/csv/*.CSV; do
	sed -e 's/"/\\"/g' -e 's/<tab>/","/g' -e 's/^/"/' -e 's/$/"/' "$i" > "${i/.CSV}.csv"
done

Make sure you replace the “<tab>” with an actual tab (in the shell, press CTRL+V, then the Tab key).
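To see what the sed expression does, you can run it on a single sample line: tab-separated input becomes a fully quoted, comma-separated record (here using printf to produce the literal tab):

```shell
# Demo of the transformation on one tab-delimited sample line.
TAB=$(printf '\t')
printf '611616334\t20170318\tUSA\n' \
  | sed -e 's/"/\\"/g' -e "s/$TAB/\",\"/g" -e 's/^/"/' -e 's/$/"/'
# → "611616334","20170318","USA"
```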

Then, we need to import the data into Hadoop HDFS with the following commands (executed as the vora user):

sudo hdfs dfs -mkdir /user/hdfs
sudo hdfs dfs -mkdir /user/hdfs/csv
sudo hdfs dfs -put /home/vora/csv/*.csv /user/hdfs/csv
sudo hdfs dfs -chmod -R 777 /user/hdfs

We used “sudo” here, because Hadoop is currently installed and running as the “root” user.

Now that the files are on the Hadoop HDFS, you can run the following SQL command from the Vora tools SQL Editor to create the tables:

  • as a “disk engine” table:
DROP TABLE IF EXISTS EVENTS_DE;

CREATE TABLE EVENTS_DE (
	GLOBALEVENTID BIGINT,
	SQLDATE INTEGER,
	MONTHYEAR INTEGER,
	YYEAR INTEGER,
	FRACTIONDATE VARCHAR(20),
	ACTOR1CODE VARCHAR(20),
	ACTOR1NAME VARCHAR(255),
	ACTOR1COUNTRYCODE VARCHAR(3),
	ACTOR1KNOWNGROUPCODE VARCHAR(3),
	ACTOR1ETHNICCODE VARCHAR(3),
	ACTOR1RELIGION1CODE VARCHAR(3),
	ACTOR1RELIGION2CODE VARCHAR(3),
	ACTOR1TYPE1CODE VARCHAR(3),
	ACTOR1TYPE2CODE VARCHAR(3),
	ACTOR1TYPE3CODE VARCHAR(3),
	ACTOR2CODE VARCHAR(15),
	ACTOR2NAME VARCHAR(255),
	ACTOR2COUNTRYCODE VARCHAR(3),
	ACTOR2KNOWNGROUPCODE VARCHAR(3),
	ACTOR2ETHNICCODE VARCHAR(3),
	ACTOR2RELIGION1CODE VARCHAR(3),
	ACTOR2RELIGION2CODE VARCHAR(3),
	ACTOR2TYPE1CODE VARCHAR(3),
	ACTOR2TYPE2CODE VARCHAR(3),
	ACTOR2TYPE3CODE VARCHAR(3),
	ISROOTEVENT INTEGER,
	EVENTCODE VARCHAR(4),
	EVENTBASECODE VARCHAR(3),
	EVENTROOTCODE VARCHAR(2),
	QUADCLASS INTEGER,
	GOLDSTEINSCALE DECIMAL(3,1),
	NUMMENTIONS INTEGER,
	NUMSOURCES INTEGER,
	NUMARTICLES INTEGER,
	AVGTONE DOUBLE,
	ACTOR1GEO_TYPE INTEGER,
	ACTOR1GEO_FULLNAME VARCHAR(255),
	ACTOR1GEO_COUNTRYCODE VARCHAR(2),
	ACTOR1GEO_ADM1CODE VARCHAR(5),
	ACTOR1GEO_LAT DECIMAL(9,4),
	ACTOR1GEO_LONG DECIMAL(9,4),
	ACTOR1GEO_FEATUREID VARCHAR(10),
	ACTOR2GEO_TYPE INTEGER,
	ACTOR2GEO_FULLNAME VARCHAR(255),
	ACTOR2GEO_COUNTRYCODE VARCHAR(3),
	ACTOR2GEO_ADM1CODE VARCHAR(5),
	ACTOR2GEO_LAT DECIMAL(9,4),
	ACTOR2GEO_LONG DECIMAL(9,4),
	ACTOR2GEO_FEATUREID VARCHAR(10),
	ACTIONGEO_TYPE INTEGER,
	ACTIONGEO_FULLNAME VARCHAR(255),
	ACTIONGEO_COUNTRYCODE VARCHAR(3),
	ACTIONGEO_ADM1CODE VARCHAR(5),
	ACTIONGEO_LAT DECIMAL(9,4),
	ACTIONGEO_LONG DECIMAL(9,4),
	ACTIONGEO_FEATUREID VARCHAR(50),
	DATEADDED INTEGER,
	SOURCEURL VARCHAR(255)
) USING 
	com.sap.spark.engines.disk
OPTIONS (
	storagebackend "hdfs",
	files "/user/hdfs/csv/20170318.export.csv",
	format "csv",
	tableName "EVENTS_DISK",
	tableSchema "GLOBALEVENTID BIGINT, SQLDATE INTEGER, MONTHYEAR INTEGER, YYEAR INTEGER, FRACTIONDATE VARCHAR(20), ACTOR1CODE VARCHAR(20), ACTOR1NAME VARCHAR(255), ACTOR1COUNTRYCODE VARCHAR(3), ACTOR1KNOWNGROUPCODE VARCHAR(3), ACTOR1ETHNICCODE VARCHAR(3), ACTOR1RELIGION1CODE VARCHAR(3), ACTOR1RELIGION2CODE VARCHAR(3), ACTOR1TYPE1CODE VARCHAR(3), ACTOR1TYPE2CODE VARCHAR(3), ACTOR1TYPE3CODE VARCHAR(3), ACTOR2CODE VARCHAR(15), ACTOR2NAME VARCHAR(255), ACTOR2COUNTRYCODE VARCHAR(3), ACTOR2KNOWNGROUPCODE VARCHAR(3), ACTOR2ETHNICCODE VARCHAR(3), ACTOR2RELIGION1CODE VARCHAR(3), ACTOR2RELIGION2CODE VARCHAR(3), ACTOR2TYPE1CODE VARCHAR(3), ACTOR2TYPE2CODE VARCHAR(3), ACTOR2TYPE3CODE VARCHAR(3), ISROOTEVENT INTEGER, EVENTCODE VARCHAR(4), EVENTBASECODE VARCHAR(3), EVENTROOTCODE VARCHAR(2), QUADCLASS INTEGER, GOLDSTEINSCALE DECIMAL(3,1), NUMMENTIONS INTEGER, NUMSOURCES INTEGER, NUMARTICLES INTEGER, AVGTONE DOUBLE, ACTOR1GEO_TYPE INTEGER, ACTOR1GEO_FULLNAME VARCHAR(255), ACTOR1GEO_COUNTRYCODE VARCHAR(2), ACTOR1GEO_ADM1CODE VARCHAR(5), ACTOR1GEO_LAT DECIMAL(9,4), ACTOR1GEO_LONG DECIMAL(9,4), ACTOR1GEO_FEATUREID VARCHAR(10), ACTOR2GEO_TYPE INTEGER, ACTOR2GEO_FULLNAME VARCHAR(255), ACTOR2GEO_COUNTRYCODE VARCHAR(3), ACTOR2GEO_ADM1CODE VARCHAR(5), ACTOR2GEO_LAT DECIMAL(9,4), ACTOR2GEO_LONG DECIMAL(9,4), ACTOR2GEO_FEATUREID VARCHAR(10), ACTIONGEO_TYPE INTEGER, ACTIONGEO_FULLNAME VARCHAR(255), ACTIONGEO_COUNTRYCODE VARCHAR(3), ACTIONGEO_ADM1CODE VARCHAR(5), ACTIONGEO_LAT DECIMAL(9,4), ACTIONGEO_LONG DECIMAL(9,4), ACTIONGEO_FEATUREID VARCHAR(50), DATEADDED INTEGER, SOURCEURL VARCHAR(255)"
);

SELECT * FROM EVENTS_DE LIMIT 10;
  • as a “relational” table:
DROP TABLE IF EXISTS EVENTS_RE;

CREATE TABLE EVENTS_RE(
	GLOBALEVENTID BIGINT,
	SQLDATE INTEGER,
	MONTHYEAR INTEGER,
	YYEAR INTEGER,
	FRACTIONDATE VARCHAR(20),
	ACTOR1CODE VARCHAR(20),
	ACTOR1NAME VARCHAR(255),
	ACTOR1COUNTRYCODE VARCHAR(3),
	ACTOR1KNOWNGROUPCODE VARCHAR(3),
	ACTOR1ETHNICCODE VARCHAR(3),
	ACTOR1RELIGION1CODE VARCHAR(3),
	ACTOR1RELIGION2CODE VARCHAR(3),
	ACTOR1TYPE1CODE VARCHAR(3),
	ACTOR1TYPE2CODE VARCHAR(3),
	ACTOR1TYPE3CODE VARCHAR(3),
	ACTOR2CODE VARCHAR(15),
	ACTOR2NAME VARCHAR(255),
	ACTOR2COUNTRYCODE VARCHAR(3),
	ACTOR2KNOWNGROUPCODE VARCHAR(3),
	ACTOR2ETHNICCODE VARCHAR(3),
	ACTOR2RELIGION1CODE VARCHAR(3),
	ACTOR2RELIGION2CODE VARCHAR(3),
	ACTOR2TYPE1CODE VARCHAR(3),
	ACTOR2TYPE2CODE VARCHAR(3),
	ACTOR2TYPE3CODE VARCHAR(3),
	ISROOTEVENT INTEGER,
	EVENTCODE VARCHAR(4),
	EVENTBASECODE VARCHAR(3),
	EVENTROOTCODE VARCHAR(2),
	QUADCLASS INTEGER,
	GOLDSTEINSCALE DECIMAL(3,1),
	NUMMENTIONS INTEGER,
	NUMSOURCES INTEGER,
	NUMARTICLES INTEGER,
	AVGTONE DOUBLE,
	ACTOR1GEO_TYPE INTEGER,
	ACTOR1GEO_FULLNAME VARCHAR(255),
	ACTOR1GEO_COUNTRYCODE VARCHAR(2),
	ACTOR1GEO_ADM1CODE VARCHAR(5),
	ACTOR1GEO_LAT DECIMAL(9,4),
	ACTOR1GEO_LONG DECIMAL(9,4),
	ACTOR1GEO_FEATUREID VARCHAR(10),
	ACTOR2GEO_TYPE INTEGER,
	ACTOR2GEO_FULLNAME VARCHAR(255),
	ACTOR2GEO_COUNTRYCODE VARCHAR(3),
	ACTOR2GEO_ADM1CODE VARCHAR(5),
	ACTOR2GEO_LAT DECIMAL(9,4),
	ACTOR2GEO_LONG DECIMAL(9,4),
	ACTOR2GEO_FEATUREID VARCHAR(10),
	ACTIONGEO_TYPE INTEGER,
	ACTIONGEO_FULLNAME VARCHAR(255),
	ACTIONGEO_COUNTRYCODE VARCHAR(3),
	ACTIONGEO_ADM1CODE VARCHAR(5),
	ACTIONGEO_LAT DECIMAL(9,4),
	ACTIONGEO_LONG DECIMAL(9,4),
	ACTIONGEO_FEATUREID VARCHAR(50),
	DATEADDED INTEGER,
	SOURCEURL VARCHAR(255)
) USING 
	com.sap.spark.vora
OPTIONS (
	storagebackend "hdfs",
	files "/user/hdfs/csv/20170318.export.csv",
	format "csv",
	schema "GLOBALEVENTID BIGINT, SQLDATE INTEGER, MONTHYEAR INTEGER, YYEAR INTEGER, FRACTIONDATE VARCHAR(20), ACTOR1CODE VARCHAR(20), ACTOR1NAME VARCHAR(255), ACTOR1COUNTRYCODE VARCHAR(3), ACTOR1KNOWNGROUPCODE VARCHAR(3), ACTOR1ETHNICCODE VARCHAR(3), ACTOR1RELIGION1CODE VARCHAR(3), ACTOR1RELIGION2CODE VARCHAR(3), ACTOR1TYPE1CODE VARCHAR(3), ACTOR1TYPE2CODE VARCHAR(3), ACTOR1TYPE3CODE VARCHAR(3), ACTOR2CODE VARCHAR(15), ACTOR2NAME VARCHAR(255), ACTOR2COUNTRYCODE VARCHAR(3), ACTOR2KNOWNGROUPCODE VARCHAR(3), ACTOR2ETHNICCODE VARCHAR(3), ACTOR2RELIGION1CODE VARCHAR(3), ACTOR2RELIGION2CODE VARCHAR(3), ACTOR2TYPE1CODE VARCHAR(3), ACTOR2TYPE2CODE VARCHAR(3), ACTOR2TYPE3CODE VARCHAR(3), ISROOTEVENT INTEGER, EVENTCODE VARCHAR(4), EVENTBASECODE VARCHAR(3), EVENTROOTCODE VARCHAR(2), QUADCLASS INTEGER, GOLDSTEINSCALE DECIMAL(3,1), NUMMENTIONS INTEGER, NUMSOURCES INTEGER, NUMARTICLES INTEGER, AVGTONE DOUBLE, ACTOR1GEO_TYPE INTEGER, ACTOR1GEO_FULLNAME VARCHAR(255), ACTOR1GEO_COUNTRYCODE VARCHAR(2), ACTOR1GEO_ADM1CODE VARCHAR(5), ACTOR1GEO_LAT DECIMAL(9,4), ACTOR1GEO_LONG DECIMAL(9,4), ACTOR1GEO_FEATUREID VARCHAR(10), ACTOR2GEO_TYPE INTEGER, ACTOR2GEO_FULLNAME VARCHAR(255), ACTOR2GEO_COUNTRYCODE VARCHAR(3), ACTOR2GEO_ADM1CODE VARCHAR(5), ACTOR2GEO_LAT DECIMAL(9,4), ACTOR2GEO_LONG DECIMAL(9,4), ACTOR2GEO_FEATUREID VARCHAR(10), ACTIONGEO_TYPE INTEGER, ACTIONGEO_FULLNAME VARCHAR(255), ACTIONGEO_COUNTRYCODE VARCHAR(3), ACTIONGEO_ADM1CODE VARCHAR(5), ACTIONGEO_LAT DECIMAL(9,4), ACTIONGEO_LONG DECIMAL(9,4), ACTIONGEO_FEATUREID VARCHAR(50), DATEADDED INTEGER, SOURCEURL VARCHAR(255)"
);


SELECT * FROM EVENTS_RE LIMIT 10;

If you want to load all your files, you can replace:

files "/user/hdfs/csv/20170318.export.csv",

by:

files "/user/hdfs/csv/*",

 

And of course, you can now see your table in the Vora Modeler and start querying your local data!

Now let’s play with the data!
