Introduction and background of technologies
Amazon recently introduced an interesting big data product called Athena. In this post I will go through how to connect a BusinessObjects universe with it.
What is Amazon Athena?
Athena is a query service that makes it possible to access data stored on Amazon S3 using standard SQL. (https://aws.amazon.com/athena/details/)
Probably the most distinctive features of Athena is that it is serverless. In practice it means that:
- no need to configure and initiate virtual machines,
- no cost for allocated CPU and memory
Amazon manages all these, and charges only based on the queries that you actually run.
- $5 per TB of data scanned
In order to put this price in perspective; and to simulate a real word example: I have generated call detail records (well, sort of) of an imaginary telecom company with about 5 million active users, initiating about 5-10 calls/users/per day. I would like to analyze the last 90 days without pre-aggregation. So I ended up with more than 3 billion records, or about 12-15 GB compressed.
With only 5$ you can probably run all the meaningful analytical queries on the full dataset. (60-80 queries) But if you partition and query your data wisely, you can get much more for your money. If the WHERE condition narrows down the “data to scan” to a single partition, you will only pay for the size of that partition.
In case you would like to stick with your on-premise hardware and software, Athena is based on Presto. https://prestodb.io/
This step-by-step guide is going to be a bit long for a blog post, but detailed in return, so bear with me. The steps are going to be:
- Loading and defining the database
- Setting up the JDBC driver
- Defining the Athena JDBC in BusinessObjects
- Creating a relational connection in IDT
- One more thing…
Loading and defining the database
At first you need to create an AWS account (https://console.aws.amazon.com/console/home) and create an S3 bucket (https://aws.amazon.com/s3). Note that Athena is not yet available in all regions, only at: US East (Northern Virginia),US East (Ohio), US West (Oregon), Europe (Ireland). It’s best to create the S3 bucket in one of these as well.
Once you have a bucket, you can upload your uncompressed data in CSV with simple drag and drop, or also with SAP BO DataServices.
If you have your data on S3, you may already define your database schema in Athena, and immediately start querying the dataset. (https://aws.amazon.com/athena)
Athena uses Hive DDL. Here’s an example table definition:
CREATE EXTERNAL TABLE CDR (CALL_TYPE String,CALL_RESULT String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://[bucket name]/cdr'
To achieve better performance and lower price, I recommend converting the plain CSV to a column based and compressed format, for example Parquet.
For this part you need to step outside of Athena and initiate a EMR cluster with Hive. I will not go into details, but briefly the steps:
- In Hive you need to reference the simple table,
- Create a new empty table in Parquet format
- With an insert statement load the data into parquet
The DDL of a parquet table:
CREATE EXTERNAL TABLE CDR_PARQUET (CALLING_PARTY int, RECIEVING_PARTY int, CALL_TYPE String, CALL_RESULT String, CALL_DURATION_SECS int) PARTITIONED BY (START_DATE int) STORED AS PARQUET LOCATION 's3://[bucket name]/parquet/cdr'
Setting up the JDBC driver
Download the JDBC driver from Amazon:
- Driver: https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.1.0.jar
- Documentation: http://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html
Before going further, as a baseline you may try to access your Athena database with SQL Workbench. Please find the detailed documentation on the above link.
In order to use the JDBC driver with Information Design Tool, you would also need to add the Amazon SSL certificate to the Java runtime. (SQL workbench doesn’t require this, but IDT would throw an error if it can’t find the certificate in the keystore.) You need to add the certificate to the java keystore with keytool.
Get the Amazon SSL certificate:
- Open the URL: https://athena.eu-west-1.amazonaws.com/
- Open the Developer tools in Chrome
- Go to Security tab
- View certificate
- Details tab
- Copy to file
- Select Base64
Next you need to find the Java Runtime Environment (JRE) paths. All of them: SAP and non-SAP, 32 and 64bit. For example: C:\Program Files (x86)\Java\jre1.8.0_121\bin
Once you located them, create a directory under each \bin as \certs and copy the Amazon certification there.
To import these certificates you need to use the keytool from command prompt. An example script:
set drive=C: set rootdir=C:\Program Files (x86)\SAP BusinessObjects\SAP BusinessObjects Enterprise XI 4.0 set JREdir=C:\Program Files (x86)\Java\jre1.8.0_121 set SAPJVM32dir=%rootdir%\win32_x86\sapjvm\jre set SAPJVM64dir=%rootdir%\win64_x64\sapjvm\jre set SAPJVM64_7dir=%rootdir%\win64_x64\sapjvm_7\jre set CERTdir=%JREdir%\bin\certs set CERT=amazon_base64_x509.cer set ALIAS=Amazon %drive% CD %JREdir%\bin keytool -importcert -keystore "%JREdir%\lib\security\cacerts" -storepass changeit -alias Amazon -file "%CERTdir%\%CERT%" -noprompt CD %SAPJVM32dir%\bin keytool -importcert -keystore "%SAPJVM32dir%\lib\security\cacerts" -storepass changeit -alias Amazon -file "%CERTdir%\%CERT%" -noprompt CD %SAPJVM64dir%\bin keytool -importcert -keystore "%SAPJVM64dir%\lib\security\cacerts" -storepass changeit -alias Amazon -file "%CERTdir%\%CERT%" -noprompt CD %SAPJVMdir%\bin keytool -importcert -keystore "%SAPJVM64_7dir%\lib\security\cacerts" -storepass changeit -alias Amazon -file "%CERTdir%\%CERT%" -noprompt pause
Note that you should get “Certificate was added to keystore” four times! Unortunately it won’t throw a loud error message if it fails.
Defining the Athena JDBC in BusinessObjects
Copy the Athena JDBC driver to: C:\Program Files (x86)\SAP BusinessObjects\SAP BusinessObjects Enterprise XI 4.0\dataAccess\connectionServer\drivers\java
Locate the amazon.sbo file at: C:\Program Files (x86)\SAP BusinessObjects\SAP BusinessObjects Enterprise XI 4.0\dataAccess\connectionServer\jdbc
Add the following definition to the Databases node, similarly to the default Amazon Redshift.
<DataBase Active="Yes" Name="Amazon Athena JDBC"> <JDBCDriver> <ClassPath> <Path>$ROOT$/drivers/java/AthenaJDBC41-1.1.0.jar</Path> </ClassPath> <Parameter Name="JDBC Class">com.amazonaws.athena.jdbc.AthenaDriver</Parameter> <Parameter Name="URL Format">jdbc:awsathena://$DATASOURCE$</Parameter> </JDBCDriver> <Parameter Name="Array Fetch Size">10</Parameter> <Parameter Name="Family">Amazon</Parameter> <Parameter Name="Driver Capabilities">Query</Parameter> <Parameter Name="s3_staging_dir">s3://[your bucket name]/</Parameter> </DataBase>
This needs to be done on the server and on the client machine of the IDT.
Creating a relational connection in IDT
Server (host:port): athena.eu-west-1.amazonaws.com:443
JDBC driver properties: s3_staging_dir=s3://[bucket]/athena_query_results
If all went well, we have a successfull connection!
One more thing…
Not the good type though. There’s an issue: inserting a table with IDT on the data foundation fails.
I could not solve this straight, but found the following work-around:
- create the same database schema in another database (tested on SQL Server)
- create a connection and the data foundation
- replace the connection to the previously created Athena one
- right click the imported tables and Chage Quilifier/Owner
- set the Quilifier as your Athena database name, and leave the owner blank
This setup connects SAP BusinessObjects’ leading enterprise BI capabilties with one of the best and most affordable big data services.