Skip to Content
Author's profile photo Tom Flanagan

[SAP HANA Academy] Discover How to Connect SAP Data Services to Google BigQuery

In a tutorial video the SAP HANA Academy’s Tahir Hussain Babar (aka Bob) shows how to use SAP Data Services to connect to Google BigQuery. Bob will detail how to read data from Google BigQuery with SAP Data Services, how to enrich and cleanse that data in SAP Data Services and how to write the data into Google BigQuery using a SAP Data Services job.

Screen Shot 2015-12-03 at 11.31.36 AM.png

(0:10 – 1:04) What is Google BigQuery?


Google BigQuery is a method of loading data into Google’s infrastructure that enables you to run select statements on the data while Google’s own hardware handles the processing. These commands are run using SQL-like queries that extract the data. There are many ways to connect to Google BigQuery including using a web UI, a command line tool, and/or by making direct calls to the Google BigQuery REST API using various client libraries such as JAVA, .NET and/or Python. There are numerous third party tools that can interact with Google BigQuery in many capacities such as visualizing data and loading data. One of those third party tools is SAP Data Services.


For additional detailed information about Google BigQuery please check out Lofan Abrams’ post: SAP Data Services builds momentum with BigQuery


(1:04 – 4:05) How to Upload a Data Set to Google BigQuery Using the Web UI


You can access Google BigQuery from the Google Developers Console by selecting BigQuery. Click on your Dataset ID and select the option to Create new table. When you create a table you have the option of using a template from a previously uploaded schema. Bob elects to forgo a template and proceeds to ID his new table as ADDRESS_DATA while leaving the table type as native, the default. For the source format you have three options, CSV, JSON and Cloud Datastore Backup. Bob chooses CSV and uploads his file DirtyData.csv from his desktop. The file contains a list of names and dirty (incorrect) addresses.


Next you must specify the schema. It is possible to add a new field by selecting from a dropdown menu of different types and modes. Or you can also edit the schema as a text, which Bob elects to do as shown below.

Screen Shot 2015-12-03 at 8.32.45 AM.png

The final step is to select whether or not the data set is delimited and how many header rows to skip. Once your advanced options are set, click the submit button.


(4:05 – 4:50) How to View the Data Set


Once the data set is uploaded click on the COMPOSE QUERY button at the top. To view the data, write a simple select statement such as the one shown below and then click the RUN QUERY button. There is also the option to view the data in JSON format.

Screen Shot 2015-12-03 at 9.02.06 AM.png

(4:50 – 6:10) Importing the Data into SAP Data Services


First, in the SAP Data Services Designer, create a new data store. You must enter the information pictured below including the Authentication Server URL, Authentication Access Scope, email address, Private Key, Signature Algorithm, Web Services URL and the proxy host and port.

Screen Shot 2015-12-03 at 9.10.21 AM.png

Once this is set you will be given access to all of the external metadata from your Google BigQuery account. In Bob’s example he has a project called saphanadspaloalto where he can access the ADDRESS_DATA table he created in Google BigQuery. Clicking Import brings the metadata into SAP Data Services.


(6:10 – 8:40) How to Create a Data Flow with the Google BigQuery Data in SAP Data Services


Create a new project and then a new job in that project. Bob’s project is called SHA_GBQ_Videos and his job is called Read_from_GBQ_job. Next create a data flow in the job. Bob’s is called Read_from_GBQ_DF. Bob then uses a SQL transform, which enables him to write SQL against the connection. Select your Google BigQuery datastore as the Datastore and then enter the same SQL statement that you would in Google BigQuery’s web UI. Clicking Update schema takes the columns from the select statement and puts them into the output of the SQL transform.

Screen Shot 2015-12-03 at 9.28.12 AM.png

Next Bob outputs the SQL transform into a Query transform. Here is where you can decided which columns you will use from the select statement. Bob selects ID, COUNTRY, ADDRESS1, ADDRESS2, ADDRESS3, CITY, STATE and ZIPCODE.

Screen Shot 2015-12-03 at 9.43.24 AM.png

(8:40 – 11:55) How to Cleanse the Data in SAP Data Services


From the list of Data Quality transforms Bob expands Global_Address_Cleanse and then selects USA_AddressCleanse before dragging it into his data flow. For his input Bob selects the columns he wishes to cleanse. Bob uses MULTILINE for each of the columns so that every single column is checked for an address, city, state, and/or zip code.

Screen Shot 2015-12-03 at 10.26.51 AM.png

Next Bob keeps the default options before configuring the USA_AddressCleanse Output to better enrich the data. Bob choose PIMARY_NUMBER as the first part of his output’s formatting. For example, you could set it so every address is outputted as either Downing Street or Downing St. By selecting PRIMARY_NAME_FULL1 Bob’s will be outputted as Downing Street.


The other FEILD_NAMEs Bob chooses are LOCALITY1_NAME, which outputs the cleansed city, REGION1, which outputs the cleansed state, POSTCODE_FULL, which outputs the full zip code and COUNTRY, which will outputs United States. Bob also adds the ID column from the Query to be outputted as well.

Screen Shot 2015-12-03 at 10.49.10 AM.png

(11:55 – 13:00) Creating the Target Table in Google BigQuery


Next back in Google BigQuery you need to create a new table. Bob names his new table CLEANSED_DATA. For the file Bob just uses a blank CSV. He uses the below text for his schema and keeps all of the default advanced options.

Screen Shot 2015-12-03 at 10.53.27 AM.png

(13:00 – 14:55) Writing Data from SAP Data Services to Google BigQuery and Running the Job


Back in the SAP Data Services Designer Bob imports the CLEANSED_DATA table from his saphanadspaloalto project. Bob adds the CLEANSED_DATA table as the output to the data flow. In between the USA_AddressCleanse and output Bob adds yet another Query transform. Bob then maps the address cleansing fields to their corresponding output table columns.

Screen Shot 2015-12-03 at 11.02.05 AM.png

Bob saves his project and then executes the job. The SAP Data Services job connects to Google BigQuery, preforms some transformations (enriching and cleansing the address data) in SAP Data Services and then writes the cleansed data back to Google BigQuery.


(14:55 – 16:52) Comparing the Query Results


Bob runs the select statement shown below in Goggle BigQuery to further confirm that his job was successfully run.

Screen Shot 2015-12-03 at 11.26.20 AM.png

Then Bob runs another select statement (shown below) on the source table to demonstrate just how unclean the original data was. It was littered with spelling errors (ie. BOLAVARD) and the zip codes only had five digits.

Screen Shot 2015-12-03 at 11.27.51 AM.png

Bob focuses his query on a single ID and compares the query results for the data from the ADDRESS_DATA and the CLEANSED_DATA tables.

Screen Shot 2015-12-03 at 11.29.47 AM.png

Query Result for ID28114 from ADDRESS_DATA

Screen Shot 2015-12-03 at 11.30.14 AM.png

Query Result for ID28114 from CLEANSED_DATA


By the end of the tutorial Bob has demonstrated how to connect SAP Data Services to Google BigQuery.


SAP HANA Academy – Over 1,200 free tutorials videos on SAP HANA, SAP Analytics and the SAP HANA Cloud Platform.


Follow us on Twitter @saphanaacademy and connect with us on LinkedIn to stay abreast of our latest free tutorials.

Assigned Tags

      1 Comment
      You must be Logged on to comment or reply to a post.
      Author's profile photo Leandro Fonseca
      Leandro Fonseca

      Hi

      How do you ads BigQuery as a Datatore type?