In a tutorial video the SAP HANA Academy’s Tahir Hussain Babar (aka Bob) shows how to use SAP Data Services to connect to Google BigQuery. Bob will detail how to read data from Google BigQuery with SAP Data Services, how to enrich and cleanse that data in SAP Data Services and how to write the data into Google BigQuery using a SAP Data Services job.
(0:10 – 1:04) What is Google BigQuery?
Google BigQuery is a method of loading data into Google’s infrastructure that enables you to run select statements on the data while Google’s own hardware handles the processing. These commands are run using SQL-like queries that extract the data. There are many ways to connect to Google BigQuery including using a web UI, a command line tool, and/or by making direct calls to the Google BigQuery REST API using various client libraries such as JAVA, .NET and/or Python. There are numerous third party tools that can interact with Google BigQuery in many capacities such as visualizing data and loading data. One of those third party tools is SAP Data Services.
(1:04 – 4:05) How to Upload a Data Set to Google BigQuery Using the Web UI
You can access Google BigQuery from the Google Developers Console by selecting BigQuery. Click on your Dataset ID and select the option to Create new table. When you create a table you have the option of using a template from a previously uploaded schema. Bob elects to forgo a template and proceeds to ID his new table as ADDRESS_DATA while leaving the table type as native, the default. For the source format you have three options, CSV, JSON and Cloud Datastore Backup. Bob chooses CSV and uploads his file DirtyData.csv from his desktop. The file contains a list of names and dirty (incorrect) addresses.
Next you must specify the schema. It is possible to add a new field by selecting from a dropdown menu of different types and modes. Or you can also edit the schema as a text, which Bob elects to do as shown below.
The final step is to select whether or not the data set is delimited and how many header rows to skip. Once your advanced options are set, click the submit button.
(4:05 – 4:50) How to View the Data Set
Once the data set is uploaded click on the COMPOSE QUERY button at the top. To view the data, write a simple select statement such as the one shown below and then click the RUN QUERY button. There is also the option to view the data in JSON format.
(4:50 – 6:10) Importing the Data into SAP Data Services
First, in the SAP Data Services Designer, create a new data store. You must enter the information pictured below including the Authentication Server URL, Authentication Access Scope, email address, Private Key, Signature Algorithm, Web Services URL and the proxy host and port.
Once this is set you will be given access to all of the external metadata from your Google BigQuery account. In Bob’s example he has a project called saphanadspaloalto where he can access the ADDRESS_DATA table he created in Google BigQuery. Clicking Import brings the metadata into SAP Data Services.
(6:10 – 8:40) How to Create a Data Flow with the Google BigQuery Data in SAP Data Services
Create a new project and then a new job in that project. Bob’s project is called SHA_GBQ_Videos and his job is called Read_from_GBQ_job. Next create a data flow in the job. Bob’s is called Read_from_GBQ_DF. Bob then uses a SQL transform, which enables him to write SQL against the connection. Select your Google BigQuery datastore as the Datastore and then enter the same SQL statement that you would in Google BigQuery’s web UI. Clicking Update schema takes the columns from the select statement and puts them into the output of the SQL transform.
Next Bob outputs the SQL transform into a Query transform. Here is where you can decided which columns you will use from the select statement. Bob selects ID, COUNTRY, ADDRESS1, ADDRESS2, ADDRESS3, CITY, STATE and ZIPCODE.
(8:40 – 11:55) How to Cleanse the Data in SAP Data Services
From the list of Data Quality transforms Bob expands Global_Address_Cleanse and then selects USA_AddressCleanse before dragging it into his data flow. For his input Bob selects the columns he wishes to cleanse. Bob uses MULTILINE for each of the columns so that every single column is checked for an address, city, state, and/or zip code.
Next Bob keeps the default options before configuring the USA_AddressCleanse Output to better enrich the data. Bob choose PIMARY_NUMBER as the first part of his output’s formatting. For example, you could set it so every address is outputted as either Downing Street or Downing St. By selecting PRIMARY_NAME_FULL1 Bob’s will be outputted as Downing Street.
The other FEILD_NAMEs Bob chooses are LOCALITY1_NAME, which outputs the cleansed city, REGION1, which outputs the cleansed state, POSTCODE_FULL, which outputs the full zip code and COUNTRY, which will outputs United States. Bob also adds the ID column from the Query to be outputted as well.
(11:55 – 13:00) Creating the Target Table in Google BigQuery
Next back in Google BigQuery you need to create a new table. Bob names his new table CLEANSED_DATA. For the file Bob just uses a blank CSV. He uses the below text for his schema and keeps all of the default advanced options.
(13:00 – 14:55) Writing Data from SAP Data Services to Google BigQuery and Running the Job
Back in the SAP Data Services Designer Bob imports the CLEANSED_DATA table from his saphanadspaloalto project. Bob adds the CLEANSED_DATA table as the output to the data flow. In between the USA_AddressCleanse and output Bob adds yet another Query transform. Bob then maps the address cleansing fields to their corresponding output table columns.
Bob saves his project and then executes the job. The SAP Data Services job connects to Google BigQuery, preforms some transformations (enriching and cleansing the address data) in SAP Data Services and then writes the cleansed data back to Google BigQuery.
(14:55 – 16:52) Comparing the Query Results
Bob runs the select statement shown below in Goggle BigQuery to further confirm that his job was successfully run.
Then Bob runs another select statement (shown below) on the source table to demonstrate just how unclean the original data was. It was littered with spelling errors (ie. BOLAVARD) and the zip codes only had five digits.
Bob focuses his query on a single ID and compares the query results for the data from the ADDRESS_DATA and the CLEANSED_DATA tables.
Query Result for ID28114 from ADDRESS_DATA
Query Result for ID28114 from CLEANSED_DATA
By the end of the tutorial Bob has demonstrated how to connect SAP Data Services to Google BigQuery.
SAP HANA Academy – Over 1,200 free tutorials videos on SAP HANA, SAP Analytics and the SAP HANA Cloud Platform.