Migration of 30 Million Customers into C4/HANA Customer Data Cloud
SAP Customer Data Cloud (CDC), i.e. the Gigya customer identity and access management platform, is designed to help companies build digital relationships with their customers. The platform allows companies to manage customers’ profiles, preferences, opt-ins and consent settings, while customers retain control of their data. Customers opt in and register via CDC’s registration-as-a-service, which addresses changing regional privacy requirements and manages compliance with regulations such as the General Data Protection Regulation (GDPR).
CDC is SAP’s response to the need to deliver a personalised omni-channel experience that uses customer information transparently, helping organisations adhere to data privacy, residency and retention legislation, including GDPR.
This blog describes an approach to migrating customer data to SAP Customer Data Cloud using SAP Cloud Platform and the IdentitySync component.
Data Migration Architecture
Large volumes of data can be migrated from source systems into SAP Customer Data Cloud using the IdentitySync component, CDC’s SaaS tool for the import, transformation and export of data, via SAP Cloud Platform Integration Data Services (CPI DS). IdentitySync is Gigya’s robust ETL (Extract, Transform, Load) solution and offers an easy way to transfer data in bulk between platforms. The diagram below illustrates a high-level architecture for volume migration into SAP CDC.
- Data is extracted from source systems and placed in AWS SFTP Folders.
- CPI DS reads files from the folder. Transformation or mapping rules can be applied during the read process.
- CPI DS saves the transformed data into a HANA Database.
- CPI DS reads data from HANA DB and creates the required load format for CDC import. These files are saved to an AWS SFTP folder.
- CPI PI reads the files from SFTP, splitting them into optimum batch sizes and saving back to the SFTP folder.
- SAP Customer Data Cloud reads the final files and loads the data. Logs for any failures are written back to a separate SFTP folder for analysis.
The recommended format for load files is JSON, but extracting or transforming data into this format may not always be possible. In these cases, CDC allows the import of data from a delimited text file. IdentitySync offers code templates for some of the dataflow scenarios, which can be customised to parse CSV files and transform them into JSON according to the customer requirements.
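To make the CSV-to-JSON idea concrete, here is a minimal Python sketch of the conversion performed outside CDC, for example as a pre-processing step. It is not the IdentitySync template itself; the function name `dsv_to_json_lines` and the sample column names are hypothetical, and it assumes the delimited file has a header row.

```python
import csv
import io
import json


def dsv_to_json_lines(text: str, delimiter: str = ",") -> list[str]:
    """Convert delimited text with a header row into one JSON object per record."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [json.dumps(row) for row in reader]


# Hypothetical sample data; real column names come from the source system.
sample = "email,firstName,lastName\nanna@example.com,Anna,Berg\n"
for line in dsv_to_json_lines(sample):
    print(line)
```

Every value is emitted as a string here; any type conversion (numbers, booleans, dates) would still need to be handled in a later step.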
IdentitySync Components
In IdentitySync, a component is a pre-configured data flow element that performs a specific data integration operation. The components include readers, writers, transformers and lookups. A data flow is a series of steps that together form the complete definition of a transfer of information between CDC and a third-party platform. A data flow can be scheduled to run at a set interval or executed on an ad hoc basis.
CDC Import Data flow
Create Data Flow
The first step is to create a data flow from one of the templates provided for the import of Lite accounts (i.e. only a few fields of the customer profile data set) and Full accounts (i.e. all fields of the customer profile data set).
Using the Full Account template, we can see the basic process for the import of data:
In the template dataflow, the preferred file format is specified as JSON. This can be changed using the additional components available in the dataflow selection screen:
Parse, Evaluate and Add Fields
The file.parse.dsv step parses the CSV file and converts it into the standard CDC JSON format. One limitation of the CSV import process is that nested array data, such as telephone numbers, cannot be imported using the standard import template. The DSV parser has no native support for reading the nested array format (e.g. number:"+46704429999", type:"Mobile").
In this case, a script can be used to convert the string holding the phone number and type into a nested JSON object.
Often, some data elements are constant values that apply to every record being imported. To optimise import performance, the "field.add" option can be used to add these fields to each record with a fixed value, rather than passing them in the file.
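The conversion logic can be sketched as follows. This is an illustration in Python rather than the script used inside the dataflow, and the flat input format is an assumption: each phone is written as `number|type`, with multiple phones separated by `;`. The function name `parse_phones` is hypothetical.

```python
def parse_phones(raw: str) -> list[dict[str, str]]:
    """Turn 'number|type;number|type' into a list of {"number", "type"} objects."""
    phones = []
    for entry in raw.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # skip empty segments such as a trailing ';'
        number, _, phone_type = entry.partition("|")
        phone = {"number": number.strip()}
        if phone_type.strip():
            phone["type"] = phone_type.strip()
        phones.append(phone)
    return phones


# Assumed flat representation of the phone column in the CSV file.
print(parse_phones("+46704429999|Mobile"))
```

The resulting list can then be assigned to the array field that holds phone numbers in your schema before the record is imported.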
Rename the Source System Field Names to CDC Field Names
The field names of the data being imported will most likely differ from those in CDC. For the data to be imported successfully, the fields must be renamed to the CDC field names. The dataflow option "field.rename" is available for this task. In this step, each column heading in the import data file must be mapped to the corresponding field in CDC.
The import account step writes the records to the site in which the dataflow is created. Error handling is taken care of by collecting failed records and writing them to a file saved in a dedicated location, in this case an SFTP folder.
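Before configuring the rename step, it can help to write the mapping down and check it against a sample record. The Python sketch below shows one way to do that; the source column names and the target paths in `FIELD_MAP` are hypothetical examples, and treating dotted target names as nested objects is an assumption about how your schema is laid out.

```python
# Hypothetical mapping from source column headings to CDC field paths.
FIELD_MAP = {
    "EMAIL_ADDR": "profile.email",
    "FIRST_NM": "profile.firstName",
    "LAST_NM": "profile.lastName",
}


def rename_fields(record: dict, field_map: dict[str, str]) -> dict:
    """Rename source columns to dotted target paths, building nested objects."""
    renamed: dict = {}
    for source_name, value in record.items():
        target_path = field_map.get(source_name)
        if target_path is None:
            # Fail loudly so an unmapped column is noticed before the load.
            raise KeyError(f"No CDC mapping for source column {source_name!r}")
        *parents, leaf = target_path.split(".")
        node = renamed
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return renamed


print(rename_fields({"EMAIL_ADDR": "anna@example.com", "FIRST_NM": "Anna"}, FIELD_MAP))
```

Raising an error for unmapped columns is a deliberate choice in this sketch: a missing mapping is easier to fix before the load than after records have been rejected.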
Scheduling and Monitoring the Data Flows
When a dataflow has been completed it can be tested from the Dataflows Overview screen. Opening the Actions menu displays a set of options:
‘Run Test’ starts the dataflow and processes only the first 10 records of any import file. This is a quick way to iterate through the test process and fine-tune your dataflow.
When the dataflow is ready to be tested with larger data volumes (more records over multiple files), the ‘Scheduler’ can be used to start the import process. Once scheduled, the dataflow will read files from the specified location in the order defined in the initial read step (typically oldest file first).
Monitoring of the flow can be done through the Status option:
Pressing the details button will open the status of the job where the progress and details of any errors can be seen:
Optimise CDC Data Flow Load Performance
The data flow opens multiple connections to the server, and the number can be amended in the "import.account" step. The recommended maximum number of parallel connections is 20:
It is recommended to split the load files into files of 200 thousand rows each, and no single file may exceed 2GB. Typically, you should estimate a load rate of between 200 and 300 thousand records per hour, but this will vary depending on the number of fields and the number of connections to the server.
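The figures above translate into a simple planning calculation. The Python sketch below splits a stream of records into batches of 200 thousand rows and estimates the file count and load duration for a given volume; the function names are hypothetical, and the estimate assumes a single dataflow loading continuously at the quoted rate. It does not check the 2GB file size limit, which would need a separate check on the written files.

```python
import math
from itertools import islice
from typing import Iterable, Iterator


def batches(records: Iterable[str], batch_size: int = 200_000) -> Iterator[list[str]]:
    """Yield consecutive lists of at most batch_size records."""
    iterator = iter(records)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def estimate_load(total_records: int, rows_per_file: int = 200_000,
                  records_per_hour: int = 250_000) -> tuple[int, float]:
    """Return (number of load files, estimated load time in hours)."""
    files = math.ceil(total_records / rows_per_file)
    hours = total_records / records_per_hour
    return files, hours


# Estimate for 30 million records at the low and high end of the quoted rate.
for rate in (200_000, 300_000):
    files, hours = estimate_load(30_000_000, records_per_hour=rate)
    print(f"{files} files, about {hours:.0f} hours at {rate} records/hour")
```

For 30 million records this gives 150 load files and roughly 100 to 150 hours of load time, which is worth factoring into the cutover plan.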
SAP Customer Data Cloud provides a robust and comprehensive set of tools to assist in the migration of large data volumes onto its platform. With the help of SAP Data Services or CPI – Data Services, an efficient ETL process can be designed to ensure the smooth transfer of data from your legacy systems to CDC. As with any migration project, there will be challenges to overcome, so the following hints may be worth bearing in mind:
- Data Quality – As with any data migration, quality of data is vitally important. CDC has validation rules for fields such as telephone numbers and email addresses. If a field does not meet these validation rules, the record will fail.
- Understand Your Schema – Before agreeing any file formats, understand your CDC schema and how the data is structured, particularly data arrays. This will influence how simple or complicated your import process will be.
- Date formats – Use the export process available in CDC to review the formats for any date fields. Dates are often stored in full ISO-8601 date/time with Coordinated Universal Time (UTC).
- Null Values – Passing a null or empty value can sometimes lead to unexpected errors, particularly in date fields. Wherever possible remove empty values using the ‘record.evaluate’ step.
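The last two hints can also be applied while preparing the load files, before they reach CDC. The Python sketch below is one possible pre-processing approach, not the 'record.evaluate' step itself: it removes empty values from a record and rewrites a timestamp as ISO-8601 in UTC. The function names, the source date format and the assumption that source timestamps are already in UTC are all hypothetical; confirm the exact target format from a CDC export.

```python
from datetime import datetime, timezone


def drop_empty(record: dict) -> dict:
    """Return a copy of the record without null, empty-string or empty-container values."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, dict):
            value = drop_empty(value)  # clean nested objects first
        if value is None or value == "" or value == {} or value == []:
            continue
        cleaned[key] = value
    return cleaned


def to_iso_utc(value: str, source_format: str = "%d/%m/%Y %H:%M:%S") -> str:
    """Reformat a timestamp (assumed to already be in UTC) as ISO-8601 with a Z suffix."""
    parsed = datetime.strptime(value, source_format).replace(tzinfo=timezone.utc)
    millis = parsed.microsecond // 1000
    return parsed.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"


record = {"profile": {"email": "anna@example.com", "birthDate": ""}, "lastLogin": None}
print(drop_empty(record))
print(to_iso_utc("31/03/2019 08:15:00"))
```

If the source system stores local times, they would need to be converted to UTC rather than simply labelled as UTC as this sketch does.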