Technical Articles
Test data generator for SAP HANA for large initial and delta data sets
Introduction
Recently I ran a scenario on the SAP HANA database for which I needed a large initial data set for a specific table, and another large delta data set consisting of deletions, updates and inserts on the initial data set. I googled something like “data generator for HANA” but didn’t find what I was looking for. Therefore, I wrote a database procedure to fill my table with the data that I needed. Because others might find the script useful too, I extended the script for easy re-use. The result you can find in this blog.
The procedure inserts, updates, or deletes data of a single table, using mostly set-based operations for high-performance. Inserts and updates are based on the field definition of the provided table and contain randomized data. For example, for an integer field, a random integer value is inserted, and for a character-based field, a random string of characters is inserted. The procedure uses the HANA metadata tables to retrieve the table definition. Record updates or deletions are randomized based on the first column of the primary key.
On my little test system, I achieved an insert rate of 9 million records per minute for a table with just a few fields. That was a lot faster than the first version of the script, which was not set-based.
How to run the data generator procedure
Follow below steps to generate data for your table
- Dowload the code in this Git repository and copy it into the SQL console
- Create a sequence with following statement: “CREATE SEQUENCE GENERATEDATASEQ1”. A sequence with that name is used in the procedure.
- Call the procedure using the following statement:
CALL "GenerateData" (
IP_SCHEMA_NAME => '<SCHEMA_NAME>',
IP_TABLE_NAME => '<TABLE_NAME>',
IP_INSERT_NUM_RECORDS => <Number of records to be inserted>,
IP_MAX_STRING_SIZE => <Maximum inserted string size. Strings are automatically limited on field length>,
IP_DELETE_NUM_RECORDS => <Number of records to delete>,
IP_UPDATE_NUM_RECORDS => <Number of records to update>
);
Limitations
Keep in mind the following limitations, or adjust the code to work around them:
- Only supports tables with a primary key. The primary key should contain at least one column that can hold integer values. Binary data types are not supported at all.
- Created with HANA 2, SPS04, not guaranteed to work with older service packs.
- The random generated values make compression difficult. Therefore, you will likely see minimal compression that is not to be compared with real-life use cases.
Example
As an example, a table is created with the following statement:
CREATE SCHEMA GENERATEDATA;
SET SCHEMA GENERATEDATA;
CREATE TABLE T1S (
A INTEGER,
B NVARCHAR(5),
C DATE,
D TIMESTAMP,
PRIMARY KEY (A)
);
To insert two records into that table, the procedure is called as follows:
CALL "GenerateData"(
IP_SCHEMA_NAME => 'GENERATEDATA',
IP_TABLE_NAME => 'T1S',
IP_INSERT_NUM_RECORDS => 2,
IP_MAX_STRING_SIZE => 3,
IP_DELETE_NUM_RECORDS => 0,
IP_UPDATE_NUM_RECORDS => 0
);
A select * on that table yields the following result:
Explanation of the code
In case you want to adjust the code or just understand what happens, an example table definition and procedure call is provided in the header section of the procedure. Throughout the code, output of (parts of) the dynamic SQL statements are shown, so it is easier to understand the, sometimes, long statements.
Troubleshooting
In case you run into an error that contains “incorrect syntax near “FROM””, check if you entered the right schema and table name as parameter for the procedure call.
Conclusion
This blog provided you with a procedure to generate initial and delta data sets. Since the inserts are set-based, it should insert, update and delete at reasonable speeds. There are some limitations, and the code is not perfect, so if you got any improvements, feel free to share them in the comments.
Hi Stefan,
Thanks for this really helpful post!
I built a simple shell script which executes your procedure 20 times to create 20 tables with 50 million records each with a user I called GENDATA. With the below configuration I got 4.7GB per table in memory and on disk and each run took about 15 minutes (so it took roughly 3 minutes to generate 1GB).
Maybe this is useful for someone else, so I leave it here:
wrapper.sh
n=1
while [ $n -le 20 ]
do
echo “=====Run: $n=====”
hdbsql -U GENDATA “DROP SEQUENCE GenerateDataSeq1;”
hdbsql -U GENDATA “CREATE SEQUENCE GenerateDataSeq1;”
hdbsql -U GENDATA “CREATE TABLE TABLE$n (A INTEGER, B NVARCHAR(20), C DATE, D TIMESTAMP, E NVARCHAR(20), F NVARCHAR(20), PRIMARY KEY (A) );”
hdbsql -U GENDATA “CALL \”GenerateData\” (IP_SCHEMA_NAME => ‘GENDATA’, IP_TABLE_NAME => ‘TABLE$n’, IP_INSERT_NUM_RECORDS => 50000000, IP_MAX_STRING_SIZE => 20, IP_DELETE_NUM_RECORDS => 0, IP_UPDATE_NUM_RECORDS => 0);”
n=$(( n+1 ))
done
works, thanks a ton! To create a user GENDATA you must
create user GENDATA password passW0rd
Then hdbsql -U GENDATA will not allow for queries, but you do not need to use GENDATA to call the script, just its schema must exist. So after creating the user GENDATA, you can use hdbsql -u SYSTEM -p mypassword instead of hdbsql -U GENDATA.
Hi,
nice work, I can see that this all will work, but the first sentence struck me: Paste it into the SQL Console. I do not have HANA Studio nor HANA Cockpit, where is the SQL Console?
thanks
Thorsten
found the answer, if you do not have an SQL Console, you can download the code from github as text file myfile.txt and run it in a sidadm shell like this:
hdbsql -u SYSTEM -p mypassword -I myfile.txt
Great, I created my first stored procedure 🙂
Works for me, thanks a lot!
Users who do not have graphical tools can use hdbsql. To query the table, use
hdbsql -u SYSTEM -p mypassword "select * from GENERATEDATA.T1S"|cat
Downloaded the code from git hub. While calling Procedure "GenerateData". Hitting the below error.
Could not execute 'CALL "GenerateData"( IP_SCHEMA_NAME => 'DATA1', IP_TABLE_NAME => 'Test', IP_INSERT_NUM_RECORDS => 12, ...' in 149 ms 315 µs .
SAP DBTech JDBC: [257]: sql syntax error: "SYSTEM"."GenerateData": line 153 col 4 (at pos 7942): incorrect syntax near "FROM": line 4 col 6 (at pos 59)
Need help. New to sql, sap hana.
@Pooja Shetty You probably forgot to create sequence.
In case you run into an error that contains "incorrect syntax near "FROM"", check if you entered the right schema and table name as parameter for the procedure call. If the procedure can't find the table that you created, the metadata fetch fails and the resulting INSERT query will be incorrect.
@Pooja Shetty I got exactly the same error. It's probably worth to know that i'm running HANA 2.0 SPS6 (2.00.060.00.1637829839).
Sefan Linders Is it possible that the procedure uses a SQL statement here which are not longer valid? It's written using SPS4, while i'm trying to get it running on SPS6
cheers,
Thomas
In case you run into an error that contains "incorrect syntax near "FROM"", check if you entered the right schema and table name as parameter for the procedure call.
Thanks Sefan,
your answer was very helpful. The procedure itself did run using the correct schema - but the tables i've created beforehand didn't.
I fixed that in my database, now it's running perfectly!
Hi Sefan,
I ran into the error below while inserting 10,000,000 rows with this procedure:
Could not execute 'CALL "GenerateData"( IP_SCHEMA_NAME => 'SAPABAP1', IP_TABLE_NAME => 'ACDOCA', IP_INSERT_NUM_RECORDS ...' in 6:53.173 minutes .
[314]: numeric overflow: "SYSTEM"."GenerateData": line 153 col 4 (at pos 7947): Failed in "PROZS_PN" column with the value 1000.00
Column PROZS_PN is defined as follows:
Inserting 100,000 rows didn't raise this issue, but failed with 500,000 and 1,000,000 rows each. Any idea?
BR,
Jinsol
Plus, I found an improvement point. If I execute this procedure with a certain schema set on client session, it fails with an error below:
SET SCHEMA SAPABAP1;
CALL "GenerateData" (
IP_SCHEMA_NAME => 'SAPABAP1',
IP_TABLE_NAME => 'ACDOCA',
IP_INSERT_NUM_RECORDS => 200000,
IP_MAX_STRING_SIZE => 2,
IP_DELETE_NUM_RECORDS => 0,
IP_UPDATE_NUM_RECORDS => 0
);
Could not execute 'CALL "GenerateData" ( IP_SCHEMA_NAME => 'SAPABAP1', IP_TABLE_NAME => 'ACDOCA', ...' in 8 ms 349 µs .
SAP DBTech JDBC: [288]: cannot use duplicate table name: "SYSTEM"."GenerateData": line 71 col 3 (at pos 3498): #SURROGATE: line 1 col 37 (at pos 36)
FYI, my testing was done on HANA 2 Rev 64.
BR,
Jinsol