Hi All,
I’m sorry but this isn’t going to be an exciting blog about some of HANA’s cool features.
It’s a bit boring but the Bread & Butter of all databases is having tables with data, that can be queried and joined.
For those of you just starting or those that are more adventurous and want a large volume of data to test out performance tuning options, then TPC Benchmark™H (TPC–H) might be for you.
TPC-H provides 8 tables and a data generation tool that enables you to create as much data as you want.
Generate the test data, create the tables, import and you’re away.
The 8 tables are all designed to be joined. They simulate Sales Order transactions:
The tables include (using SAP style terminology):
- Customers (Customer Master)
- Supplier (Vendor Master)
- Part (Material Master records)
- Partsupp (Inventory Balances)
- Nation (Country)
- Region
- Orders (Sales Order Header)
- Lineitem (Sales Order line items)
NOTE: The tables are NOT in the same format as the similar SAP ECC tables but offer similar information as what you might find in ECC.
The Table Schema is as follows:
NOTE: SF is the Scaling factor used to generate the data.
For example a scaling factor of 1 will generate 6 Million records for the lineitem table, SF 2 will generate 12 Million, 10 will generate 60M,
100 will generate 600M, etc.
As a quick summary you will need to :
- Log into Linux with the appropriate authorization
- Install a C/C++ compiler if you’ve not already
- Download/Extract the TPC-H files
- Modify the ‘makefile’ to be relevant for Linux
- Compile the data generation tool (DBGEN)
- Run DBGEN to create data files based on the scaling factor
- Create the tables and import them in HANA studio
If you’re still reading then I’ve not scared you off, so now for the technical bit.
Log into Linux with the appropriate authorization
Use Putty or equivalent tool to log into Linux box running HANA , at the OS level.
For those using a AWS HANA developer box refer to http://scn.sap.com/docs/DOC-28294
Install a C/C++ compiler if you’ve not already
sudo zypper install gcc gcc-c++ gcc-fortran
cd /usr/bin/
sudo ln -s gcc g
Download/Extract the TPC-H files
Create a new directory to store TPC-H files, ensure there is enough space on the mounted folder then download and extract the files.
On AWS I have choosen the following:
cd /sap
mkdir tpc-h
cd tpc-h
wget http://www.tpc.org/tpch/spec/tpch_2_15.0.zip
unzip tpch_2_15.0.zip
Modify the ‘makefile’ to be relevant for Linux
Use ‘vi’ or equivalent to modify the supplied ‘makefile’ which is used to compile the data generator tool.
In the ‘makefile’ ensure the following parameters are set
DATABASE=SQLSERVER
MACHINE = LINUX
WORKLOAD = TPCH
NOTE: We use SQLSERVER above as it doesn’t recognise HANA.
For an intro to ‘vi’ you could try http://heather.cs.ucdavis.edu/~matloff/UnixAndC/Editors/ViIntro.html
At the command prompt:
cd tpch_2_15.0/dbgen
vi makefile.suite
Edit the file and save.
Compile the data generation tool (DBGEN)
Now we need to compile the generation tool, to generate the executable file ‘dbgen’
make -f makefile.suite
The dbgen executable should now appear in the folder.
Check with.
ls -l
Run DBGEN to create data files based on the scaling factor
Now you are ready to create ‘tbl’ files for each of the 8 tables.
The important parameter of the ‘dbgen’ program is the scaling factor.
A scaling factor of 1, will generate 1Gb of uncompressed test data(e.g. 6 million rows in the ‘lineitem table’)
A scaling factor of 2, will generate 2Gb of uncompressed test data(e.g. 12 million rows in the ‘lineitem table’)
A scaling factor of 10, will generate 60Gb of uncompressed test data(e.g. 60 million rows in the ‘lineitem table’)
A scaling factor of 100, will generate 600Gb of uncompressed test data(e.g. 600 million rows in the ‘lineitem table’)
etc.
Excute the tool with one of following commands withe different scaling factors (or your own):
#Scaling factor 1
./dbgen -vf -s 1
#Scaling factor 2
./dbgen -vf -s 2
#Scaling factor 10
./dbgen -vf -s 10
[Note: I suggest your try scaling factor 1 first, till you are comfortable with the load process]
Create the tables and import them in HANA studio
create schema “TPCH”;
set schema “TPCH”;
drop table nation;
drop table region;
drop table supplier;
drop table part;
drop table partsupp;
drop table customer;
drop table orders;
drop table lineitem;
create column table nation (
n_nationkey decimal(3,0) not null,
n_name char(25) not null,
n_regionkey decimal(2,0) not null,
n_comment varchar(152)
);
create column table region (
r_regionkey decimal(2,0) not null,
r_name char(25) not null,
r_comment varchar(152)
);
create column table part (
p_partkey decimal(10,0) not null,
p_name varchar(55) not null,
p_mfgr char(25) not null,
p_brand char(10) not null,
p_type varchar(25) not null,
p_size decimal(2,0) not null,
p_container char(10) not null,
p_retailprice decimal(6,2) not null,
p_comment varchar(23) not null
);
create column table supplier (
s_suppkey decimal(8,0) not null,
s_name char(25) not null,
s_address varchar(40) not null,
s_nationkey decimal(3,0) not null,
s_phone char(15) not null,
s_acctbal decimal(7,2) not null,
s_comment varchar(101) not null
);
create column table partsupp (
ps_partkey decimal(10,0) not null,
ps_suppkey decimal(8,0) not null,
ps_availqty decimal(5,0) not null,
ps_supplycost decimal(6,2) not null,
ps_comment varchar(199) not null
);
create column table customer (
c_custkey decimal(9,0) not null,
c_name varchar(25) not null,
c_address varchar(40) not null,
c_nationkey decimal(3,0) not null,
c_phone char(15) not null,
c_acctbal decimal(7,2) not null,
c_mktsegment char(10) not null,
c_comment varchar(117) not null
);
create column table orders (
o_orderkey decimal(12,0) not null,
o_custkey decimal(9,0) not null,
o_orderstatus char(1) not null,
o_totalprice decimal(8,2) not null,
o_orderdate date not null,
o_orderpriority char(15) not null,
o_clerk char(15) not null,
o_shippriority decimal(1,0) not null,
o_comment varchar(79) not null
);
create column table lineitem (
l_orderkey decimal(12,0) not null,
l_partkey decimal(10,0) not null,
l_suppkey decimal(8,0) not null,
l_linenumber decimal(1,0) not null,
l_quantity decimal(2,0) not null,
l_extendedprice decimal(8,2) not null,
l_discount decimal(3,2) not null,
l_tax decimal(3,2) not null,
l_returnflag char(1) not null,
l_linestatus char(1) not null,
l_shipdate date not null,
l_commitdate date not null,
l_receiptdate date not null,
l_shipinstruct char(25) not null,
l_shipmode char(10) not null,
l_comment varchar(44) not null
);
IMPORT FROM CSV FILE ‘/sap/tpc-h/tpch_2_15.0/dbgen/lineitem.tbl’ INTO “TPCH”.“LINEITEM” WITH RECORD DELIMITED BY ‘\n’ FIELD DELIMITED BY ‘|’;
IMPORT FROM CSV FILE ‘/sap/tpc-h/tpch_2_15.0/dbgen/customer.tbl’ INTO “TPCH”.“CUSTOMER” WITH RECORD DELIMITED BY ‘\n’ FIELD DELIMITED BY ‘|’;
IMPORT FROM CSV FILE ‘/sap/tpc-h/tpch_2_15.0/dbgen/nation.tbl’ INTO “TPCH”.“NATION” WITH RECORD DELIMITED BY ‘\n’ FIELD DELIMITED BY ‘|’;
IMPORT FROM CSV FILE ‘/sap/tpc-h/tpch_2_15.0/dbgen/orders.tbl’ INTO “TPCH”.“ORDERS” WITH RECORD DELIMITED BY ‘\n’ FIELD DELIMITED BY ‘|’;
IMPORT FROM CSV FILE ‘/sap/tpc-h/tpch_2_15.0/dbgen/part.tbl’ INTO “TPCH”.“PART” WITH RECORD DELIMITED BY ‘\n’ FIELD DELIMITED BY ‘|’;
IMPORT FROM CSV FILE ‘/sap/tpc-h/tpch_2_15.0/dbgen/partsupp.tbl’ INTO “TPCH”.“PARTSUPP” WITH RECORD DELIMITED BY ‘\n’ FIELD DELIMITED BY ‘|’;
IMPORT FROM CSV FILE ‘/sap/tpc-h/tpch_2_15.0/dbgen/region.tbl’ INTO “TPCH”.“REGION” WITH RECORD DELIMITED BY ‘\n’ FIELD DELIMITED BY ‘|’;
IMPORT FROM CSV FILE ‘/sap/tpc-h/tpch_2_15.0/dbgen/supplier.tbl’ INTO “TPCH”.“SUPPLIER” WITH RECORD DELIMITED BY ‘\n’ FIELD DELIMITED BY ‘|’;
Congratulations!
You are now ready to start running queries.
Section 2.4 of the following document http://www.tpc.org/tpch/spec/tpch2.15.0.pdf has plenty of query examples for you to try.
You may need to translate these to HANA SQL or be more adventurous and create them with Attribute, Analytic & Calculation views to optimise performance.
NOTE: For those that successfully translate the queries to HANA SQL then please consider adding your translated SQL as a comment to this blog, to share with others.
Have fun bench-marking your results with others.
——————————————————————————————
BTW: I used TPC-H data in my first blog comparing HANA to HADOOP IMPALA.