Skip to Content
Technical Articles

How Much Storage Am I Using in the SAP HANA Cloud Data Lake?

The HANA data lake provides efficient storage and analysis capabilities for large amounts of data.  Storage efficiency is provided using a combination of inexpensive cloud storage and significant data compression.  For example, we commonly see 10x or more compression for CSV format data stored in the data lake and even for big data formats like parquet there is an average 2x compression factor.

One thing an SAP HANA Cloud data lake administrator might want to know is exactly how much storage is in use within the data lake. Knowing this can help with capacity planning activities and cost projections related to operation of the SAP HANA Cloud data lake.

Before I get too far, I want to invite you to sign up for the SAP HANA Cloud trial if you haven’t already registered. With that being said, let’s get started.

To figure out how much storage you are using in your HANA data lake, you can do some calculations based on your row size and number of rows in a table, but due to compression, indices, and meta-information, it is not likely to be very accurate.  This makes it hard to do any real capacity planning or cost estimations for storage in your data lake.

Fortunately, there are already ways to get this information for an on-premise IQ system, and since SAP IQ forms the core of the SAP HANA Cloud data lake, we are able to make these features available as part of the data lake.  Why haven’t we done this already?  There are a large number of features in SAP IQ, and in order to avoid inadvertent issues with the service, we want to validate there will be no issues with the data lake.

Two stored procedures have been exposed in the most recent update of data lake that enable you to determine how much space has been taken up by a table.  They are called sp_iqtablesize() and sp_iqindexsize() and they do exactly what you would expect – return the amount of storage in use for the specified table or index.

Unfortunately, using these procedures is not exactly straightforward (yet).  Currently, in the data lake, in order to see the results of a stored procedure, you have to create a view over the procedure call, like this:

call "SYSRDL#CG".REMOTE_EXECUTE('
    CREATE VIEW VTBLSIZE AS
        SELECT * FROM dbo.sp_iqtablesize( 'myschema.mytable' )
' );

Then, you must create a virtual table against the view in order to see the results:

CREATE VIRTUAL TABLE HDL_VTABLESIZE AT "SYSRDL#CG_SOURCE"."NULL"."SYSRDL#CG"."VTBLSIZE";

After that, you can get the table size information for the table specified by querying the virtual table:

SELECT * FROM HDL_VTABLESIZE;

I know and I agree – this is a painful process. However, it will get easier in the near future with some new SAP HANA Cloud features. These features will be enabled for the data lake soon and will allow automated creation of a data lake table and a virtual table in the same statement (CREATE VIRTUAL TABLE … WITH REMOTE) and a new procedure (called remote_execute_query() ) which will return a result set directly form a native data lake query.  Of course, we should also expect to see this type of storage usage data eventually available through the SAP HANA Cloud UI as well.  Stay tuned!

 

Another drawback to the above example is that you have to create/recreate the view and virtual table for every data lake table you want to get sizing information about.  We can deal with this in two ways.

1. Fetching Sizing Data for all Tables

We can use a view which will retrieve the sizing data for all tables in the data lake.  The following view definition will execute the sp_iqtablesize() procedure for all tables in the data lake:

call "SYSRDL#CG".REMOTE_EXECUTE('
    CREATE OR REPLACE VIEW VTBLSIZE AS
        SELECT OwnerName, TableName, KBytes
        FROM SYS.SYSTAB t KEY JOIN SYS.SYSUSER u,
            dbo.sp_iqtablesize(string(u.user_name, ''.'', t.table_name))
        WHERE u.user_name = CURRENT_USER and t.table_type in (1,2)
');

Then, you need to create a virtual table to retrieve the data from the view (this is the same as the above virtual table definition):

CREATE VIRTUAL TABLE HDL_VTABLESIZE AT "SYSRDL#CG_SOURCE"."NULL"."SYSRDL#CG"."VTBLSIZE";

To get the sizing data for a specific table, we can simply add a where clause when selecting from the virtual table.  For example, to get the sizing data for the table T, execute:

SELECT * FROM HDL_TABLESIZE WHERE TableName = 'T';

The nice thing about this query is that you can query the view without a where clause and get the sizing data for all of the data lake tables, and you can use sum(KBYTES) to get an overall total of storage in the data lake.

 

2. Using Variables

This method is slightly more complicated but is useful in a variety of scenarios involving getting administrative information out of the data lake. We can leverage the broad feature set of the data lake to make this a little bit easier by using a variable in our view.  This avoids executing the sizing procedure against all tables when we may only want the data for a single table.

To make the original view usable for a tablename specified at execution time, we can create it like this:

call "SYSRDL#CG".REMOTE_EXECUTE('
    CREATE OR REPLACE VIEW VTBLSIZE AS SELECT * FROM dbo.sp_iqtablesize( ''SYSRDL#CG.'' + tblname )
');

In the above view definition, tblname is a variable used to hold the name of the table you wish to get sizing information about which is provided at execution time.

Then, you need to create a virtual table to retrieve the data from the view (this is the same as the above virtual table definition):

CREATE VIRTUAL TABLE HDL_VTABLESIZE AT "SYSRDL#CG_SOURCE"."NULL"."SYSRDL#CG"."VTBLSIZE";

Next, when you are ready to get the sizing information for a specific table (or index), you first setup a variable in the data lake by doing the following on your connection:

call "SYSRDL#CG".REMOTE_EXECUTE('
    CREATE OR REPLACE VARIABLE tblname varchar(32);
        SELECT ''T'' INTO tblname;
');

Then you can query the view to get the results:

SELECT * FROM HDL_TABLESIZE;

This will return the sizing data for the table T.

To get the data for a different table, update the variable and re-run the select statement.

I hope you found this all useful. Your comments are welcome!

 

Be the first to leave a comment
You must be Logged on to comment or reply to a post.