A closer look at tables that contain LOB data type columns
Based on SAP HANA SPS 10
Since SPS06, SAP HANA offers flexibility in how it manages columns of the LOB (Large Object) data types, namely Hybrid LOBs. The data types CLOB, NCLOB and BLOB are used to store large amounts of data such as text documents and images. The current maximum size for a LOB on SAP HANA is 2 GB.
In the following blog, we will go through some examples of how Hybrid LOBs are defined, specifically around the memory threshold value and the effect it has on keeping the LOB column data on disk versus allowing it to be loaded into memory.
In an in-memory database like SAP HANA, ideally you only want to load data into memory that is actually required for processing. As SAP HANA does not compress LOB columns, regardless of whether they reside on disk or in memory, it is essential that any compression logic (e.g. gzip) is applied at the application layer when writing to and reading from the database.
As LOB column data will more than likely be stored on disk and only loaded into memory temporarily for processing (depending on the memory threshold value, 1000 bytes by default), it is worth doing specific analysis and sizing on your tables that contain LOB columns to see what works best in your environment.
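As a starting point for that analysis, the data dictionary can tell you which of your tables contain LOB-typed columns. A query along these lines (using the standard TABLE_COLUMNS system view; adjust the schema filter to your landscape) would list the candidates:

```sql
-- List all tables that contain LOB-typed columns
SELECT schema_name,
       table_name,
       column_name,
       data_type_name
  FROM table_columns
 WHERE data_type_name IN ('BLOB', 'CLOB', 'NCLOB')
 ORDER BY schema_name, table_name;
```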
Please check the CREATE TABLE or ALTER TABLE commands for the available options when creating LOB columns.
CREATE TABLE – SAP HANA SQL and System Views Reference – SAP Library
The following guideline is key for the memory threshold value.
- If <memory_threshold_value> is not provided, a hybrid LOB with a memory threshold of 1000 is created by default.
- If <memory_threshold_value> is provided and the LOB size is bigger than the memory threshold value, the LOB data is stored on disk.
- If <memory_threshold_value> is provided and the LOB size is equal to or less than the memory threshold value, the LOB data is stored in memory.
- If <memory_threshold_value> is NULL, all LOB data is stored in memory.
- If <memory_threshold_value> is 0, all LOB data is stored on disk.
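To make these rules concrete, here is a sketch of the corresponding DDL variants (the table and column names are made up for illustration; see the CREATE TABLE reference for the exact syntax on your revision):

```sql
-- Default: hybrid LOB with a 1000-byte memory threshold
CREATE COLUMN TABLE t1 (doc NCLOB);

-- Explicit threshold: values up to 4096 bytes are kept in memory
CREATE COLUMN TABLE t2 (doc NCLOB MEMORY THRESHOLD 4096);

-- NULL threshold: all LOB data kept in memory
CREATE COLUMN TABLE t3 (doc NCLOB MEMORY THRESHOLD NULL);

-- Threshold 0: all LOB data kept on disk
CREATE COLUMN TABLE t4 (doc NCLOB MEMORY THRESHOLD 0);
```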
Create Table with LOB column
From the reference guide on data types, Data Types – SAP HANA SQL and System Views Reference – SAP Library
Let’s try it out by creating a simple table with three columns:
CREATE COLUMN TABLE "HOLLAS"."SH_TEST_XML" (
    "XML_ID"     VARBINARY(16) CS_RAW NOT NULL,
    "LINE_NO"    NVARCHAR(10) DEFAULT '0000000000' NOT NULL,
    "XML_STRING" NCLOB,
    PRIMARY KEY ("XML_ID", "LINE_NO"))
As we left the memory_threshold_value blank on the create, we expect the default of 1000 bytes. Let’s confirm.
(As an aside, it wasn’t altogether obvious in which table or column I would find the memory threshold value, but it’s always good to check TABLE_COLUMNS or VIEW_COLUMNS for a COLUMN_NAME (in this case LIKE ‘%MEM%THRES%’) to see which tables or views may have a column with this name.)
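That dictionary search can be expressed as a query like the following (the LIKE pattern is just the guess described above):

```sql
-- Find which tables and views expose a memory-threshold column
SELECT table_name AS object_name, column_name
  FROM table_columns
 WHERE column_name LIKE '%MEM%THRES%'
UNION ALL
SELECT view_name, column_name
  FROM view_columns
 WHERE column_name LIKE '%MEM%THRES%';
```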
I had a similar table with XML string values to hand, so for this example I inserted 10,000 rows into our test XML table. The original XML string value had been truncated into 1 KB row segments, with the last row in each group holding the overflow value, as below.
So for our example, I wanted to unload the data, alter the threshold value to 0, reload the table and confirm that the LOB column data does not reside in memory.
unload "HOLLAS"."SH_TEST_XML";
alter table "HOLLAS"."SH_TEST_XML" alter ("XML_STRING" NCLOB MEMORY THRESHOLD 0);
load "HOLLAS"."SH_TEST_XML" all;
Looking at the definition of the table after the load, the memory consumption is, as expected, very small compared to the overall size of the table; we presume the first two non-LOB columns are loaded into memory. As we had set the memory threshold value to 0, this forced the LOB column data out of memory and onto disk.
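One way to verify this at column granularity (assuming your revision exposes these columns in the M_CS_COLUMNS monitoring view) is to compare the per-column memory footprint; with threshold 0, the NCLOB column should report close to zero:

```sql
-- Per-column memory footprint of the test table
SELECT column_name,
       memory_size_in_total
  FROM m_cs_columns
 WHERE schema_name = 'HOLLAS'
   AND table_name  = 'SH_TEST_XML';
```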
Now, let’s set it back to the default of 1000 bytes. Our expectation is that the memory consumption will only marginally increase, given that only 37 of the 10,000 rows had a length < 1000.
Statement 'SELECT x.*, LENGTH(xml_string) FROM "HOLLAS"."SH_TEST_XML" AS x WHERE LENGTH(xml_string) < 1000'
successfully executed in 485 ms 503 µs (server processing time: 272 ms 18 µs)
Fetched 37 row(s) in 111 ms 384 µs (server processing time: 0 ms 299 µs)
unload "HOLLAS"."SH_TEST_XML";
alter table "HOLLAS"."SH_TEST_XML" alter ("XML_STRING" NCLOB MEMORY THRESHOLD 1000);
load "HOLLAS"."SH_TEST_XML" all;
Refreshing the table definition, the memory consumption increase is marginal as expected.
Aside: one thing we’ve noticed with tables with LOB columns is that the disk size shown above is well above the initial size; it appears to increment after every ALTER command, as if a copy of the table were kept before and after. I’m not entirely sure why it shows up like that, but I won’t spend time on it now, being near the end of this blog 🙂
One more test: let’s set the threshold above the maximum value of the LOB column size, in this case 1024 bytes.
unload "HOLLAS"."SH_TEST_XML";
alter table "HOLLAS"."SH_TEST_XML" alter ("XML_STRING" NCLOB MEMORY THRESHOLD 1024);
load "HOLLAS"."SH_TEST_XML" all;
As expected, the entire table is now loaded into memory, as reflected in the table definition below.
Just a note on the hybrid LOB container values: the idea with a container, or group of LOB column values, is that, I believe, a small LOB table can be part of the same container and get loaded in and out of memory at the same time. I was not able to prove this out, but would welcome documentation links on how to do so.
You can see the container value information for the tables in M_TABLE_LOB_FILES.
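For example, something along these lines should surface the LOB file information for our test table (hedged — check the exact column list of M_TABLE_LOB_FILES in the system views reference for your revision):

```sql
-- Inspect LOB file/container information for the test table
SELECT *
  FROM m_table_lob_files
 WHERE table_oid = (SELECT table_oid
                      FROM sys.tables
                     WHERE schema_name = 'HOLLAS'
                       AND table_name  = 'SH_TEST_XML');
```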
What I like about the Hybrid LOB memory threshold approach
- It gives you the flexibility to control whether the LOB columns are loaded into memory or not. In the case above, this was definitely based on character size; I did not explore whether this comparison value changes for, say, a BLOB column type (e.g. to binary size).
What I didn’t like
- Although the documentation is, as always, excellent, there was very little discussion and few examples given on tables with LOB columns. This is the main reason I put this blog together.
- Limiting by just a threshold value isn’t very flexible; it would be good if it were percentage-based with an LRU-type approach.
- To change the threshold value requires an ALTER command, which in turn locks the table. For a transaction table this may require downtime, and if it’s a large table (e.g. hundreds of millions of rows), it may take hours to change the parameter value. In the case where you increase the memory threshold, the data is automatically loaded into memory, which may increase the time the ALTER statement takes to execute.
Store big size tables with BLOB on disk instead… | SCN
I really like that you indicated the version you tried this on.
What I don't quite understand: you've got an NCLOB field that can store up to 2 GB of data and still you chop up your XML document into 1K blocks. Why? This way you end up with a lot more processing and data transfer effort than you would if you simply stored the whole XML document in the field of a single row.
My second question is about your idea to have the threshold setting based on some percentage and LRU mechanism. What is your use case for that?
Also, keep in mind that with this approach, the total amount of data in your table defines what data would be kept in memory and which on disk. This kind of dynamic is pretty unpredictable - not sure you want that.
Hi Lars, thanks for taking the time to read and critique the blog.
On the xml data, this actually came from a standard CRM table, when we went live we had missed a configuration step to compress our SSC xml data, so it was stored in this format (group segments x 1kb rows), more details here.
We had to manually convert the existing uncompressed XML data; a lot of the pain points mentioned above came out of that manual conversion process. The threshold was of little benefit to us in this case, as 99% of the data was above the default 1000 bytes, so we let a long-running ABAP batch program execute over a few days to do the conversion.
As we've now enabled compression of our SSC data and run our conversion, the complete XML data set is now stored in one row, as you suggested above.
On the second question: the current threshold value is also dependent on the data in the table, and is dynamic depending on its contents. I believe a percentage and/or LRU option would be as useful (or useless 🙂). A possible use case is a catalog containing images, where the most frequently viewed stay in memory. Or, in our case, configuration XML data that is currently being worked on or analysed stays in memory, but over time goes out to disk when replaced by more recent configurations.
Changing the threshold also triggered a load into or unload from memory (depending on whether you increased or reduced the threshold value), during which time the table was locked for processing. It would be good if we had an option for this load/unload not to happen automatically, with the table instead adjusting itself over time based on regular transaction data.
"On the second question, the current threshold value is also dependent on the data in the table, and is dynamic depending on the contents of the table."
That's not correct. The threshold works on the size of the LOB data in a specific column and is independent of the overall table data.
So "short" data always stays in memory (if the whole column doesn't get unloaded). "Long" gets loaded selectively into memory and stays there until it gets replaced.
While there is afaik no prioritisation (like LRU list) for the "long" data items, this still should lead to the effect that data that is often in use is not unloaded too often.
Data that is currently used never gets unloaded anyway.
Concerning the locking behaviour: nearly all DDL statements require exclusive locks and the common assumption of course is that this doesn't happen too often.
For the examples where DDL can be done in an online fashion, there typically is a "price" involved like decreased performance, higher memory consumption, etc.
Your experiences with the conversion from one data model to another are quite typical I'd say - especially when the process and all the side effects (especially on production systems) are not completely clear before the process.
Thanks again for sharing them here!
Hi Lars, just so I understand this: when the threshold is set against a LOB column in the table, it is a fixed value (e.g. the default 1000 bytes). While this is not dynamic, the contents of the LOB column can, and more than likely will, differ in size, so I guess my point is that the portion of the LOB column that gets loaded into memory is dependent on the contents of that column.
I thought I had read somewhere that LOB data over the threshold will only be loaded into memory temporarily during processing, but thanks for clearing that up with the "long" reference above.
Correct, the threshold is a fixed value. For the developer this means that you know in advance that any values below the threshold will be in memory. Longer values might be, but will likely require disk access.
So, the storage handling is predictable for each value by itself, regardless of the rest of the table.
Good morning Sean Holland
Related to this blog post, I would like to understand how I can cast, format, or manipulate a LOB data type to be able to read its whole content.
Could you let me know a bit more about your experimentation?
Maybe it is something related to the undoing of modifications to the LOB: a performance "feature" that always re-inserts LOBs on each modification, which is probably why the LOB reclaim needs to be run.
Thanks for this blog,
I tried to import a 34 MB and a 5 MB XML file, each as a whole file in one row like you, but I discovered that the files were not complete: only 406451 characters were imported from the first file and 4351351 characters from the second.
First I create the table:
create table TEST_IMPORT_WHOLE_FILE_NCLOB(FILE_CLOB NCLOB)
then I import the file:
IMPORT FROM CSV FILE '/data_rec/test/FILE001.xml' INTO TEST_IMPORT_WHOLE_FILE_NCLOB
WITH RECORD DELIMITED BY '¤'
I use '¤' so as not to split the file into multiple rows, since this character is not present in my XML file.
But the data in the field FILE_CLOB does not contain the whole file.
Can you please share how you imported your XML file into one row?
We are using HANA 2 SP04 on SUSE Linux Enterprise Server 12 SP4.