
Best Practices for SAP HANA Data Loads

I’ve been involved in a few SAP HANA projects where performance has been initially disappointing, and I have been asked to come and show the customer HANA speed. As I’ve said before, you can take the best technology in the world, create a bad design, and it will work badly. Yes, even SAP HANA can be slow.

 

With that in mind, here are my best practices for HANA data loading.

 

1) Build the right architecture

 

Every technology has limits, and HANA is constrained by the same laws of physics as any other computer system. The major differences with HANA are:

 

a) Data is kept entirely in-memory and therefore accessing all data has the same latency

b) Data is compressed in-memory and only uncompressed in the CPU itself

c) Data may be loaded into all cores (up to 80) simultaneously and a single instruction executed against all at the same time (massively parallel)

 

If you think about all this, then by far the major challenge with HANA is getting optimized data in fast enough. Networks bottom out at 300MB/sec, disks are often even slower and if you need to load HANA data fast, then you need to think carefully about your design.

 

2) Optimizing the data model

 

HANA doesn’t work like Oracle, so you need to adjust your thinking. Here are the golden rules for HANA tables:

 

  • Keep tables as narrow as possible. HANA has a cost for columnar inserts, so you get a reduction in data load performance with an increase in the number of columns. Don’t build out fields you don’t need. Interestingly, very wide tables have lower throughput in MB/sec than narrower tables. Once you get over 80-100 columns, you will see up to a 25% degradation in throughput.
  • Consider your partitioning strategy. On a single node appliance 3-5 partitions is normally a good number for optimal insert performance. On a scale-out appliance, you can multiply this by the number of active nodes over which you distribute the tables.
  • Use primary keys only when required, for transactional apps. If the tables are denormalized then it may not be necessary to have a primary key.
  • Use the most efficient datatype for your scenario. For instance, TIMESTAMP is much more expensive than DATE, if you don’t need the time. Same with large integer and decimal fields – they are more expensive because they can’t use such efficient SIMD instructions.

 

Good:

CREATE COLUMN TABLE "SCHEMA"."TABLE" (
"TIME" TIME,
"FIELD1" VARCHAR(1),
"FIELD2" VARCHAR(16),
"FIELD3" VARCHAR(4),
"MEASURE1" INT,
"MEASURE2" DECIMAL(7,4))
PARTITION BY ROUNDROBIN PARTITIONS 12;

Bad:

CREATE COLUMN TABLE "SCHEMA"."TABLE" (
"TIME" TIMESTAMP,
"FIELD1" VARCHAR(255),
"FIELD2" VARCHAR(255),
"FIELD3" VARCHAR(255),
"MEASURE1" BIGINT,
"MEASURE2" DECIMAL(34,7),
"UNUSED_FIELD" VARCHAR(2000));

 

Note that in my test example here, the good table loads 6x faster than the bad one. Note that I use 12 partitions because I have a 4-node environment with one master node, and the table partitions distributed over the remaining nodes – 4 partitions in each of 3 nodes. This gives maximum table distribution and therefore throughput.

 

Note that merge deltas operate better with multiple partitions. But if you really want to learn about table partitioning, read “Table Partitioning in the SAP HANA database” section in the “SAP HANA Administration guide”, here: SAP HANA Appliance Software – SAP Help Portal Page
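
As a minimal sketch (reusing the same hypothetical SCHEMA.TABLE names as above), hash partitioning on a reasonably selective column is the usual alternative to ROUNDROBIN: inserts still spread across all partitions, and queries that filter on the partitioning column can prune partitions.

-- Sketch: hash-partition on FIELD2 instead of ROUNDROBIN.
CREATE COLUMN TABLE "SCHEMA"."TABLE_HASH" (
"TIME" TIME,
"FIELD2" VARCHAR(16),
"MEASURE1" INT)
PARTITION BY HASH ("FIELD2") PARTITIONS 12;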

 

3) Get the upper bound by optimizing a control file load

 

I always test control file loads with HANA first, using CSV files. If your files are in the right format then control files are the fastest way to load into HANA: the control file loader was written by the bulk loader team and you can get amazing load performance. This is how I do it:

 

a) Turn off database logging

 

On a single-node appliance, or a scale-out appliance like IBM or Hitachi that uses Fusion-IO for logs, you will get the same performance with and without logging enabled. The only difference is you will fill up the log disk and crash the database. Turn them off until you do your final testing.

 

You do this by double-clicking your database node on the left-hand side of HANA Studio, then clicking Configuration -> global.ini -> persistence. Set enable_auto_log_backup to no and log_mode to overwrite.
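
If you prefer to script it, the same settings can be applied with SQL (a sketch; remember to set log_mode back to normal and re-enable log backup before your final testing):

-- Apply the same persistence settings system-wide via SQL.
ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'SYSTEM')
SET ('persistence', 'enable_auto_log_backup') = 'no' WITH RECONFIGURE;
ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'SYSTEM')
SET ('persistence', 'log_mode') = 'overwrite' WITH RECONFIGURE;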

 

b) Get the file in CSV format with the minimum number of columns

 

I usually use a bash script to do this, with a combination of grep, awk, sed, cut and even perl if it gets ugly. You can convert fixed format, or any format, to nice clean CSV files. Perl even has a CSV library that you can use to do this. Here's a neat example that sucks in your CSV and gives you the number of rows, columns and the maximum length of each column.

 

#!/bin/bash

let a=1
COLUMNS=`awk 'BEGIN {FS=","} ; END {print NF}' test.csv`
ROWS=`wc -l < test.csv`
echo $ROWS ' rows'
echo $COLUMNS ' columns'

# for each column, print its index and the length of its longest value
while ((a <= COLUMNS)); do
echo -n $a
cut -f $a -d, < test.csv | awk '{ if (length($0) > max) {max = length($0); maxline = $0} } END { print maxline }' | wc -c
let a++
done

 

Quite often, I create demo data in my lab based on a customer’s schema. I use one of a few tricks to do this:

 

i) Excel

 

Excel is great for this and you can create a million rows at a time using tricky functions like this (put your values, weighted, in Sheet 2 Column A):

 

=INDEX(Sheet2!$A:$A,RANDBETWEEN(1,COUNTA(Sheet2!$A:$A)),1)

 

Then I copy the CSV file to the HANA appliance and run a bunch of commands like:

 

for a in {1..9}; do cat test.csv | sed s/2013-02-04/2013-02-0$a/; done >> testbig.csv

for a in {10..28}; do cat test.csv | sed s/2013-02-04/2013-02-$a/; done >> testbig.csv

 

This allows me to explode a 1m row file to as many rows as I like.

 

ii) Data Generator

 

For more complex files I use a tool called DTM Data Generator. This allows me to generate complex files at about 10,000 rows/sec to a CSV file (don’t use it to generate data directly into HANA, performance sucks). Overnight I can generate about 360m rows, which is enough for testing. It costs $149 for the Standard Edition, which allows 10m rows per job. So I create the job I want and duplicate it 50 times. If I wasn’t so cheap, I’d buy the Enterprise Edition at $349, which allows 2bn rows per job and multithreads.

 

c) Time the load of the file to /dev/null

 

# du -m table.csv

30358 table.csv

 

# time cat table.csv > /dev/null

real 0m9.667s

 

So I’ve put this 346m row, 30GB file on a Fusion IO disk and I can read it at 3140MB/sec (or 35.8m rows/sec). I know now that this is the absolute maximum possible to ingest into HANA. Bear in mind that in the real world, it’s really hard to move data around at more than 100MB/sec – networks are slow.

 

d) Build a control file

 

Now I create a control file, like this:

 

cat > /sapmnt/log/test.ctl
import data
into table SCHEMA.TABLE
from '/sapmnt/log/test.csv'
record delimited by '\n'
field delimited by ','
optionally enclosed by '"'
error log /sapmnt/log/test.bad

 

And what’s great is we can iterate over a) the number of partitions b) the number of threads c) the batch size

 

e) Make sure the load runs well

 

Make sure you aren't generating lines in the BAD file, because this slows down loading substantially. The data needs to be clean. Fix it with the UNIX tools above (usually sed, cut, awk, grep) if necessary.
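
A quick way to confirm the load is clean is to compare row counts; here is a sketch using the example table (check the result against wc -l on the CSV, and look at what is still sitting in the delta store):

-- Committed rows in the table, and rows still in the delta store.
SELECT COUNT(*) FROM "SCHEMA"."TABLE";
SELECT TABLE_NAME, RECORD_COUNT, RAW_RECORD_COUNT_IN_DELTA
FROM M_CS_TABLES
WHERE SCHEMA_NAME = 'SCHEMA' AND TABLE_NAME = 'TABLE';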

 

f) Iterate, iterate, iterate

 

I make sure my CSV file has enough data to make HANA think (20-30GB is a good amount) and I iterate. For my example, my 346m row file is 30358MB (30GB). With the default options (1 partition, 1 thread, default batch size), the load happens in 1 hour, 35 minutes 30 seconds. This is not HANA speed!

 

So I start with a rough guess at the parameters:

 

  • 5 partitions per active table node (15 partitions in my example)
  • The same number of threads as I have CPUs multiplied by active nodes (120 threads in my example)
  • Batch size depending on the width of the table. For a simple table like this, I run with 200000. For much wider tables with 50-100 columns, I start with 20000.

 

With the SQL below, I get improved performance down from 5730s to 88s. That’s more like it, and now we start tweaking to get the optimal settings for those 3 parameters via iteration. I expect this to improve performance a further 30% in most cases. You may not think it’s worth it, but that extra 30% matters to me.

 

DROP TABLE "SCHEMA"."TABLE";

CREATE COLUMN TABLE "SCHEMA"."TABLE" (
"TIME" TIME,
"FIELD1" CHAR(1),
"FIELD2" CHAR(16),
"FIELD3" CHAR(4),
"MEASURE1" INT,
"MEASURE2" DECIMAL(7,4))
PARTITION BY ROUNDROBIN PARTITIONS 15;

ALTER TABLE "SCHEMA"."TABLE" DISABLE AUTOMERGE;
IMPORT FROM '/sapmnt/log/test.ctl' WITH THREADS 120 BATCH 200000;
MERGE DELTA OF "SCHEMA"."TABLE";
ALTER TABLE "SCHEMA"."TABLE" ENABLE AUTOMERGE;

 

And I run it lots of times. I iterate on threads, then partition, then batch, and sometimes again. I build a table with load times and tune it until I feel like I’ve got HANA speed. The key is to copy and paste this statement 50 times into a notepad file and then change the parameters for each load. Execute it as one long SQL statement and leave it running overnight, loading, dropping and loading 50 times or more over.
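
As a sketch, one of those pasted iterations looks like this; only the THREADS, BATCH and partition count change between runs (TRUNCATE is a lighter alternative to dropping and recreating the table when the partition count stays the same):

-- One iteration: clear the table, load with a new parameter combination,
-- force the merge, and note the elapsed time before the next run.
TRUNCATE TABLE "SCHEMA"."TABLE";
ALTER TABLE "SCHEMA"."TABLE" DISABLE AUTOMERGE;
IMPORT FROM '/sapmnt/log/test.ctl' WITH THREADS 80 BATCH 100000;
MERGE DELTA OF "SCHEMA"."TABLE";
ALTER TABLE "SCHEMA"."TABLE" ENABLE AUTOMERGE;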

 

What’s HANA speed? In my example down from 5730 seconds to 60 seconds – 95x performance increase. Nearly 6m fully committed rows/second and 500MB/sec throughput. On a single-node Medium appliance with 40 cores, you should expect around 275MB/sec of throughput into HANA. In any case we are around the throughput of 10GBe Ethernet.

 

Anyhow in most cases, control file loads aren’t a good way to load into HANA long term, because they lack flexibility and the ability to process bad data. But now you have an upper bound for how fast you can ingest. The Data Services team tells me they look to get performance that approaches the HANA bulk loader, for example.

 

4) Design the right Architecture

 

Now you know enough about your data to design the right loading architecture for HANA. Note one really important thing: I like to think of HANA as a hungry caterpillar (reminds me of The Cure song Lullaby, but anyhow). How fast you get data into HANA is almost always limited, in a well-designed environment, by how fast you can supply it.

 

This means that you need fast storage where the data to be loaded resides, fast application server threads to process it, and fast plumbing (always 10Gb Ethernet) to move data around. If you don't have all of this, your HANA in-memory appliance will go to waste. Don't think you can run Data Services on a VMWare farm, for example, and expect to move data around at more than 80MB/sec.

 

From there, the choice of tool, or tools, is critical. Use this table as a rule of thumb.

 

Use Case                 Control File   Data Services   LT   SRS   ESP
Benchmarking and PoCs         X               O          O    O     O
Batch Loads                   O               X          -    -     O
Transformations               -               X          O    O     X
Fixing Data Quality           -               X          -    -     O
Streaming Data                -               -          -    -     X
Real-Time Replication         -               O          X    X     O
Replication From SAP          -               O          X    O     O

Tell me what else you want to see here!

X – core purpose

O – offers the functionality

"-" – does not offer the functionality

 

a) Control Files

 

As I’ve said before, Control Files are great for benchmarking and PoCs. And for beating up other ETL teams to get near their performance. After that, you aren’t likely to use them much in a productive system.

 

b) Data Services

 

Data Services is a great general purpose batch tool. It offers real-time web services but there are better options if you want real-time. It also allows great data transformation, matching and enriching, but every additional transform you add to your batch job will slow it down, so be careful.

 

In addition if you are loading files from weird formats, like fixed format files, then Data Services is great. There are lots of tips and tricks for SAP HANA to be found at http://wiki.sdn.sap.com/wiki/display/EIM/Data+Services but there are 3 golden rules I apply:

 

i) Do major transforms on the way into SAP HANA, not in SAP HANA. This advice will change because they are integrating the Data Services ETL engine into SAP HANA, but for now, get your data formats right (e.g. combine date and time into a timestamp) on the way in, rather than using materialized views in SAP HANA, which is expensive (see the sketch after these three rules).

 

ii) Dial in the main parameters (Parallel Process threads in the file loader, bulk loading on the target table, and the commit size and number of loaders in the HANA table loader) for your scenario. I start with 40 file loader threads, a 10,000 commit size and 4 HANA loaders and tweak from there. Data Services takes care of optimized merge deltas, which is very cool.

 

iii) Make sure you have plenty of power on the Data Services box (you can easily consume 16-20 cores of x86 CPU), make sure the threads are fast, and make sure you have 10GbE Ethernet between HANA and Data Services.
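
For reference, here is a sketch of the kind of in-HANA conversion that rule i) is trying to avoid (the TS, DATE and TIME columns are hypothetical): materializing a combined timestamp after the load is expensive at volume, whereas Data Services can deliver it ready-made on the way in.

-- Sketch only: combining separate DATE and TIME columns into a TIMESTAMP
-- inside HANA after loading. Avoid this for large tables.
UPDATE "SCHEMA"."TABLE"
SET "TS" = TO_TIMESTAMP(TO_VARCHAR("DATE", 'YYYY-MM-DD') || ' ' || TO_VARCHAR("TIME", 'HH24:MI:SS'), 'YYYY-MM-DD HH24:MI:SS');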

 

c) SAP Landscape Transformation (LT)

 

LT is a very neat solution for replicating tables from SAP systems. It is trigger-based so there is a small overhead, and it is therefore suited to low-mid volume transactional tables like sales orders. It can be used to replicate tables from non-SAP databases like Oracle and this makes sense in a SAP shop. If you’re not a SAP shop then you wouldn’t use LT because it requires a SAP NetWeaver platform (either standalone for LT, or built into another SAP product like the Business Suite).

 

In these scenarios, LT is easy to set up to replicate tables and this is especially cool with SAP’s Rapid Deployment Solutions (RDS), which provide pre-packaged HANA content that consume database tables from LT.

 

The main things to note with LT are the need for a Unicode LT system (can be used against non-Unicode source ERP) and the need to patch your ERP system to a recent kernel, which may trigger a regression test for your production environment. Also note that ERP, LT and HANA must be on a dedicated LT 10GBe Ethernet network for acceptable latency.

 

d) Sybase Replication Server (SRS)

 

SRS was originally the replication technology of choice for SAP ERP scenarios, before the advent of LT. It has several benefits over LT, because it scrapes database logs, which has lower latency and overhead than the trigger-based LT. Stock Exchanges in Capital Markets use SRS to replicate trading data between continents, it’s that fast.

 

You do however need to have a full Enterprise license for your source DB, and many SAP customers buy a restricted runtime license via the Software Application Value (SAV) program, which allows them to rent their Oracle or IBM database for 8-15% of their Application Value. If you fall into this category, then you cannot use SRS legally.

 

In addition, because it scrapes database logs, SRS has some technical dependencies like Unicode, certain versions and certain databases.

 

But if what you want is a high-performance sidecar of an existing transactional database into SAP HANA, then SRS is a very interesting product, if niche.

 

e) Sybase Event Stream Processing (ESP)

 

ESP is the dark horse of the ETL family because it can do everything, and it can do some things very well. Its core purpose is stream processing, and you can look at the number of different streams it supports here: SyBooks Online – including Tibco Rendezvous, Capital Markets streams like Wombat and FIX, and files.

 

Its flexibility is excellent, and you can read fixed-format, CSV, FIX, XML or any other kind of data and output it to a stream, or to a database or file adapter. What’s more, ESP is being integrated into the HANA database so you will be able to do all of this in-memory. ESP is low-latency and can shift a very large amount of messages a second (I hear millions), depending on the hardware and networks you have at your disposal. What’s more you can store a buffer of data in memory for doing time stream processing.

 

As with any stream processor, when you hit saturation, incoming and outbound messages start to buffer in RAM, and when RAM fills up, they spill to disk, and then bad things happen. As with Data Services, you can adapt the number of inbound and outbound feeds to get the performance you need, and adjust the output batch accordingly. Note that ESP is very well integrated with the SAP HANA ODBC interface and you can expect to get excellent performance.

 

5) Benchmark and iterate

 

Now you have chosen a product for your loading, you need to benchmark and iterate to try and get as close to the control file performance as possible.

 

In my tests, I found that the bulk loader performed about 2x as well as anything else, but this is sharing CPU capacity with SAP HANA. I expect that by putting Data Services or ESP onto a separate system with low-latency, high-bandwidth 10GBe network, you can get close to 70-80% of bulk loading performance.

 

Conclusions

 

A number of people I spoke to suggested that a set of best practices for data loads wasn't possible, because every customer is different. I'd counter that by saying that a set of best practices is required to understand which tool should be used and when, depending on a potentially complex set of criteria. My first conclusion is that, with some thought, it is simple to select the right tool.

 

The second learning I have is that each data set performs slightly differently with HANA because of the complexities of massive parallel processing, columnar compression and column widths and complexities, plus the need to transform data. Once you have selected the correct tool, it is necessary to iterate to get optimal settings – and iterating is well worth it.

 

The third learning is to make sure you do not compare HANA bulk loading to Oracle bulk loading. HANA bulk loading produces fully report-ready data. Oracle bulk loading literally dumps data into database rows, with no indexes, aggregates or save point; it breaks the ACID properties of the database until a backup is made. There is no point in making this comparison.

 

And a big shout out to:

 

The IBM X5 appliance that I use for testing. It is a 4x512GB = 2TB, 160 core appliance in a 22U rack.

 

All my contacts at SAP who have assisted in this, including Werner Daehn, David Rosenblum, Werner Steyn, Stefan Sigg, Keith Wood and Tom Ziek.

 

Lloyd Palfrey at Bluefin who helped out on the HANA appliance side and Kevin Turnbull and team for creating virtual machines at short notice and generally putting up with me.

Comments
  • Super blog! I just want to add that for scenarios when I have the SFLIGHT database (), I generate random data using SQL script:

    INSERT INTO "TEST"."TEST_COL"
    SELECT "TEST"."KEY_SEQ".nextval,
    CASE
    WHEN to_int(rand() * 2) = 0
    THEN 'M'
    ELSE 'F'
    END
    FROM "SFLIGHT"."SBOOK";

    TEST_COL has two columns – a key field and a one character field. The KEY_SEQ sequence is used to generate integer numbers from zero to +infinity.

    • Thanks I’m glad you enjoyed, I hope to make someone’s project more successful.

      You can do this, but it’s pretty expensive in SQLScript if you want to generate very large volumes of data, plus you don’t really get to test data loads. There are other options I’ve seen like a Python script that SAP use to generate raw data. I’d share that but it’s not my IP.

      I’m a big fan of using perl to generate data too as it’s easy to use and you can run it on the Linux HANA server so there’s no hassles moving data around.

      If you use Excel, remember to turn automatic recalculation off, and recalculate manually with F9. Also tip – create the first few rows then select the last of those rows, then click the cell range at the top left and type A3:F1000000 then Fill -> Down. You can create a million rows in a second on a fast machine.

  • I enjoy reading your blogs and am an avid follower – props to you John for this amazing blog, and sharing your practical know-how especially in this exciting product.

  • Thank you for the article. I have a slightly different issue. I need to load 3000 csv files into one HANA table. Do I need to create 3000 control files? Is there an easier way? Thanks in advance!

    • If you use control files you need to either write 3000 control files (you can use sed to generate them), or concatenate the CSVs.

      Better still is to use Data Services, where you can apply wildcards to filenames. So you can load “table*.csv” or “*.csv” within a folder on the server. I’ve used this to successfully load data from multiple days in multiple files.

  • Hi,

    great blog. You have mentioned using CHAR instead of NVARCHAR, but the reference says nothing about a fixed-length CHAR data type. Is it planned for a future release?

  • By far the best “Best Practice” document for HANA I have seen yet. Thanks a lot for sharing your experiences and knowledge.

    One comment regarding data loading from control file on a single node, I have found the optimal thread size to be about 80% of the available CPUs (after hundreds of iterations)

  • Just want to add my thanks as well. Haven’t been hands-on much with HANA yet and in my experience it’s been hard to find many articles that give you real insights and useful technical details like this instead of a whole load of hype and marketing waffle. Cheers!

    • In a similar vein I’d love to read your take on HANA One running on AWS if you’ve looked at it, given the architectural design considerations you detail above. Specifically I can’t help but wonder what the point of the new HANA One Premium offering is.

      As per usual it takes a while to penetrate the SAP marketing guff, but once you do get to the details I’m a bit confused. Sure there are fast 10Gbit Ethernet interconnects between systems within AWS but it supports using backend data sources such as BW – how will data throughput from on-premise servers be anything but a huge bottleneck in this case (and therefore why bother with moving to and optimising data for HANA in the first place)?

      I know there are relatively small datasets involved (coincidentally they quote 30GB as the max recommended dataset size, the same figure as in your example data load) but I don’t see how this would help, except maybe for very specific use cases where the data changes infrequently and it’s used more for fast pulling of data that stays unchanged in the HANA instance (essentially just reporting). If this offering was intended just for sandpit-style uses as a way of conducting easy proof-of-concept/prototyping that’d be one thing, but this offering seems to be geared at productive use (full SAP standard support etc.).

      • John,

        I'd like to address your questions about HANA One and HANA One Premium. Let me start with key differences between HANA One and HANA One Premium:

        • HANA One is the ideal offering for an innovation use case that does not connect to other SAP software (license limitation). Nevertheless, it gives you the full feature set and power of SAP HANA.
        • HANA One Premium provides the license to connect to the SAP backend like ERP or BW. It also comes bundled with SAP HANA Cloud Integration as well as official SAP support.

        Of course we have tested integration scenarios of HANA One Premium with an on-premise backend and we did not see any performance problems. However, each use case is different and the best way to explore is to try it out yourself.

        For that purpose, we are in the process of building a full blown exploration and trial environment involving several live test systems hosted in the cloud and provided to customers a few hours at a time at SAP expense. This HANA One Premium trial offering will be tied to the HANA Marketplace once it is completed, so please check there.

        I hope this answers your key question and here are some additional resources for you to explore:

        • Detailed comparison between different HANA deployment options
        • Webinar explaining HANA One Premium
        • HANA One webinar channel; we are presenting customer and partner success stories at least every two weeks. Check out our customers Globe and Mail for Online Clickstream Analysis or Mantis for Social Sentiment analysis on HANA One (both of them are Big Data use cases and HANA One handles them very well!).

        Hope this helps and let me know if you have any more questions.

        Cheers,

        Swen

        PS For full disclosure, I am marketing owner for all cloud based SAP HANA offerings

        • Thanks for the reply Swen, appreciate it – always good to see you guys engaging like this. I’m sure you have tested those scenarios, it’s just that everything “hands-on” that I’ve read left me with a lot of questions as to how effective this approach could be. That trial/exploration environment sounds very interesting – it’d be great if I get a chance to answer them for myself! I’ll keep an eye on the HANA Marketplace and will check out those other links as well. Again, thanks for the reply and the pointers.

  • John, thanks a lot for the great post!

    I have 2 questions:

    1) I use Data Services and need to convert the date in unix ticks format from source table (which look in SQL-server for example as 1498798800000) into normal date time format while inserting into HANA. Could you advise the best way to do it?

    2) In case of source data format DECIMAL (20,0) what format is preferable to use in HANA for target table: INTEGER or DECIMAL (20,0).

    Thanks.  

    • Hi Nataliya,

      I’m glad you enjoyed!

      1) I don’t think that Data Services handles UNIX Epoch directly, but it’s easy to model: it’s the number of seconds since 01 Jan 1970. You can do the math using div and mod:

      date = '19700101'.
      date = date + unixepoch div 86400.
      time = unixepoch mod 86400.
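
      If the conversion has to happen on the HANA side instead, here is a minimal SQL sketch, assuming the source value is in milliseconds as in your example:

      -- Sketch: convert a millisecond Unix epoch value to a timestamp
      -- by adding whole seconds to 1970-01-01.
      SELECT ADD_SECONDS(TO_TIMESTAMP('1970-01-01 00:00:00'), 1498798800000 / 1000) AS converted_ts
      FROM DUMMY;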

      2) You’re not giving me enough information about the source data to give a definitive answer but my general advice is to store the data once and store it well. Single precision decimal (stored as 32-bit) is up to (8,x) and double precision (stored as 64-bit) is up to (17,x). If you use (20,x) then it will be stored as a 128-bit representation which will be slower.

      BIGINT is 17 decimal places of precision and is stored in the same way as DECIMAL(17,0) in HANA. INTEGER is 8 decimal places of precision (actually up to 2bn) and stored in the same way as DECIMAL(8,0).

      Hope this helps!

      John

  • John, thank you very much for sharing these valuable insights. We need to replicate real time data from Oracle into a standalone SAP HANA implementation. We have Goldengate for Java (JMS) working on Oracle and would now like to use this to replicate the live data to SAP HANA. Our understanding is that we can use the SLT or the JMS adaptor. Do you have any tips on how to replicate oracle live data into HANA? Can the JMS adaptor be used with SAP HANA or is it limited to SAP ?

    Many thanks

    Marianne

    • Hi Marianne,

      Glad it helped you. So you can replicate data from Oracle to HANA using either SLT or Sybase Replication Server. If you already have SAP in your landscape then SLT is a good option and it is included in the HANA Enterprise license (check with your account team for your specifics). SLT allows replication from both SAP and non-SAP Oracle systems.

      In theory you could use Golden Gate but Oracle haven’t certified GG for use with JMS/HANA (or any of their other DB tools for that matter), so you would have to insert data using OData rather than JMS. Alternatively you could talk to your account team about allowing use of the GG JMS adapter for your scenario – I have no doubt it would work, it just isn’t certified.

      John

  • Excellent blog, Thanks a lot John.

    We are in the process of implementing S&OP on premise with a 4TB HANA appliance. Though you answered my query, I wish to clarify our scenario once again. SAP is recommending DS but not SLT. I understand we are the first S&OP on-premise customer. My understanding though is that we need to set up SLT rather than DS; am I on the right path? Appreciate your expert advice.


    regards

    Fernando Sagay 



      So I'm not an S&OP expert and if SAP are advising you then you should probably listen to them and ask for an expert, especially if you believe you're a very early customer.

      However if you look at the Installation Guide then you’ll see that either Data Services or HCI is required, so I believe that it would be Data Services in your scenario. I can’t speak to why Data Services is required rather than SLT, but I assume it is because there are transformations involved.

      There’s a ton of content available here.

  • Hello John,

    very good blog! Just asking if the information provided in April last year is still up to date or do we have to consider advances/changes in that space?

    regards

    Oliver

  • Hi,

    Can anybody explain how to choose the preferred adapters to replicate SAP and non-SAP data for reporting purposes, and when SLT, BODS and DXC would each be suitable?

  • Great information! Thanks for taking the time to share. A few questions when you have a moment;

    – When transferring massive amounts of tables and data (over 2TB) from one SAP HANA system to another, what are your recommendations? Please consider that the data may or may not need transformation. I noticed that you mentioned doing the transformation before the data gets into HANA rather than within HANA, so it made me curious.
    – With regards to partitioning of data over multiple nodes, I imagine this is to retrieve the data in parallel, but considering network latency as you well put it, is it really faster than keeping all the partitions on the same physical system? I guess the best way is to set up a test and try it out based on your recommendations above 🙂 I am just curious whether you have seen, 100% of the time, that this distributed setup is much better than keeping all single-table data on the same single node.
    – Another question related to second point above, I realize this is data load related blog but would you recommend fiddling with where the tables reside after they are loaded if I expect to run massive joins constantly or let HANA decide on the optimization and placement of them?
    – Finally, has anything changed in HANA world since you posted these findings? Anything got better optimized in terms of data loads, nodes and partitioning?

    Thanks a bunch!

  • – If you need to move tables from HANA to HANA, a good option would be Smart Data Integration – check here: https://hcp.sap.com/content/dam/website/saphana/en_us/Technology%20Documents/SPS09/SAP%20HANA%20SPS%2009%20-%20SDI%20and%20SDQ.pdf
    – Yes, but it will aggregate before transferring data so in most cases there isn’t a lot of network usage. Given the backbone network is 10/40GBe, you get linear scalability in many cases. One example where you do not is if you join two huge fact tables. If you have small dimension tables you can replicate them to all nodes.
    – HANA will do table placement but it won’t repartition the tables for you. It’s common to create a load script and then try a few partition structures before you decide, dropping the tables between tests.
    – Yes – with Smart Data Access and Smart Data Integration, there are some new options.

    Hope this helps!

  • surprisingly only 5 votes on this excellent article…

    can you elaborate a bit on the difference between control file load with SQL (import from csv file ‘foo’ into “bar”) vs. control file? Yes I’ve read that control file is your recommendation – are the backend processes different(ly called) for the two different invocations?

  • We have a situation where data archiving is planned in ECC, and there are full loads that can happen in future. What impacts can we have on full & delta loads once ECC data is archived?
    How can we address these impacts?

  • Good blog! If I want to move lower-level data from HANA into another on-premise server or into the cloud for machine learning and AI applications, what would be the best practices? Data Services is not supported by our current infrastructure. Extracting data from HANA as flat CSV files and ingesting them into a non-HANA system has been recommended by some. We have about 50 million rows and 80 columns to move on a weekly basis. I want to understand the industry best practice for moving this amount of data from HANA to a non-SAP system. We are using the 102.05 version of HANA with SPS 10 Patch 5. Thank you in advance for your thoughtful suggestions.