Most hardcore HANA nuts find themselves at a Linux shell console a lot of the time. Data quality issues are endemic when dealing with data from external sources, and we often end up with 50-100GB data files on a HANA appliance that need cleaning up. Because HANA runs on Linux, it's often convenient to do this directly on the appliance.

So I thought I'd reach out to the world's leading HANA experts, the HANA Distinguished Engineers, to see what their hints and tips are. But first, a few of my own.

Get the files on the HANA appliance and fix them there

It’s not supported in a productive instance, but for development, test, pilots and PoCs we often cheat. So copy the files onto the HANA appliance data drive or log drive and crunch them there. Just make sure you don’t run out of disk space, because bad things happen if you do. You can take advantage of HANA’s fast local disks and immense CPU/RAM power.
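Before copying anything, it is worth checking how much headroom you have. A quick sketch, assuming the usual /hana/data and /hana/log mount points (adjust for your appliance):

df -h /hana/data /hana/log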

HANA only supports tab, comma, semi-colon and pipe delimited files

It’s a bit inconvenient but HANA doesn’t support other delimiters, even when you specify the delimiter in the load control file. So if you have files that are delimited some other way, you need to convert.

Think very carefully about what tool you use


If you want to do a simple conversion of Ctrl-A (Start of Heading, \x01) delimited files to CSV, there are several ways to do it:

tr '\01' ',' < input.dat > output.csv

sed -e 's/\x01/,/g' input.dat > output.csv

awk '{gsub("\x01", ",")}1' input.dat > output.csv


It might surprise you that the performance is very different, even though all three run in a single thread. In my test, awk runs in 22 seconds, tr in 32 seconds and sed in 57 seconds.
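If you want to verify the numbers on your own data, prefixing each variant with time is enough. A rough sketch (the results will vary with file size and hardware):

time tr '\01' ',' < input.dat > output.csv
time sed -e 's/\x01/,/g' input.dat > output.csv
time awk '{gsub("\x01", ",")}1' input.dat > output.csv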

One interesting trick is not to leave all of the work to a single sed with /g, which is expensive because it has to match every occurrence on each line. For example, splitting the substitution across two sed processes:

sed -e 's/,/\01/' input.dat | sed -e 's/,/\01/g' >/dev/null

actually runs 15% faster.

Useless use of cat

cat is the friend of the lazy UNIX admin, and in a pipeline it generally just adds overhead. For example, reusing the tr conversion from above, the following two commands do the same thing, but the first pays for an extra process and an extra copy of every byte through a pipe:
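cat input.dat | tr '\01' ',' > output.csv

tr '\01' ',' < input.dat > output.csv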

There is even a page dedicated to the useless use of cat. Awesome.

Learn awk

My biggest piece of advice is to learn how to use awk, especially if you are processing fixed-format files into CSV. Here is an example of a small shell script that uses awk to turn a fixed-width file (a time and a text element per record, with the date taken from the file name) into a CSV. This runs incredibly quickly!

#!/bin/bash
# Convert a fixed-width file into CSV. The datestamp comes from the file name:
# ${1:8:8} takes 8 characters starting at offset 8. Each record holds HHMMSSmmm
# plus a 10-character text field. FIELDWIDTHS needs GNU awk (gawk).
awk -v DATESTAMP="$(date -d "${1:8:8}" +%F)" 'BEGIN{FIELDWIDTHS="2 2 2 3 10"}{
    HOUR=$1
    MIN=$2
    SEC=$3
    MSEC=$4
    TEXT=$5
    printf("%s %s:%s:%s.%s,%s\n", DATESTAMP, HOUR, MIN, SEC, MSEC, TEXT)
}' "$1" > "$1.csv"
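To process a single file, assuming a file name whose 8-digit datestamp starts at the ninth character (say a hypothetical sensors_20140101.fixedfile):

./awk.sh sensors_20140101.fixedfile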

Download a few tools to help you on your way

I put a short document together that details my favorite HANA tools. I hope you enjoy!

Parallelize

If you use pigz, you will be parallelizing already. One of my favorite tricks is to write awk scripts and then run them on 20 files at once. Because these scripts are usually CPU bound, each will take up a single thread. With the example above we would do

ls *.fixedfile | parallel ./awk.sh

Go and make a cup of tea, and by the time you are back all your files will be converted.
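If the raw files arrive gzipped, pigz slots straight into the same pattern. Here is a sketch, assuming .dat.gz inputs and the tr conversion from earlier, with parallel's {.} placeholder stripping the .gz from the output name:

ls *.dat.gz | parallel "pigz -dc {} | tr '\01' ',' > {.}.csv"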

Go and load your data into HANA in parallel

I put together a Best Practices for HANA Data Loads document a long time ago. I think it’s still pretty relevant today, and will help you get data into HANA fast.

Final Words

None of this is good practice in a real productive HANA environment, where you will probably have an ETL tool like SAP Data Services or a streaming tool like SAP Event Stream Processor. But when it comes to throwing demos together fast, these tips may get you out of trouble. I hope you enjoy!


9 Comments


  1. Wenchieh Yen

    – one should consider using the "in-place edit" capability of sed, perl or whatever, if disk space is a factor.

    – to my surprise, sed was much slower than awk. But it could just be a question of whether the file was already in the cache, so a warm-up first run (or redirecting the output to /dev/null) is your friend when testing.

    – one extra thing: not all tools with the same name behave the same way (e.g. GNU sed vs. BSD sed). Be careful.

    to John:

    – not too sure what you want to achieve (generally) with sed -e 's/,/\01/' input.dat | sed -e 's/,/\01/g' >/dev/null. The first sed just replaces the first occurrence in every line and lets the second sed (with /g) process the remaining occurrences; a single sed with /g will just work. But if you really need it: sed -e 's/,/\01/;s/,/\01/g' < input.dat > /dev/null.

    cheers,

    1. John Appleby Post author

      Sure you’re right, but disk space is rarely a factor on a HANA system, which is very handy.

      I don’t know why but tr and awk are both usually awesomely fast for simple substitutions on very large files. sed, less so.

      Yes, but using a pipe uses 2 CPU threads. This means that having 2 seds piping into each other and matching is actually faster than using /g. This is counter-intuitive, but true, on a multi-CPU machine like HANA. Obviously the CPU cost overall is higher.

      Hope this helps.

      1. Wenchieh Yen

        Sure. If disk space were the real concern, one would have already added gzip/gunzip to the pipe chain. And even if it is not a concern, in-place editing still saves you from hogging disk-write bandwidth and CPU cycles, and benefits from the quick random access of SSDs (if used).

        This redundant piping being faster is counter-intuitive, but the real reason is not what you described.

        Technically, you will see two processes starting and running at the same time, but it is not that they are "piping into each other". A pipe is a FIFO (first-in, first-out): the communication between the two processes flows sequentially from left to right through a buffer. You don't achieve parallelism just by using a pipe.

        It is factually faster in this case and I don't doubt the outcome. I just don't think that redundant piping will be faster in general; otherwise we could easily have bought ourselves more legitimate coffee time by

        – replacing n delimiters in a line with n piped seds without /g, or

        – letting two piped seds with /g handle the odd and even lines separately.

        The faster processing could be due to better CPU-level shared cache behaviour (two processes executing on two cores of the same CPU node) or other reasons that would need closer reading of the manuals. But I should stop here and not drift from your original well-intended post:

        – Don’t get beaten by big data file size before getting to big data itself.

        – Fine tools are already out there. Go back to basics and don't reinvent the wheel.

        – Don't fetishize your tools. Get the job done quickly and go home!

        cheers,

        1. John Appleby Post author

          🙂

          I actually did more research because I was fascinated by this. I’m aware of how UNIX pipes work and you can pipe multiple sed processes together. You get a non-linear performance increase for each sed process you add up to the number of cores you have in your CPU. On multi-socket systems it is less predictable due to processor affinity.

          But the key point is this: if you have multiple pieces of work to do on a file, you can sort of parallelize those tasks by using UNIX pipes. You can think of it like a manufacturing plant with division of labor: no single piece of work happens in parallel, but having multiple workers each doing one chunk increases throughput.

          “Get the job done quickly and go home” – yes!

          1. Wenchieh Yen

            btw, with a cup of "tee" (man tee), proper shell redirection and a named pipe, you can achieve some truer parallelism by having e.g. two separate seds working on two parts of a file (even/odd lines) at the same time.

  2. Jake Echanove

    John,

    Thanks for the list. I had a PoC where we were given a CSV from Teradata, and it had white space and question marks that were causing the upload to fail. I came across sed and it worked very well for parsing through the large file. Not too difficult to learn; a few tries and it did what I needed.

    This command removed the white space in place: sed -i -e 's/ *//g' <file>.csv

    I’ll give awk a go.

    Thanks again for the very helpful blogs.

    -Jake

    1. John Appleby Post author

      Yeah Teradata CSV exports are a pig, same with Netezza. I call their format What-The-UTF.

      Yes that’s a good substitution for TD/Netezza. I don’t think awk would be any faster here. You could also try:

      iconv -c -f utf-8 -t ascii <file>.csv

      strings <file>.csv

      I like optimizing this stuff because my time is important to me 🙂

      John

  3. Frank Bechmann

    Hi John,

    Interesting stuff here, thanks for the info. When already cherry-picking from the Unix tool chain, you can also add make and m4 to the picture.

    Make is nice for orchestrating the chain of processing steps (e.g. uncompressing, running iconv/sed, creating the control and SQL files, and eventually performing the upload using hdbsql).

    M4 is useful for creating the control and SQL files from templates, which is a huge time saver when you have to upload many files.

    I did this just now for a tightly scheduled project with 40+ files and 1+ TB of data, and the experience was only occasionally painful (as with every make/m4 project before it), but already came with an overall efficiency gain. Next time I'll start from the templates and it will get better from there.

    Thanks again and best regards, Frank

