In life, there are simple rules to live by. For example, Jim Croce told us that “You don’t mess around with Jim”. In the database world there are similar rules like “you never issue an UPDATE or DELETE statement without a WHERE clause. In this blog post, I’m hopefully going to convince you to add another rule – “Always include the ERROR LOG clause with your IMPORT FROM command”.
I’m currently working on a Big Data project where I’m importing the results from a generated data file based on Wikipedia page count data. After doing the import operation, I did a SELECT COUNT(*) on the resulting table and then got to wondering – was I hallucinating or am I missing almost 2 million rows of data?
It turns out that I was missing more than 2 million rows of data – yikes! So what’s going on? When I ran the IMPORT FROM command, it reported that it took 45 seconds, affected 0 rows (that alone is a bit disturbing) and that there were no errors. So, again – what’s going on?
Since the data I’m getting from Wikipedia could be suspect, my first inclination is that it had to be a problem with using a pipe “|” symbol as a delimiter. My original file actually used the 0x01 character as a field delimiter and I has used the following sed Linux command to change them to the pipe character:
sed “s/\x01/|/g” 000000 > 000000.csv
I then used the wc (word count) command to count the number of rows in both files to compare the results.
hana:/wiki-data/year=2013/month=05 # wc –lines 000000*
As you can see, the line counts were identical. So, I opened up the help topic for the command at http://help.sap.com/hana/html/sql_import_from.html and noticed that there is a clause called “ERROR LOG”, so why not give it a try. I went ahead and added the following clause:
ERROR LOG ‘/wiki-data/import.err’
After running the IMPORT FROM command again, I got no errors, but what was weird was I also had no import.err file in my /wiki-data directory. This I’ve seen before, so I issued the following Linux command to make sure the SAP HANA database engine can write data to this directory:
chmod 777 /wiki-data
Lo and behold, I ended up with a 55 meg import.err file! It turns out that there were two things preventing the load of all the data. First, one of my column definitions was not large enough to support the longest of the Wikipedia page titles which was 1023 – more than double of the VARCHAR(500) that I had defined. So, I dropped the table and recreated it with a column length of 2000 to be on the safe side. I then came across a new – numeric overflow. It turns out I needed to use a BIGINT data type for the number of bytes download for an hour for pages. After making that correction, I now got the COUNT(*) to match the line count for the three CSV files that I imported.
I was lucky and noticed that the COUNT result didn’t seem right and tracked it down, but I’m guessing that most people that use the IMPORT FROM command aren’t using the optional ERROR LOG clause. So – back to the new rule.
- Create a directory that you will use for your error file and make sure the HANA database engine has rights to the file using the chmod 777 <directory name> command.
- Just because the IMPORT FROM command reports no errors when running it from SAP HANA Studio, doesn’t mean there were no errors. Always include the ERROR LOG clause and then check to see that it’s a zero byte file. Otherwise, open it up and examine the records.
- Tell your friends and colleagues about this rule. 🙂
So in the spirit of Jim Croce, you can check it out my updated lyrics and sing along:
“You don’t tug on Superman’s cape”
“You don’t spit into the wind”
“You don’t pull the mask off that old Lone Ranger”
And you don’t forget the ERROR LOG clause for IMPORT FROM command.
Again, data is a precious thing to waste, so please pass this on.