Hana VARCHAR/NVARCHAR confusion
Yesterday I had a discussion regarding VARCHAR and NVARCHAR datatypes and their handling in Hana. At the surface everything is clear: Use VARCHAR for ASCII (English) characters only, everything else must use NVARCHAR. The question was what happens when a VARCHAR is used by accident?
Here an example: I create a table, insert two rows, select from it and all appears to be perfectly fine.
The first indication that there are actually differences is when I add another city that is 8 chars long. Here I get an error:
The error is on CITYNAME1, the varchar(9) column. Reason is obvious, as the Hana manual states the VARCHAR datatype is for 7-bit ASCII chars only but our text has some non-ASCII chars.
When I try to insert the 8 character long string ‘Shēnzhèn’ its bytes(!) get inserted and due to the two double byte characters, the byte array is 10 bytes long. Does not fit into a VARCHAR(9).
The UTF-8 representation of that city name is
S | h | ē | n | z | h | è | n 53 | 68 | c4 93 | 6e | 7a | 68 | c3 a8 | 6e
That raises two interesting questions for the Běijīng case:
- Why does the first insert work although we are inserting non-ASCII chars into to varchar? Shouldn’t it raise an error like “Hey, that’s no ASCII string!”?
- Why does the select return the correct text although it is read from a data type that cannot handle it?
For the first I would argue yes, it should. Currently the approach is that a text is sent to the database and its bytes are inserted. There is neither any logic nor tests being applied. The string is simply treated as an array of binary numbers.
And that actually answers the second question as well. The binary data is read from the database as a stream of bytes and most(!) tools use the default UTF-8 converter to show the text, thus the text appears to be correct. But there is no guarantee for that. Other tools will show the text as a list of ASCII characters and return “BÄ ijÄ«ng” instead of “Běijīng”.
In other words, when inserting the the text “Běijīng” into a VARCHAR and selecting the value with the same tool, the same error is made twice – convert text to binary and binary to text with the same conversion – and hence they cancel each other out, all appears to be fine. But that is pure luck. Another tool might use another default character set, especially Windows native tools.
You should really store ASCII chars in VARCHARs only and use NVARCHAR for all other.
It does not help that all Hana internal functions like length(), substring(), etc are using the UTF-8 values also. For the length function the Hana manual explicitly states so, for the substring it does not. As UTF-8 and ASCII are identical for all characters from 0 to 127 (7bit), no special handling is needed if VARCHAR contains just 7-bit ASCII values. But if it does not, we are in an undefined state.
As Hana does not check – you must!