Best Practices for Migrating to ASE – Should you go unicode?
Lets assume you have downloaded the SAP ASE (Adaptive Server Enterprise) Database Server engine and you are now considering for any new application work, this will be focused on the SAP ASE platform and its supporting products.
Congratulations and welcome!
Now we have the Database Engine decision done, we have some decisions to make regarding CPU, RAM, internal structures, etc. In this blog and these future blog series I will be focusing on answering your questions regarding setting up the SAP ASE Server to avoid minimal hassle and tuning in the future. We’ll make some assumptions and provide guidance for those of you who are new to this technology. This article will provide you with the understanding on why choosing the right characer set and sort order is important.
The original problem statement
If you were in charge of a new database application, building a Custom application (referred to as a non-SAP application) and given the fact that newer SAP Applications require unicode, would you want to design something that is unicode by default?
For those of you that are new to unicode
Unicode is a universal encoded character set that enables information from any language to be stored in your Database Server using a single character set. As the name implies, Unicode provides a unique code value for every character. This is regardless of the platform, program, or language. Unicode is an international character set that supports over 650 of the world’s languages, such as Japanese, Chinese, Russian, French, and German. Unicode allows you to mix different languages from different language groups in the same server, no matter what the platform.
The task of designing an application to work outside its country of origin can seem daunting. Often, programmers think that internationalizing means hard-coding dependencies based on cultural and linguistic conventions for just one country. A better approach is to write an internationalized application: that is, one that examines the local computing environment to determine what language to use and loads files containing language-specific information at runtime. When you use an internationalized application, a single application can be deployed in all countries. This has several advantages:
- You write and maintain one application.
- The application can be deployed, without change, in new countries as needed. You need only supply the correct localization files.
- All sites can expect standard features and behavior.
Technical drill down
SAP Sybase ASE supported Unicode in the form of three datatypes: unichar, univarchar, and unitext. These datatypes store data in the UTF-16 encoding of Unicode. The character set model used by ASE is based on a single, configurable, server-wide character set. All data stored in Adaptive Server, using any of the character datatypes (char, varchar, nchar, nvarchar, and text), is interpreted as being in this character set. Sort orders are defined using this character set, as are language modules—collections of server messages translated into local languages.
The character set model used by Adaptive Server is based on a single, configurable, server-wide character set. All data stored in Adaptive Server, using any of the character datatypes is interpreted as being in this character set. Sort orders are defined using this character set, as are language modules—collections of server messages translated into local languages.
When you initially install or configure your server, you must specify a default character set for the ASE Server. The default character set is the character set in which the server stores and manipulates data. Each ASE server can have only one default character set.
By default, the ASE installation tool assumes that the native character set of the platform operating system is the server’s default character set. If you want to go unicode, this is the spot in the installation where you need to change it. For example, if you are installing the server on IBM RS/6000 running AIX, and you select one of the Western European languages to install, the installation tool assumes the default character set to be ISO 8859-1. If you are installing a Unicode server, select UTF–8 as your default character set. The key is to look for the “UTF” in the drop down list of choices.
Lets assume you are going with UTF – 8 , we are not done just yet, we need to choose a “sort” order. Why is sort order important? Consider the following: if you have users using different languages that require different sort orders, for example French and Spanish, then you must select one of the sort orders as the default. If you pick, for example, a French sort order, your Spanish users will not see the ch and ll double characters sorted as they would expect. Sort order associated with Unicode data is completely independent of the sort order associated with traditional character data. All relational expressions involving the Unicode datatypes are performed using the Unicode sort order. This includes mixed-mode expressions involving Unicode and non-Unicode data.
The sort order named binary is the most efficient one for unichar data (UTF-16), and is thus the default. This order is based on the Unicode scalar value, meaning that all 32-bit surrogate pairs are placed after all 16-bit Unicode values. The sort order named utf8bin is designed to match the order of the default (most efficient) binary order for UTF-8 char data, namely bin_utf8. The recommended matching combinations are thus binary for unichar and binary for UTF-8 char, or utf8bin for unichar and bin_utf8 for UTF-8 char. The former favors unichar efficiency, while the latter favors char efficiency. Avoid using utf8bin for UTF-8 char, since it is equivalent to bin_utf8 but less efficient.
While the decision to go unicode or to not go unicode is yours to make, if portability and flexibility are important factors for your future development, choose the unicode platform.
Here is what the Field has to say about the unicode debate ………
“When you have two different character sets on the different databases, there will be a character set conversion when synchronization occurs, as is currently the case. Unicode requires 2 bytes to store it characters which means that it can store a wider range of characters. This will not become an issue with your databases until you try to store a character that is not available in your existing character sets, which is an issue that can occur with your existing character sets. Increasing the field lengths of the storage will not help, since each character has a specified storage length for a character set.”
“I would recommend that you change the database that captures new records last, since this will be the database that will allow characters outside of the existing character set range into the system. So if new data is created/captured on the Ultralite DBs, upgrade those last. If the SQL Server captures the new data, then do that last. But, errors will only be encountered when data is attempted to be storage that are characters not defined in the existing character sets being used.”
“From my perspective, I see no reason why one should not start with the assumption that the character set is iso_1 when developing a new system, then figure out if there are reasons why it should not be. Both Replication Server and IQ can handle ISO_1 character sets, so I would also set them up the same way.”
“The issue is more likely to figure out the obstacles introduced by Unicode utilization. I am personally a big supporter of the nchar / unichar family of data types together with a single byte encoding for the char data.”
“My expectation is that – especially fueled by the increased and increasing relevance of the Asian markets – we will commonly deal with Unicode data. By separating technical (ASCII / single byte) character values from real world (Unicode / multibyte) character values, we’ll get the best benefit out of each variant if we’re ready to invest in this distinction. This is likely to become an issue if systems incapable of this (like UL or IQ) join the landscape. UL deployments can be channeled since they are connected to the rest of the infrastructure via a tight channel (MobiLink). IQ should be more seamlessly integrated. As soon as the advantage of the distinction (leaner storage footprint, especially for indexes; fewer comparison / sort order issues) is outweighed by the disadvantages (distinction effort, interface overhead) we may end up with systems generally using multibyte encoding.”
Help is click away……
Follow the “Database Services Content Library” to access the entire series of Database Services Blogs and join the conversation on Twitter @SAPServices
Need help on where to go from here?…Learn more about SAP Database Services here
This author wishes to thank the following people for answering for the Field:
Diether De Coninck, Volker Stoeffler, Tom Slee, Edward Downer
Reference materials for this talk were found in the Systems Administration Guide for the ASE Server version 15.X, volume 1.