Skip to Content
Author's profile photo Jonathan Wesley-James

Best Practices for Migrating to ASE – Should you go unicode?

Lets assume you have downloaded the SAP ASE (Adaptive Server Enterprise) Database Server engine and you are now considering for any new application work, this will be focused on the SAP ASE platform and its supporting products. 

Congratulations and welcome! 

Now we have the Database Engine decision done, we have some decisions to make regarding CPU, RAM, internal structures, etc.  In this blog and these future blog series I will be focusing on answering your questions regarding setting up the SAP ASE Server to avoid minimal hassle and tuning in the future.  We’ll make some assumptions and provide guidance for those of you who are new to this technology.  This article will provide you with the understanding on why choosing the right characer set and sort order is important.

The original problem statement

If you were in charge of a new database application, building a Custom application (referred to as a non-SAP application) and given the fact that newer SAP Applications require unicode, would you want to design something that is unicode by default?

For those of you that are new to unicode

Unicode is a universal encoded character set that enables information from any language to be stored in your Database Server using a single character set.  As the name implies, Unicode provides a unique code value for every character.  This is regardless of the platform, program, or language.  Unicode is an international character set that supports over 650 of the world’s languages, such as Japanese, Chinese, Russian, French, and German. Unicode allows you to mix different languages from different language groups in the same server, no matter what the platform.  

The advantages

The task of designing an application to work outside its country of origin can seem daunting. Often, programmers think that internationalizing means hard-coding dependencies based on cultural and linguistic conventions for just one country.  A better approach is to write an internationalized application: that is, one that examines the local computing environment to determine what language to use and loads files containing language-specific information at runtime.  When you use an internationalized application, a single application can be deployed in all countries. This has several advantages:

  • You write and maintain one application.
  • The application can be deployed, without change, in new countries as needed. You need only supply the correct localization files.
  • All sites can expect standard features and behavior.

Technical drill down

SAP Sybase ASE supported Unicode in the form of three datatypes: unichar, univarchar, and unitext. These datatypes store data in the UTF-16 encoding of Unicode.   The character set model used by ASE is based on a single, configurable, server-wide character set. All data stored in Adaptive Server, using any of the character datatypes (char, varchar, nchar, nvarchar, and text), is interpreted as being in this character set. Sort orders are defined using this character set, as are language modules—collections of server messages translated into local languages.

The character set model used by Adaptive Server is based on a single, configurable, server-wide character set. All data stored in Adaptive Server, using any of the character datatypes is interpreted as being in this character set.   Sort orders are defined using this character set, as are language modules—collections of server messages translated into local languages.

When you initially install or configure your server, you must specify a default character set for the ASE Server. The default character set is the character set in which the server stores and manipulates data.  Each ASE server can have only one default character set.

By default, the ASE  installation tool assumes that the native character set of the platform operating system is the server’s default character set.   If you want to go unicode, this is the spot in the installation where you need to change it.   For example, if you are installing the server on IBM RS/6000 running AIX, and you select one of the Western European languages to install, the installation tool assumes the default character set to be ISO 8859-1.  If you are installing a Unicode server, select UTF–8 as your default character set.  The key is to look for the “UTF” in the drop down list of choices.

Lets assume you are going with UTF – 8 , we are not done just yet, we need to choose a “sort” order.   Why is sort order important?  Consider the following:   if you have users using different languages that require different sort orders, for example French and Spanish, then you must select one of the sort orders as the default. If you pick, for example, a French sort order, your Spanish users will not see the ch and ll double characters sorted as they would expect.   Sort order associated with Unicode data is completely independent of the sort order associated with traditional character data. All relational expressions involving the Unicode datatypes are performed using the Unicode sort order. This includes mixed-mode expressions involving Unicode and non-Unicode data.

The sort order named binary is the most efficient one for unichar data (UTF-16), and is thus the default. This order is based on the Unicode scalar value, meaning that all 32-bit surrogate pairs are placed after all 16-bit Unicode values.  The sort order named utf8bin is designed to match the order of the default (most efficient) binary order for UTF-8 char data, namely bin_utf8. The recommended matching combinations are thus binary for unichar and binary for UTF-8 char, or utf8bin for unichar and bin_utf8 for UTF-8 char. The former favors unichar efficiency, while the latter favors char efficiency.   Avoid using utf8bin for UTF-8 char, since it is equivalent to bin_utf8 but less efficient.

While the decision to go unicode or to not go unicode is yours to make, if portability and flexibility are important factors for your future development, choose the unicode platform.

Here is what the Field has to say about the unicode debate ………

“When you have two different character sets on the different databases, there will be a character set conversion  when synchronization occurs, as is currently the case.  Unicode requires 2 bytes to store it characters which means that it can store a wider range of characters.  This will not become an issue with your databases until you try to store a character that is not available in your existing character sets, which is an issue that can occur with your existing character sets.  Increasing the field lengths of the storage will not help, since each character has a specified storage length for a character set.”

“I would recommend that you change the database that captures new records last, since this will be the database that will allow characters outside of the existing character set range into the system.  So if new data is created/captured on the Ultralite DBs, upgrade those last.  If the SQL Server captures the new data, then do that last.  But, errors will only be encountered when data is attempted to be storage that are characters not defined in the existing character sets being used.”

“From my perspective, I see no reason why one should not start with the assumption that the character set is iso_1 when developing a new system, then figure out if there are reasons why it should not be.  Both Replication Server and IQ can handle ISO_1 character sets, so I would also set them up the same way.”

“The issue is more likely to figure out the obstacles introduced by Unicode utilization.  I am personally a big supporter of the nchar / unichar family of data types together with a single byte encoding for the char data.”

“My expectation is that – especially fueled by the increased and increasing relevance of the Asian markets – we will commonly deal with Unicode data. By separating technical (ASCII / single byte) character values from real world (Unicode / multibyte) character values, we’ll get the best benefit out of each variant if we’re ready to invest in this distinction. This is likely to become an issue if systems incapable of this (like UL or IQ) join the landscape. UL deployments can be channeled since they are connected to the rest of the infrastructure via a tight channel (MobiLink). IQ should be more seamlessly integrated. As soon as the advantage of the distinction (leaner storage footprint, especially for indexes; fewer comparison / sort order issues) is outweighed by the disadvantages (distinction effort, interface overhead) we may end up with systems generally using multibyte encoding.”

Help is click away……

Follow the “Database Services Content Library” to access the entire series of Database Services Blogs and join the conversation on Twitter @SAPServices
Need help on where to go from here?…Learn more about SAP Database Services here

This author wishes to thank the following people for answering for the Field:
Diether De Coninck, Volker Stoeffler, Tom Slee, Edward Downer
Reference materials for this talk were found in the Systems Administration Guide for the ASE Server version 15.X, volume 1.

Assigned Tags

      You must be Logged on to comment or reply to a post.
      Author's profile photo Former Member
      Former Member

      Thank you for the informative blog. My understanding is that SAP Sybase ASE supports Unicode encoding UTF-8 as a default charset and Unicode encoding UTF-16 as data types (unichar, univarchar, unitext); on a server configured with UTF-8 as the default charset, data storage falls into the following camps:-

      • All column data of character data types,  char, varchar, nchar, nvarchar, and text, are stored in UTF-8 encoding
        • it uses 1 to 3 bytes per character (4-byte encoding exists but rarely used)
          • 1-byte UTF-8 encoding overlaps with ASCII's first 128 code points
          • 2-byte UTF-8 encoding supports Latin etc.
          • 3-byte UTF-8 encoding supports Chinese Japaanese Korean
        • it is supported by server's conversion between UTF-8 and client character sets
      • The remaining column data of character data types, unichar, univarch and unitext, are stored in UTF-16 encoding
        • it uses 2 bytes or 4 bytes per character
      • server/client exchange of UTF-16 unicode data does not involve any conversion

      You have covered the rationale and advantages for customers to configure servers to use UTF-8 as the server default character set to meet the multi-national business requirements.  Do you have any insights or quotes from the experience in the Field to share with us regarding the costs of switching from a server default native character set to UTF-8?

      For example, I would like to hear some comments on the following potential issues:-

      1. Schema changes are necessary to switch char/varchar/nchar/nvarchar to unichar/univarchar; some application code changes are necessary to handle UTF-16 unicode data due to increased data lengths and absence of server-side conversion;
      2. UTF-8 and UTF-16 encoding requies more disk and memory storage to hold the multi-byte characters;
      3. Data/index rows could push near or spill over page boundary, consequently, it would be sensible to increase server logical page size from 2K to higher;
      4. Process overhead may go up affecting performance genarally e.g.
        • Index maintenance cost would increase even with binary sort order;
        • queries using indexes based on mixing of UTF-8 and UTF-16 encoded columns would result in the optimizer ignoring UTF-8 encoding based indexes (as the UTF-8 encoded data get cast to UTF-16 encoding etc.)
      Author's profile photo Jonathan Wesley-James
      Jonathan Wesley-James
      Blog Post Author

      Thanks Raymond for that extensive reply to my post.  Some excellent points.

      Yes - I agree in some cases where we are using char fields that are near the maximum and we switch to nchar we need to increase the length.  Perhaps now application code changes so much as server-side changes to the physical data model.

      Yes any UTF encoding does require more disk space.  That certainly makes sense.

      From my field experience, the switching of lets say a 2K page to a 4 or 8K page (your point 3) did not lend itself to degrading performance, in some cases the performance was marginally higher and this was most likely due to reducing the physical IO reads.

      I did not see any significant performance going to unicode in the index structures.  Perhaps others can comment.

      While your discussion points seem to indicate the decision is made on an existing and operational ASE server, I believe the unicode decision is perhaps a decision that one would make at the initial stage of designing a system from here on in.