Magic Numbers : A Solution to Foreign Characters i...

shayanmajumder

Introduction

When working with SAP CPI, a consultant often has to send various types of files to an external SFTP server. Quite often, however, simply setting the right file extension is not enough: even with the proper extension, the receiving software may not interpret the file correctly. In such cases it becomes crucial to mark the encoding explicitly in the file before it is sent to the SFTP server.

A Byte Order Mark (BOM) is a short sequence of bytes at the start of a text file that signals the file's encoding and byte order. It helps software determine the endianness (byte order) of multibyte character encodings such as UTF-16 and UTF-32. BOMs are sometimes referred to as Magic Numbers, which are specific bytes at the beginning of a file that identify it as a certain file type. Magic numbers are also known as file signatures and can help the system identify files even without a file extension.

Advantages of using a byte order mark include:

Identifying the encoding: the BOM helps identify the character encoding of a text file. Unicode has several encoding forms, such as UTF-8, UTF-16 and UTF-32, and the BOM distinguishes between them, which is particularly useful when working with files in more than one of these forms.
Indicating the byte order: for encodings such as UTF-16, where byte order (endianness) matters, the BOM indicates whether the most significant or the least significant byte of each code unit comes first.
Improving compatibility: in environments where different systems or software might interpret the same text file differently, a BOM makes it more likely that the receiving program decodes the file correctly.
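
To illustrate how software can use a BOM to identify an encoding, here is a minimal Groovy sketch that inspects the first bytes of a payload. The helper name `detectBomEncoding` is hypothetical and not part of any SAP API; the signatures are the standard BOM byte sequences for UTF-8, UTF-16 and UTF-32.

```groovy
// Hypothetical helper: returns the encoding suggested by a leading BOM, or null if none is found.
String detectBomEncoding(byte[] data) {
    // Longer signatures come first: FF FE 00 00 (UTF-32LE) starts with FF FE (UTF-16LE).
    def signatures = [
        'UTF-32BE': [0x00, 0x00, 0xFE, 0xFF],
        'UTF-32LE': [0xFF, 0xFE, 0x00, 0x00],
        'UTF-8'   : [0xEF, 0xBB, 0xBF],
        'UTF-16BE': [0xFE, 0xFF],
        'UTF-16LE': [0xFF, 0xFE]
    ]
    def match = signatures.find { name, sig ->
        data.length >= sig.size() &&
            (0..<sig.size()).every { i -> (data[i] & 0xFF) == sig[i] }
    }
    return match?.key
}

// Usage: build sample payloads by encoding U+FEFF followed by some text.
byte[] utf8Sample = "\uFEFFName".getBytes("UTF-8")
byte[] utf16Sample = "\uFEFFName".getBytes("UTF-16LE")
byte[] plainSample = "Name".getBytes("UTF-8")

println detectBomEncoding(utf8Sample)
println detectBomEncoding(utf16Sample)
println detectBomEncoding(plainSample)
```

The order of the checks matters because the UTF-32 little-endian signature begins with the same two bytes as the UTF-16 little-endian one.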

Different types of byte order marks

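Encoding	Representation (hexadecimal)	Unicode String Format
UTF-8	EF BB BF	\uFEFF
UTF-16, big-endian	FE FF	\uFEFF
UTF-16, little-endian	FF FE	\uFEFF
UTF-32, big-endian	00 00 FE FF	\uFEFF
UTF-32, little-endian	FF FE 00 00	\uFEFF
UTF-7	2B 2F 76 followed by 38, 39, 2B or 2F	+/v8, +/v9, +/v+ or +/v/

In every UTF-8, UTF-16 and UTF-32 case the BOM is the same Unicode character, U+FEFF. Only its byte representation differs, depending on the encoding used to write the file.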
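
As a rough illustration of the hexadecimal values in the table, the following Groovy sketch encodes the single character U+FEFF with three charsets that every Java runtime supports and prints the resulting bytes. The helper name `hex` is an assumption made for this example.

```groovy
// Formats a byte array as space-separated upper-case hex pairs.
String hex(byte[] bytes) {
    bytes.collect { String.format('%02X', it & 0xFF) }.join(' ')
}

// The BOM is one character; the charset decides how it is written as bytes.
String bom = "\uFEFF"

def utf8Hex    = hex(bom.getBytes('UTF-8'))
def utf16beHex = hex(bom.getBytes('UTF-16BE'))
def utf16leHex = hex(bom.getBytes('UTF-16LE'))

println "UTF-8    : ${utf8Hex}"      // EF BB BF
println "UTF-16BE : ${utf16beHex}"   // FE FF
println "UTF-16LE : ${utf16leHex}"   // FF FE
```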

A comprehensive list of all file magic numbers can be found here.

Using BOM in SAP CPI Groovy Script

Suppose we need to send a simple CSV file containing some Chinese characters to an SFTP server. If we do not add a BOM and open the file in Excel, the Chinese characters are displayed as garbled text.

However, when we prepend a BOM using the following code, Excel displays the characters correctly:

def csvString = "Name,Age,City,ChineseText\nJohn,30,Beijing,你好世界"
// Prepend U+FEFF; it becomes the UTF-8 BOM (EF BB BF) once the body is written as UTF-8
csvString = "\uFEFF" + csvString
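
The snippet above only changes the string; the bytes EF BB BF end up in the file only if that string is then written as UTF-8. The following is a minimal sketch of making this explicit, assuming the payload is passed on as a byte array. `toUtf8WithBom` is a hypothetical helper name, and how the result is placed on the message (for example with `message.setBody(...)` in a CPI script) depends on the iFlow.

```groovy
import java.nio.charset.StandardCharsets

// Hypothetical helper: returns the text as UTF-8 bytes with exactly one leading BOM.
byte[] toUtf8WithBom(String text) {
    String bom = "\uFEFF"
    // Avoid a double BOM if an earlier step already added one.
    String body = text.startsWith(bom) ? text : bom + text
    return body.getBytes(StandardCharsets.UTF_8)
}

def csv = "Name,Age,City,ChineseText\nJohn,30,Beijing,你好世界"
byte[] payload = toUtf8WithBom(csv)

// The first three bytes are the UTF-8 BOM.
println payload[0..2].collect { String.format('%02X', it & 0xFF) }.join(' ')
```

Working with bytes rather than a string removes any dependence on the platform default charset when the file is written.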

Hence we can see how explicitly adding a byte order mark helps in dealing with foreign characters and encodings that the receiving software would not otherwise detect.

Use of directional markers in case of bidirectional text

Certain integrations involve bidirectional text, that is, a mix of left-to-right and right-to-left scripts. These may require Unicode directional markers that force a consistent display direction. A good example of such an integration is the Hilan interface, which mixes Hebrew and English text, so that viewing the data correctly becomes very difficult.

Direction	Unicode Character	Description
LRM (Left-to-Right Mark)	\u200E	This marker signals left-to-right text.
RLM (Right-to-Left Mark)	\u200F	This marker signals right-to-left text.
PDF (Pop Directional Formatting)	\u202C	This marker terminates an embedding or override by popping the last direction setting.
LRE (Left-to-Right Embedding)	\u202A	This marker indicates that the following text should be treated as an embedded left-to-right block.
RLE (Right-to-Left Embedding)	\u202B	This marker indicates that the following text should be treated as an embedded right-to-left block.
LRO (Left-to-Right Override)	\u202D	This marker enforces left-to-right direction for the enclosed text, overriding the default right-to-left direction.
RLO (Right-to-Left Override)	\u202E	This marker enforces right-to-left direction for the enclosed text, overriding the default left-to-right direction.
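
As an illustration of how these characters could be applied in a Groovy script, the sketch below wraps an English segment in a left-to-right embedding before it is placed next to Hebrew text. The helper name `embedLtr` and the sample values are assumptions for this example, and whether the receiving viewer honours the markers depends on that software.

```groovy
// Hypothetical helper: wraps text in LRE (U+202A) ... PDF (U+202C)
// so that it is laid out as a left-to-right block.
String embedLtr(String text) {
    "\u202A" + text + "\u202C"
}

def hebrewName = "שלום"
def employeeId = "ID-4711 (A)"

// Mixed-direction CSV line: the ID keeps its left-to-right order inside the Hebrew context.
def line = hebrewName + "," + embedLtr(employeeId)

// The markers are invisible but add two characters to the field.
println line.length()
```

Because the markers are real characters in the payload, they count towards field lengths and should be removed again if the receiving system compares values exactly.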

Conclusion

While the use of a BOM can be beneficial in certain situations, it is not always required or preferred. In some cases, such as HTTP responses and source files of scripting languages, including a BOM can cause unforeseen issues. Therefore, it is crucial to evaluate the specific requirements and compatibility of the systems and tools involved before deciding to add a BOM to text files. A BOM should also be used carefully in case of fixed-width files, as it adds extra bytes at the start of the file which might shift the first field and cause issues with the interpreting software.
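
For receivers that cannot handle a BOM, the marker can be removed again before further processing. The following is a minimal sketch, with `stripBom` as a hypothetical helper name; it also shows the three extra bytes that a UTF-8 BOM adds in front of a fixed-width record.

```groovy
// Hypothetical helper: removes a single leading U+FEFF if present.
String stripBom(String text) {
    text?.startsWith("\uFEFF") ? text.substring(1) : text
}

def record = "\uFEFFABC"

// With the BOM the record is 6 bytes in UTF-8, without it only 3.
int withBomLength = record.getBytes("UTF-8").length
int withoutBomLength = stripBom(record).getBytes("UTF-8").length

println "with BOM: ${withBomLength} bytes, without BOM: ${withoutBomLength} bytes"
```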