Hello all. This is my first blog post, so please be kind.

In this blog post, I will give a code example in C# of how to read diagnosis files from SAP HANA using SQL, and I will also document its shortcomings. On top of explaining how to correctly read the data, I will explain how it works internally.

It's pretty technical; I originally wrote this for myself as code comments before turning it into this blog post. Complex code requires comments!

Feature / Use case

SAP HANA logs some of its diagnostic data as regular files in the Linux file system. Files like indexserver_alert_hxehost.trc are examples of these diagnostic traces.

These files can be downloaded through SSH, but they can also be downloaded or viewed from the HANA cockpit (database explorer) or from the HANA studio.

The SAP HANA database also provides a way to read trace files using only SQL, which greatly simplifies retrieving them. The content of these files can be accessed through the public view M_TRACEFILE_CONTENTS. This is the mechanism both the cockpit and the studio use to view and download the files.

I use this view to retrieve the content of a SQL trace file, as part of my SQL profiler for SAP HANA, named Winterly SQL Profiler. The SQL trace file contains system-generated messages (a connection was opened, a command was executed), but it also contains user-generated content: the SQL commands themselves. These SQL commands may be in any language and can contain special Unicode characters; think of Chinese characters, Arabic, French accented characters, and so on. My software needs to read the text and stream it: when new data is added, it retrieves only the new data from where it left off, parses the text, and presents it to the user in a readable way. It reads the SQL trace file and shows the SQL commands in real time.

Issue

At first sight, retrieving data from M_TRACEFILE_CONTENTS looks straightforward. However, if you don't do it properly, you end up with corrupted text for anything except the basic Latin characters of 7-bit ASCII.
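For illustration, here is a minimal naive read, with hypothetical host and file names, that looks correct at first glance but produces exactly this corruption for any non-ASCII character:

using System;
using System.Data.Odbc;

class NaiveRead
{
    static void Main()
    {
        // Hypothetical DSN; connect with whatever ODBC setup you already use for HANA.
        var conn = new OdbcConnection("DSN=HXE;UID=myuser;PWD=mypassword");
        conn.Open();

        var cmd = conn.CreateCommand();
        cmd.CommandText = @"select content from ""PUBLIC"".""M_TRACEFILE_CONTENTS""
            where host = 'hxehost' and file_name = 'indexserver_alert_hxehost.trc'
            order by offset asc";

        var reader = cmd.ExecuteReader();
        while (reader.Read())
            Console.Write(reader.GetString(0)); // non-ASCII characters arrive scrambled here

        conn.Dispose();
    }
}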

Here is the link to the M_TRACEFILE_CONTENTS documentation on the SAP Help Portal. I think the documentation is lacking, and I lost a lot of time on this, so I wanted to share my knowledge and help the next developers who need to use this view.

The main issue is that the view was created to return binary data, not text. This is apparent from the way the data is split into chunks of 1,000 bytes, and from the OFFSET column, which tracks your position in the file and lets you seek within it.

Because the view was designed to handle binary files, yet returns the data as an NVARCHAR, SAP HANA scrambles the binary data before returning the result to the consumer. This scrambling corrupts anything that is not plain ASCII.

Unfortunately, most diagnosis files in the trace folder are byte-compatible with ASCII, until they are not, and you start getting errors.

Explanation

You must start from the fact that this view never returns text. It only returns and handles binary data.

To be able to return binary data from a view that exposes it as NVARCHAR, the database translates the binary data into characters.

The data returned by the database is always binary, but it is interpreted as Latin-1 to turn it into characters. These characters are then encoded in UTF-8 before being handed to the consumer as part of the NVARCHAR field.

In the case of a binary file, it's easy to understand that the returned characters are unrelated characters. Read an MP3 song through the view and you'll see text; it's obvious that your MP3 song is not text.

However, in the case of a UTF-8 encoded text file, similarities between UTF-8 and Latin-1 make the basic format appear compatible. This is misleading, as special and extended characters are NOT compatible. It just so happens that 7-bit ASCII characters (a, b, c, ...) have the same numerical values in UTF-8 and ASCII, so a file will appear readable when you take the query result directly.

Take for example a binary file we're trying to fetch that is actually a UTF-8 text file. Its content is "è 1 A" (shown here with spaces; the file itself contains none). Let's say for the example that there is no Unicode BOM at the beginning of the file.
Its underlying binary data is 0xC3 0xA8 0x31 0x41.

The data is transposed as if it were a Latin-1 encoded text file. The binary bytes 0xC3 0xA8 0x31 0x41 are presented as the UNRELATED characters Ã ¨ 1 A in Latin-1 / ISO/IEC 8859-1. Notice that the 1 and A characters come through fine, but the special character è has been corrupted.

These unrelated Latin-1 characters are then converted to UTF-8, because the database presents them as an NVARCHAR(1000) and all NVARCHARs are presented in Unicode.
Their characters stay the same, Ã ¨ 1 A, but their underlying data changes: 0xC3 0x83 0xC2 0xA8 0x31 0x41. Notice that the data has grown, as UTF-8 is variable-length.
This data is sent to the SQL client as the UTF-8 string "Ã ¨ 1 A". Parsing the underlying data as binary is not enough, because that data has already been mangled by the variable-length encoding.

We must do the reverse to get our original binary data back.
First, take the Unicode characters and convert them to Latin-1. The content Ã ¨ 1 A stays the same, but we get our data 0xC3 0xA8 0x31 0x41 back.
Then parse/interpret this data as UTF-8, because we know this binary data is UTF-8. You get back the original content è 1 A.
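To make the round trip concrete, here is a small self-contained sketch of the scrambling and unscrambling just described, using the same "è 1 A" example:

using System;
using System.Text;

class RoundTripDemo
{
    static void Main()
    {
        // The file's real content "è1A" as UTF-8 bytes: 0xC3 0xA8 0x31 0x41.
        byte[] fileBytes = { 0xC3, 0xA8, 0x31, 0x41 };
        var latin1 = Encoding.GetEncoding("iso-8859-1");

        // What HANA does: interpret the raw bytes as Latin-1 characters ("Ã¨1A")...
        string scrambled = latin1.GetString(fileBytes);
        // ...then encode those characters in UTF-8 for the NVARCHAR result.
        byte[] wireBytes = Encoding.UTF8.GetBytes(scrambled);
        Console.WriteLine(BitConverter.ToString(wireBytes));    // C3-83-C2-A8-31-41

        // The reverse: decode the received bytes as UTF-8, re-encode them as Latin-1
        // to recover the original file bytes, then read those bytes as UTF-8 text.
        byte[] recovered = latin1.GetBytes(Encoding.UTF8.GetString(wireBytes));
        Console.WriteLine(BitConverter.ToString(recovered));    // C3-A8-31-41
        Console.WriteLine(Encoding.UTF8.GetString(recovered));  // è1A
    }
}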

Unicode issue to be aware of

When decoding the original bytes as UTF-8, one must take care of Unicode's variable length and of HANA's fixed splitting of the binary data.
Because this was designed for reading binary files, the data is presented to the consumer in rows of 1,000 bytes (or fewer), as if opening a binary file in a hex editor.
HANA splits the data after 1,000 bytes regardless of whether that makes sense for the content. With variable-length UTF-8, one must take care not to parse a 1,000-byte row as a complete unit. If the underlying data contains a UTF-8 character that takes more than one byte, and this character is split across two database rows, reinterpreting just the first part or just the second part as complete UTF-8 would break that character.
.NET has a solution for this: use Encoding.UTF8.GetDecoder() to turn bytes into text, instead of Encoding.UTF8.GetString.
Encoding.UTF8.GetString assumes the byte array is complete; if it sees incomplete data, it falls back to best-fit, replacement, or an exception, and the returned text will not match what was received.
The better solution, Decoder.GetChars, treats the byte array as a stream of bytes: if it receives only partial data, it holds on to the partial bytes and waits for its next call to see whether the following bytes allow it to produce a character. This completely handles the issue of the split byte array.
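Below is a small sketch of that failure mode and of the Decoder fix, simulating a two-byte character split across two database rows:

using System;
using System.Text;

class SplitDecodeDemo
{
    static void Main()
    {
        // "è" is two UTF-8 bytes (0xC3 0xA8); simulate HANA splitting them across two rows.
        byte[] row1 = { 0x41, 0xC3 };  // "A" plus the first half of "è"
        byte[] row2 = { 0xA8, 0x42 };  // the second half of "è" plus "B"

        // GetString treats each chunk as complete and emits replacement characters:
        // prints "A" + two U+FFFD replacement characters + "B" (corrupted).
        Console.WriteLine(Encoding.UTF8.GetString(row1) + Encoding.UTF8.GetString(row2));

        // A Decoder keeps its state between calls and waits for the missing byte:
        Decoder decoder = Encoding.UTF8.GetDecoder();
        var sb = new StringBuilder();
        foreach (byte[] row in new[] { row1, row2 })
        {
            char[] chars = new char[decoder.GetCharCount(row, 0, row.Length)];
            decoder.GetChars(row, 0, row.Length, chars, 0);
            sb.Append(chars);
        }
        Console.WriteLine(sb.ToString()); // "AèB"
    }
}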

As for the other conversion, HANA transposes the binary one-to-one to a character, so no special care is needed when unscrambling it.
The transposed Latin-1 is converted to UTF-8 and presented to the client, but this conversion is complete within every line/row, so again there is no need to handle this case.
Thus special care must only be taken when decoding the original data, depending on its format.

 

Example code

This shows how to properly unscramble the data returned by the SAP HANA database, recover the file's original bytes, and then read them as a UTF-8 text file.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Data.Odbc;
using System.IO;

namespace mytestprogram
{
    class Program
    {
        // Keep the decoder as a global variable, as it needs to keep its state between rows.
        static Decoder utf8decoder = Encoding.UTF8.GetDecoder();

        static void Main(string[] args)
        {
            var conn = HanaOdbcHelper.GetConnection();
            conn.Open();

            var cmd = conn.CreateCommand();
            cmd.CommandText = @"select offset,to_binary(content) as CONTENTBIN from ""PUBLIC"".""M_TRACEFILE_CONTENTS"" where host='xxxxx' and file_name='xxx' and offset>0 order by offset asc";
            var readerTrace = cmd.ExecuteReader();
            while (readerTrace.Read())
            {
                // Get the data as bytes. We could also use Encoding.UTF8.GetBytes if we received
                // a string, though that allocates memory that will immediately be discarded.
                byte[] encodedbinary = readerTrace.GetFieldValue<byte[]>(1);
                // Convert the encoded binary data back from UTF-8 to Latin-1 to get the original binary value.
                byte[] rawbinary = Encoding.Convert(Encoding.UTF8, Encoding.GetEncoding("iso-8859-1"), encodedbinary);
                // Now that we have the file's original bytes: I assume all trace files created
                // by HANA are UTF-8, so decode this binary into readable characters.
                char[] decodedchar = new char[utf8decoder.GetCharCount(rawbinary, 0, rawbinary.Length)];
                // Unlike Encoding.GetString(), the decoder keeps its state: if it is presented partial data
                // (an incomplete multi-byte sequence) and can't make a full char, it will produce that char
                // in the next decoding pass.
                utf8decoder.GetChars(rawbinary, 0, rawbinary.Length, decodedchar, 0);
                // Then turn this array of chars into a usable string.
                string currentLine = new string(decodedchar);

                // Note that the console might not be Unicode, so characters like Japanese might not
                // appear correctly in the console even though they were properly decoded in the string.
                Console.WriteLine(currentLine);
            }

            Console.ReadLine();

            conn.Dispose();
        }
    }
}

Here I applied a shortcut in the SQL query: to_binary. Since we need the raw bytes to convert from UTF-8 to Latin-1, getting the textual representation of the UTF-8 does nothing but slow things down, allocating a useless string that then has to be garbage collected.
We could call Encoding.UTF8.GetBytes(text) to retrieve the binary, but the HANA database has a convenient function to do just that. It directly returns UTF-8 encoded binary that we can convert to Latin-1 without an intermediate string.
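For comparison, here is a sketch of the string-based variant without to_binary, assuming the query selects the plain content column instead; it yields the same bytes at the cost of an intermediate string:

using System.Data.Odbc;
using System.Text;

static class StringVariant
{
    // Drop-in replacement for the GetFieldValue<byte[]> line of the example above,
    // when the query is "select offset, content from ..." instead of to_binary(content).
    static byte[] ReadEncodedBinary(OdbcDataReader readerTrace)
    {
        string content = readerTrace.GetString(1);  // intermediate string allocation
        return Encoding.UTF8.GetBytes(content);     // same bytes as to_binary(content)
    }
}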

You can also find a Python example on the SAP help page, though I wonder whether it handles multi-byte characters split across rows; probably not.

Other notes

Congratulations! Now you've got uncorrupted text, but it's still split into 1,000-byte chunks. You need to concatenate it, handle newlines, and, if you're doing text streaming / live monitoring of a trace file like me, handle the offset position properly (in bytes, not number of characters) to retrieve the next data set. This is yet another hurdle you need to handle.
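As a sketch of that last point (the polling structure and names here are my own, not an official API): advance the next read position by the number of raw bytes recovered, never by the number of decoded characters, since they differ as soon as a multi-byte character appears.

using System;
using System.Text;

class OffsetTrackingSketch
{
    static long nextOffset = 0;
    static readonly Decoder utf8decoder = Encoding.UTF8.GetDecoder();
    static readonly StringBuilder buffer = new StringBuilder();

    // Called for each row of a polling query such as:
    //   select offset, to_binary(content) ... where offset >= ? order by offset asc
    static void ConsumeRow(long offset, byte[] encodedbinary)
    {
        // Unscramble as before to recover the raw file bytes.
        byte[] rawbinary = Encoding.Convert(Encoding.UTF8,
            Encoding.GetEncoding("iso-8859-1"), encodedbinary);

        char[] chars = new char[utf8decoder.GetCharCount(rawbinary, 0, rawbinary.Length)];
        utf8decoder.GetChars(rawbinary, 0, rawbinary.Length, chars, 0);
        buffer.Append(chars);

        // Advance by the RAW byte count (not chars.Length) so the next poll
        // resumes exactly where this row ended in the file.
        nextOffset = offset + rawbinary.Length;
    }
}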

 

Final thought / Improvement

I lost so much time on this issue; I hope future developers can save some time with this blog post.

Currently, the way HANA provides diagnosis files is really bad because it's misleading. It appears to work until it doesn't, and then you get to search everywhere for the problem.

M_TRACEFILE_CONTENTS provides a view that could be read from any GUI / point-and-click software: a Crystal Reports editor, a report file, a bash script, database management software, etc.
BUT it fails to provide the data as readable text without an insane amount of inside knowledge and features that only a full programming language can provide to unscramble the data. So from my point of view, it fails miserably.

 

I would suggest that SAP employees create these alternatives instead.

Create a new function dedicated to retrieving binary files.

Name it GetTraceBinary, returning data of type VARBINARY or BLOB. Its inputs would be host, file name, offset, and length. By providing the offset and length, you still allow seeking within the file. I would create it as a function, because it's clear it's not for consumer apps like Crystal Reports. Developers using the function can properly get the byte array and read it as UTF-8 or whatever format they know it to be. It's not misleading anymore.

 

Create a new view that is dedicated to reading text files.

The new view M_TRACEFILE_TEXT would read the data and, like any good text editor, automatically detect whether it's ASCII, UTF-8, or UTF-16. It would provide the real text, unlike the current version.

It would then present text separated by newline or by number of characters, not split every 1,000 bytes. The newline type (Windows, Unix, Mac) would be automatically detected.

As the data would be split per newline, the data would still be streamable for software like mine:
Select TextContent from M_TRACEFILE_TEXT where ... and linenumber > 10 or where charposition >= 357376.
For me it's very important that the data is streamable, i.e. that you can retrieve only part of the file in small reads.

If any of the automatic detection fails (encoding, newline), the developer could still use GetTraceBinary to handle the case manually with full flexibility.

 