Hana Smart Data Integration – The File Adapter
The SDI File Adapter is one of the adapters preinstalled with every SDI Agent. Its purpose is to make any file, fixed width or delimited, in every character encoding and all formats available to Hana as a virtual table to select from as easy as from any other Hana table..
Once the File Adapter is made visible in Hana it provides multiple tables out of the box, e.g. FILEDIRECTORY to get a list of all available files or FILECONTENT with file content as one large BLOB. As this might allow any Hana user with the create-remote-source privilege to read files, there needs to be some security built around.
When deploying the adapter in the agent, a root directory is to be set. The adapter does not allow to reach outside of this root directory, ever. Therefore sensible files like /etc/passwd or C:\Windows cannot be read. Second is some password (or AccessToken as it is labeled) to make sure the person creating the remote source in Hana is allowed to use this adapter.
The administrator installing the agent decides on these settings by deploying the FileAdapter and changing the preferences of this adapter in the AgentConfig tool.
As one virtual table specifies the file format definition and multiple files can have the matching structure, there is no 1:1 relationship between files and virtual tables. So what should the adapter return when browsing the list of available virtual tables? To provide this browsing option the adapter does scan a File Format Root Directory for files of name *.cfg and each file contains the format information. This is the second directory to configure.
With this information the remote source and the virtual tables can be created and the file content selected from.
When creating a remote source the root and file format directories are set by default to the directories of the adapter and can be further restricted here. The idea is, maybe one remote source points to the plan data subdirectory, another to the employee data (both sub directories of the adapter root directory) and users are allowed to use either the one or the other remote source. Just in case more fine grained security is needed.
The AccessToken has to match the token entered in the adapter preferences as security measure.
Alternatively the remote source can be created via SQL as well
CREATE REMOTE SOURCE “FileSource” ADAPTER “FileAdapter” AT LOCATION AGENT “agent_local”
CONFIGURATION ‘<?xml version=”1.0″ encoding=”UTF-8″?>
WITH CREDENTIAL TYPE ‘PASSWORD’ USING
The built-in tables
The following tables are provided by the adapter always. These are supposed to help building applications using the File Adapter.
- FILEDIRECTORY: List all files of the remote source, including the sub directories. Supports pushdown of LIKE and = on the PATH and NAME column.
- FILECONTENT: Allows to read a single file as is. It requires the PATH and NAME column to be restricted to a single file using a where clause and returns the entire file content as one BLOB.
- FILECONTENTTEXT: As above but this time the file content is returned as NCLOB. That means the file is read with a specific character encoding. This can either be the default – based on the Byte Order Mark of the data file or the OS default – or provided as dataprovisioning parameter. In this example a ASCII file is read with UTF32 codepage, hence producing wrong characters.
- FILECONTENTROWS: Again returns a BLOB but this time one row per line. A line is supposed to end with a \n (ASCII newline) character/byte.
- CODEPAGES: Returns the list of all supported values for codepages by querying the adapter’s JVM installation.
- LOCALE: Returns the locales the JVM does support.
Defining file formats
One option of creating a virtual table for a given file format is executing the create virtual table statement manually and providing the format information as additional metadata. Below is an example of the syntax and it also lists all possible parameters. Only some of them are mandatory, most can be omitted to leave them at a useful default.
create virtual table “v_plan” at “FileSource”.”<NULL>”.”<NULL>”.”v_plan”
REMOTE PROPERTY ‘dataprovisioning_parameters’=
‘<?xml version=”1.0″ encoding=”UTF-8″ standalone=”yes”?>
<Parameter name=”DATEFORMAT”>dd. MMMM yyyy</Parameter>
<Parameter name=”SECONDDATEFORMAT”>yyyy.MM.dd HH:mm:ss</Parameter>
<Parameter name=”TIMESTAMPFORMAT”>yyyy.MM.dd HH:mm:ss</Parameter>
<Parameter name=”COLUMN”>PLAN_AMOUNT;decimal (15, 2)</Parameter>
The other option is to create a .cfg file inside the Directory of file format definitions or a sub directory thereof. The content of the file are logically the same. The file has to have a first line with a description and then lists all parameters as key value pairs. Above example as cfg file would look like:
Fileformat to read a CSV file with plan data
DATEFORMAT=dd. MMMM yyyy
COLUMN=PLAN_AMOUNT;decimal (15, 2)
At the end both versions have the same endresult, a virtual table with all these parameters as virtual table attributes. The only difference is with the cfg files the browsing shows something and then the virtual table is created, whereas in the first option the table is created directly.
Note: Changing the cfg file after creating the virtual table has no effect. These parameters are really copied into the Hana data dictionary.
In case the file is a fixed width file, so the column start and end position is always the same, the position information is needed in addition.
Fixed width demo
COLUMN=PLAN_AMOUNT;decimal (15, 2)
Parameters in detail
- FORCE_FILENAME_PATTERN, FORCE_DIRECTORY_PATTERN: A user might execute a simple “select * from virtualtable” without any where clause on directory and name of file. In that case every single file in the root directory and subdirectories will be read and parsed according to this virtual table format definitions. That might take a while a produce many errors. But if the virtual table does match to files in a certain directory, directory tree or to certain file names only, it makes sense to hard wire this information into the virtual table directly. Examples:
- FORCE_DIRECTORY_PATTERN=/usr/sap/FileAdapter/FileServer/plandata: Reads all files in that directory only
- FORCE_DIRECTORY_PATTERN=/usr/sap/FileAdapter/FileServer/plandata/%: Reads all files in that directory and subdirectories
- FORCE_DIRECTORY_PATTERN=/usr/sap/FileAdapter/FileServer/plan%: Reads all files in directories that start with “plan”
- FORCE_FILENAME_PATTERN=plan%.txt: Reads files like plan20120101.txt
- FORCE_DIRECTORY_PATTERN=/usr/sap/FileAdapter/FileServer/plandata FORCE_FILENAME_PATTERN=plan%.txt: Files inside the directory and matching the provided name pattern
- FORMAT: Is either CSV (default) or FIXED
- CODEPAGE: The character encoding the file is read with. By default the operating system default is used. In case the file has a Byte Order Mark this codepage is used always. Valid values of the Java installation can be found by creating a virtual table for CODEPAGE and querying its contents.
- ROW_DELIMITER: A character sequence indicating the end of a row in the file. In case these are non-printable characters they can be provided encoded as \d65536 or \xFFFF or as Unicode notation \u+FFFF. Alternatively the typical \r and \n is supported as well. Examples:
- \n Unix standard
- \r\n Windows standard
- \d13\d10 Windows standard but characters provided as decimal number
- \x0D\x0A Windows standard but characters provided as hex number
- SKIP_HEADER_LINES: In case the file has a header, the number of lines to be skipped is entered here.
- ERROR_ON_COLUMNCOUNT: By default a row with less columns then defined is considered okay. By setting this parameter to true, it is expected that all rows of the file have as many columns as defined.
- LOCALE: The decimal and date conversion from the strings in the file into native numbers or dates might be locale specific. For example the text “3,1415” is a perfectly nice value for PI but in the German notation where the comma is the decimal separator. With an English locale it would be a large integer number 31415. Same thing with dates in case the month or day name is used. “14. Oktober 2000” with locale German is fine, for all other languages it is not. Again, valid values for the locale can be found by querying a virtual table based on the LOCALE table of the adapter.
- DATEFORMAT, TIMEFORMAT, SECONDDATEFORMAT, TIMESTAMPFORMAT: The file format can use these datatypes for the date/time related values. Each can have a different format string. The syntax of the format string is the Java SimpleDateFormat syntax.
- COLUMN: multiple entries each consist of the columnname;datatype where the datatype is any normal Hana datatype.
- COLUMN_DELIMITER: The character sequence indicating the next column. If non-printable characters are used then either one of the encodings will work \d65536, \xFFFF or \u+FFFF
- ; Meaning the ; is the column separator, so a line looks like 2000;IT Costs;435.55
- | using the pipe character as delimiter
- \d09 using an ASCII tab character as delimiter
- TEXT_QUOTES: Sometimes text data is enclosed in quotes so a column delimiter inside the text does not break the format. The line 2000;IT Costs; software related only;435.55 would appear as 4 columns as the text contains a semicolon as well. If the file was created with quotes like 2000;”IT Costs; software related only”;435.55 then there is no such issue but the file parser needs to act more carefully and not just search for the next column delimiter. It needs to check if the text is inside the text quote character or outside.
- ESCAPE_CHAR: Another way to deal with inline special characters is to escape those, like in 2000;IT Costs\; software related only;435.55. Here the \ char is an escape char and indicates that the subsequent character is to be taken literally, not as e.g. column delimiter.
- TEXT_QUOTES_ESCAPE_CHAR: Leaves the question on how to make quotes appear inside the text, like in IT Costs; “software related” only. One option the file creator might have used is to simply use the global escape character: 2000;”IT Costs; \”software related\” only”;435.55. Another popular method is the have the quotes escaped by another quote like in 2000;”IT Costs; “”software related”” only”;435.55. In that case both the TEXT_QUOTE=” and the TEXT_QUOTE_ESCAPE_CHAR=” are set to the ” character.
- QUOTED_TEXT_CONTAIN_ROW_DELIMITER: The default value is false and tells the parser regardless of any quotes or escape characters the text inside a row does never contain the row delimiter character sequence. In this case the parser can break the file into rows much faster, it needs to search for the character sequence only, and only the column parsing has to consider the escape and quote characters. If set to true parsing will be slower.
Fixed Width only
- COLUMNSSTARTENDPOSITION: In a fixed width file the column positions need to be specified for each column. Example:
- 0-3;4-11;12-37;38-53 defines that the first 4 characters are the first column, the next 8 contain the data for the second column etc. Columns need to be in the proper order.
- 0;4;12;38 is equivalent to above example, the last column ends with the line end
- 0;4;12;38-53 can be used as well. In fact every single column can either specify the start and end position or just the start.
- ROWLENGTH: In fixed with files there does not need to be a row delimiter. Often the file has some and then they need to be stripped away. Examples assuming the last data character is at index 53 as specified above:
- ROWLENGTH=56 ROW_DELIMITER= would work for a file that has a row delimiter. The payload text ranges from 0..53 and hence 54 characters long plus two characters for \r\n. But the last column does not contain the \r\n as it is told to end at index 53.
- ROWLENGTH=54 ROW_DELIMITER=\r\n is equivalent to above. Each row is expected to be 54 characters plus 2 characters long. The main advantage of this notation is that COLUMNSSTARTENDPOSITION=0;4;12;38 would work as well as the trailing \r\n is stripped away. In the previous example the the last column would start at 38 but end at index 55 due to rowlength=56 and hence contain the \r\n characters in the last column.
Using the virtual table
When the virtual table is created it has 5 additional columns:
- PATH: The path name of the file being read
- NAME: The file name this row belongs to
- ROWNUMBER: The line number inside the file
- PARTITION: All rows get evenly distributed into 10 partitions (ROWNUMBER modulo 10)
- ERROR: A string with all column parsing errors
Executing a select * from virtualtable will return all rows of all files the virtual table is allowed to read, that is all files or the matching ones based on FORCE_DIRECTORY_PATTERN and FORCE_FILENAME_PATTERN settings. Therefore each row tells what file it belongs to in the PATH and NAME columns. As this text can be quite lengthy, it might be a good idea to not select those columns unless the information is needed of course.
This is actually an important capability of the adapter, it supports projections. Therefore only those columns actually needed are transferred, parsed, converted into the proper datatype. The “top” clause is pushed down to the adapter as well, it would read the first lines and then finish.
The adapter also supports filters on PATH and NAME hence all where clauses with equal, in-lists or like are pushed to the adapter and it can navigate to the matching files directly.
When a parsing error occurs the reading does continue still but the information about all errors in that row are put into the ERROR column. Things like “Cannot convert to seconddate using the provided format and locale (abcd)”. Hence reading the file without checking the ERROR column is dangerous.
In order to simplify finding issues in the file, the line number is returned in the ROWNUMBER column. In above example the file starts with row number 2 as the first line is the header line.
In case the file is very large and has lots of expensive data conversions the virtual table provides a PARTITION column to read a subset of the file only. This can be used to conveniently execute for example two insert…select statements in parallel, each processing half the file.
insert into target select * from v_us_county_census where partition in (0,1,2,3,4);
insert into target select * from v_us_county_census where partition not in (0,1,2,3,4);
Since the flowgraph supports partitioned reading as well, the file adapter works well together with that feature.